Giving Python Algorithms the Wings of Performance: pybind11 in Practice

Author: jesonxiang (向乾彪), backend development engineer at Tencent TEG

1. Background

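AI algorithm development, and training in particular, is done mostly in Python today, and mainstream AI frameworks such as TensorFlow and PyTorch expose rich Python interfaces. As the saying goes: life is short, I use Python. However, Python is a dynamic, interpreted language that lacks a mature JIT solution, and its multi-core concurrency is limited in compute-intensive scenarios, so it is hard for Python alone to meet the demands of high-performance real-time serving. Performance-critical scenarios still call for C/C++. Yet requiring algorithm engineers to write all online inference services in C++ is very costly: it hurts development efficiency and wastes resources. A lightweight way to combine Python with core code written partly in C++ would therefore preserve both development efficiency and service performance. This article describes how pybind11 is used to accelerate Python algorithms in Tencent Ads multimedia AI, along with lessons learned along the way.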

2. Industry solutions

2.1 The native approach

Python officially provides the Python/C API, which makes it possible to write Python libraries in C. Here is a piece of code to get a feel for it:

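#define PY_SSIZE_T_CLEAN
#include <Python.h>

static PyObject *
spam_system(PyObject *self, PyObject *args)
{
    const char *command;
    int sts;

    /* Parse the single string argument passed from Python. */
    if (!PyArg_ParseTuple(args, "s", &command))
        return NULL;
    sts = system(command);
    /* Wrap the C int result in a Python int object. */
    return PyLong_FromLong(sts);
}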

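As you can see, the porting cost is very high: every basic type has to be converted by hand into the binding types wrapped by the CPython interpreter. It is therefore no surprise that the official Python documentation also recommends third-party solutions [1].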

2.2 Cython

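Cython mainly bridges Python and C, making it easy to write C extensions for Python. The Cython compiler translates Python code into C code, and that C code can call the Python/C API. In essence, Cython is Python with C data types. It is used by NumPy and by our company's tRPC-Python framework.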

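Drawbacks:

Cython-specific syntax (cdef, etc.) has to be inserted by hand, so porting and reuse are costly
Extra files such as setup.py and *.pyx are needed to turn your Python code into faster C code
The extent of its C++ support is uncertain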

2.3 SWIG

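SWIG mainly solves the problem of interaction between other high-level languages and C and C++. It supports more than a dozen programming languages, including Java, C#, JavaScript, and Python. To use it, you define the interface in a *.i file and then run a tool to generate the cross-language glue code. Because it supports so many languages, however, its performance on the Python side is not very good.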

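It is worth noting that TensorFlow also used SWIG to wrap its Python interface early on. Precisely because SWIG suffered from poor performance, complicated builds, and obscure, hard-to-read binding code, TensorFlow switched from SWIG to pybind11 in 2019 [2].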

2.4 Boost.Python

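Boost, the open-source library widely used in C++, also provides Python binding functionality. It uses macros and metaprogramming to simplify calls to the Python API. Its biggest drawback is the dependency on the huge Boost library, which weighs down compilation and dependency management; using it only for Python bindings feels like using an anti-aircraft gun to swat a mosquito.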

2.5 pybind11

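pybind11 can be thought of as a slimmed-down library modeled on Boost.Python that provides only Python and C++ binding functionality; compared with Boost.Python it has considerable advantages in binary size and compilation speed. Its C++ support is very good, and it builds on C++11 and uses many of its newer features, which is perhaps where the 11 suffix in the name comes from.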

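pybind11 infers type information through compile-time introspection in C++, minimizing the tedious boilerplate traditionally needed when extending Python modules. It also implements automatic conversion to Python for common data types such as STL data structures, smart pointers, classes, overloaded functions, and instance methods; functions can accept and return custom data types by value, pointer, or reference.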

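Features:

Lightweight and single-purpose: focused on C++ and Python binding, with concise glue code
Good compatibility with common C++ data types such as STL containers and Python libraries such as numpy, with no manual conversion cost
Header-only: no extra code generation is needed, bindings are established at compile time, and binary size is reduced
Supports modern C++ features, handles C++ overloading and inheritance, and is convenient to debug
Well-maintained official documentation; used in many well-known open-source projects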

“Talk is cheap. Show me the code.” Three lines of code are enough to create a binding:

PYBIND11_MODULE (libcppex, m) {
    m.def("add", [](int a, int b) -> int { return a + b; });
}

3. Calling C++ from Python

3.1 Starting with the GIL

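GIL (Global Interpreter Lock): at any moment only one thread in a process may use the interpreter, so multithreading cannot truly exploit multiple cores. Because the thread holding the lock releases the GIL automatically when it reaches blocking operations such as I/O, multithreading is still effective for I/O-bound services. For CPU-bound work, however, only one thread can actually compute at a time, and the impact on performance is easy to imagine.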

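It must be noted that the GIL is not a flaw of the Python language itself; it is a thread-safety lock introduced by CPython, the default interpreter. When we say Python has a GIL, that applies specifically to the CPython interpreter. So if we can find a way around the GIL, will we get a good speedup? The answer is yes. One option is to switch to another interpreter, but alternatives without a GIL, such as Jython, are poorly compatible with mature C extension libraries and costly to maintain (PyPy, a popular alternative, still has a GIL of its own). The other option is to wrap the compute-intensive code in a C/C++ extension and release the GIL while it executes.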
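The behavior described above can be observed with a small experiment. The following is a minimal sketch, not part of the original benchmark: the helper names run_in_threads, blocking_wait, and cpu_bound are made up for illustration. Threads that block (here simulated with time.sleep, which releases the GIL) overlap almost perfectly, whereas on a standard CPython build the CPU-bound loop is not expected to get faster when spread over threads.

```python
import threading
import time


def run_in_threads(target, thread_count):
    """Run `target` in `thread_count` threads and return the elapsed seconds."""
    threads = [threading.Thread(target=target) for _ in range(thread_count)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


def blocking_wait():
    # time.sleep releases the GIL while waiting, much like a blocking I/O call.
    time.sleep(0.2)


def cpu_bound():
    # Pure-Python arithmetic needs the GIL for every bytecode it executes.
    total = 0
    for i in range(200_000):
        total += i * i
    return total


if __name__ == "__main__":
    print("4 waiting threads:   %.2fs" % run_in_threads(blocking_wait, 4))
    print("1 CPU-bound thread:  %.2fs" % run_in_threads(cpu_bound, 1))
    print("4 CPU-bound threads: %.2fs" % run_in_threads(cpu_bound, 4))
```

The exact timings depend on the machine and interpreter build, so no expected output is given; the point is only the contrast between the waiting case and the CPU-bound case.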

3.2 Optimizing Python algorithm performance

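pybind11 provides an interface for releasing the GIL manually on the C++ side. We therefore only need to rewrite the compute-intensive code in C++, release the GIL before it runs, and re-acquire it afterwards; this unlocks multi-core computation for Python algorithms. Besides explicitly calling the interface to release the GIL, the compute-intensive code can also be handed off to other C++ threads to run asynchronously inside C++, which likewise sidesteps the GIL and uses multiple cores.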

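The following example, one million spherical-distance calculations between two cities, compares performance before and after moving to a C++ extension:

C++ side: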

#include <math.h>
#include <stdio.h>
#include <time.h>
#include <pybind11/pybind11.h>


namespace py = pybind11;

const double pi = 3.1415926535897932384626433832795;

double rad(double d) {
    return d * pi / 180.0;
}

double geo_distance(double lon1, double lat1, double lon2, double lat2, int test_cnt) {
    py::gil_scoped_release release;     // Release the GIL for the duration of this function

    double a, b, s;
    double distance = 0;

    for (int i = 0; i < test_cnt; i++) {
        double radLat1 = rad(lat1);
        double radLat2 = rad(lat2);
        a = radLat1 - radLat2;
        b = rad(lon1) - rad(lon2);
        s = pow(sin(a/2),2) + cos(radLat1) * cos(radLat2) * pow(sin(b/2),2);
        distance = 2 * asin(sqrt(s)) * 6378 * 1000;
    }

    // The GIL is re-acquired automatically when `release` goes out of scope.
    return distance;
}

PYBIND11_MODULE (libcppex, m) {
    m.def("geo_distance", &geo_distance, R"pbdoc(
        Compute geography distance between two places.
    )pbdoc");
}

Python caller:

import sys
import time
import math
import threading

from libcppex import *

def rad(d):
    return d * 3.1415926535897932384626433832795 / 180.0


def geo_distance_py(lon1, lat1, lon2, lat2, test_cnt):
    distance = 0

    for i in range(test_cnt):
        radLat1 = rad(lat1)
        radLat2 = rad(lat2)
        a = radLat1 - radLat2
        b = rad(lon1) - rad(lon2)
        s = math.sin(a/2)**2 + math.cos(radLat1) * math.cos(radLat2) * math.sin(b/2)**2
        distance = 2 * math.asin(math.sqrt(s)) * 6378 * 1000

    print(distance)
    return distance


def call_cpp_extension(lon1, lat1, lon2, lat2, test_cnt):
    res = geo_distance(lon1, lat1, lon2, lat2, test_cnt)
    print(res)
    return res


if __name__ == '__main__':
    threads = []
    test_cnt = 1000000
    test_type = sys.argv[1]
    thread_cnt = int(sys.argv[2])
    start_time = time.time()

    for i in range(thread_cnt):
        if test_type == 'p':
            t = threading.Thread(target=geo_distance_py,
                args=(113.973129, 22.599578, 114.3311032, 22.6986848, test_cnt,))
        elif test_type == 'c':
            t = threading.Thread(target=call_cpp_extension,
                args=(113.973129, 22.599578, 114.3311032, 22.6986848, test_cnt,))
        threads.append(t)
        t.start()

    for thread in threads:
        thread.join()

    print('calc time = %d' % int((time.time() - start_time) * 1000))
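For reference, the quantity that both loops recompute on every iteration is the haversine great-circle distance. The sketch below isolates it as a single function; the names haversine_m and EARTH_RADIUS_M are introduced here for illustration, and the radius constant is the same 6378 km used in the benchmark code.

```python
import math

EARTH_RADIUS_M = 6378 * 1000  # same radius constant as the benchmark code


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two (lon, lat) points given in degrees."""
    lat1_r, lat2_r = math.radians(lat1), math.radians(lat2)
    a = lat1_r - lat2_r
    b = math.radians(lon1) - math.radians(lon2)
    s = math.sin(a / 2) ** 2 + math.cos(lat1_r) * math.cos(lat2_r) * math.sin(b / 2) ** 2
    return 2 * math.asin(math.sqrt(s)) * EARTH_RADIUS_M


if __name__ == "__main__":
    # Same coordinates as the benchmark above.
    print(haversine_m(113.973129, 22.599578, 114.3311032, 22.6986848))
```

A quick sanity check is that two points on the equator 90 degrees of longitude apart come out as a quarter of the circumference, that is pi / 2 times the radius.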

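Performance comparison:

Single-thread time: Python 1500 ms, C++ 8 ms

10-thread time: Python 15152 ms, C++ 16 ms

CPU utilization:

△ Python threads cannot compute on multiple cores in parallel, so utilization is equivalent to a single core

△ C++ can saturate all 10 CPU cores of the DevCloud machine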

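Conclusion:

For compute-intensive code, simply rewriting it in C++ already brings a solid performance gain. Combined with releasing the GIL in multithreaded use, all cores are put to work, throughput scales close to linearly with the number of cores, and resource utilization rises substantially. In practice Python multiprocessing can also be used to exploit multiple cores, but with models growing ever larger, often to tens of GB, memory consumption becomes excessive, and the context-switching overhead of frequent process switches plus the performance gap of the language itself still leave it well behind the C++ extension approach.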

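(Note: the test demo above is on GitHub at https://github.com/jesonxiang/cpp_extension_pybind11. The test environment was a container with 10 CPU cores; interested readers are welcome to verify the performance themselves.)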

3.3 Build environment

Build command:

g++ -Wall -shared -std=gnu++11 -O2 -fvisibility=hidden -fPIC -I./ perfermance.cc -o libcppex.so `python3-config --cflags --ldflags --libs`

If the Python environment is not configured correctly, this command may fail with an error.

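Here the Python dependencies are specified automatically via python3-config --cflags --ldflags --libs; you can run this command on its own first to verify that the Python dependencies are configured correctly. python3-config requires the Python 3 development package (python3-devel), which can be installed with: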

yum install python3-devel
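The same check can also be made from inside the interpreter that will load the extension. This is a small sketch using only the standard sysconfig module; the helper names python_include_dir and has_python_headers are illustrative and not part of the original project.

```python
import os
import sysconfig


def python_include_dir():
    """Directory where this interpreter expects its C headers."""
    return sysconfig.get_paths()["include"]


def has_python_headers():
    """True if Python.h is present, which suggests the development package is installed."""
    return os.path.isfile(os.path.join(python_include_dir(), "Python.h"))


if __name__ == "__main__":
    print("include dir:", python_include_dir())
    print("Python.h found:", has_python_headers())
    # File-name suffix that compiled extension modules use for this interpreter.
    print("extension suffix:", sysconfig.get_config_var("EXT_SUFFIX"))
```

If Python.h is reported as missing, installing the development package as shown above is the likely fix.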

4. Calling Python from C++

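pybind11 is generally used to wrap C++ code with a Python interface, but the reverse direction, calling Python from C++, is supported as well. Just include the <pybind11/embed.h> header; internally this works by embedding the CPython interpreter. It is simple to use and quite readable, closely resembling direct Python calls. For example, calling some methods on a numpy array looks like this: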

// C++
pyVec = pyVec.attr("transpose")().attr("reshape")(pyVec.size());

# Python
pyVec = pyVec.transpose().reshape(pyVec.size)

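The following takes the high-performance C++ GPU frame-extraction .so we developed as an example. Besides exposing a frame-extraction interface to Python, it also needs to call back into Python to report extraction progress and deliver frame data.

Python-side callback interface: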

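import os

# DecoderWrapper is provided by the C++ frame-extraction extension module (import not shown).

def on_decoding_callback(task_id: str, progress: int):
    print('decoding callback, task id: %s, progress: %d' % (task_id, progress))

if __name__ == '__main__':
    decoder = DecoderWrapper()
    # Register this file and the callback's name so the C++ side can look it up.
    decoder.register_py_callback(os.getcwd() + '/decode_test.py',
         'on_decoding_callback')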

C++ side: registering the interface and calling back into Python:

#include <pybind11/embed.h>

namespace py = pybind11;

int DecoderWrapper::register_py_callback(const std::string &py_path,
                        const std::string &func_name) {
    int ret = 0;
    const std::string &pyPath = py_get_module_path(py_path);
    const std::string &pyName = py_get_module_name(py_path);
    SoInfo("get py module name: %s, path: %s", pyName.c_str(), pyPath.c_str());

    // Hold the GIL for the rest of this function; it is released automatically on return.
    py::gil_scoped_acquire acquire;

    try {
        py::object sys = py::module::import("sys");
        sys.attr("path").attr("append")(py::str(pyPath.c_str())); // Directory containing the Python script
        // py::module::import throws py::error_already_set if the import fails.
        py::module pyModule = py::module::import(pyName.c_str());
        if (py::hasattr(pyModule, func_name.c_str())) {
            py_callback = pyModule.attr(func_name.c_str());
        } else {
            ret = PYTHON_FUNC_NOT_FOUND_ERROR;
        }
    } catch (const py::error_already_set &e) {
        LogError("Failed to load pyModule: %s", e.what());
        ret = PYTHON_FILE_NOT_FOUND_ERROR;
    }

    return ret;
}

int DecoderListener::on_decoding_progress(std::string &task_id, int progress) {
    if (py_callback) {  // Assumes py_callback is a py::object; true once a callback is registered
        try {
            py::gil_scoped_acquire acquire;  // Released again at the end of the try block
            py_callback(task_id, progress);
        } catch (py::error_already_set const &PythonErr) {
            LogError("caught Python exception: %s", PythonErr.what());
        } catch (const std::exception &e)  {
            LogError("caught exception: %s", e.what());
        } catch (...) {
            LogError("caught unknown exception");
        }
    }
    return 0;
}
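To make the lookup performed by register_py_callback easier to follow, here is a rough pure-Python equivalent of the same steps: add the script's directory to sys.path, import the module by name, and fetch the function by attribute name. The function name load_callback is hypothetical, and this sketch only mirrors the logic; it is not how the C++ library is implemented.

```python
import importlib
import os
import sys


def load_callback(py_path, func_name):
    """Import the module stored at `py_path` and return its attribute `func_name`.

    Returns None if the module has no such attribute; raises ImportError
    if the module itself cannot be imported.
    """
    module_dir = os.path.dirname(os.path.abspath(py_path))
    module_name = os.path.splitext(os.path.basename(py_path))[0]
    if module_dir not in sys.path:
        sys.path.append(module_dir)  # same effect as sys.path.append in the C++ code
    importlib.invalidate_caches()    # make sure newly created files are visible
    module = importlib.import_module(module_name)
    return getattr(module, func_name, None)


if __name__ == "__main__":
    # Assumes a decode_test.py with on_decoding_callback in the current directory.
    callback = load_callback(os.path.join(os.getcwd(), "decode_test.py"),
                             "on_decoding_callback")
    if callback is not None:
        callback("task-1", 50)
```

Seen this way, the two error codes in the C++ version correspond to the ImportError case and the None case respectively.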

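5. Data type conversion

5.1 Classes and member functions

Binding a class and its member functions takes two steps, because an object has to be constructed first: the first step wraps the constructor, and the second registers how member functions are accessed. Static methods and member variables can also be bound with def_static and def_readwrite; see the official documentation [3] for details.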

#include <iostream>
#include <string>
#include <pybind11/pybind11.h>

class Hello
{
public:
    Hello(){}
    void say( const std::string s ){
        std::cout << s << std::endl;
    }
};

PYBIND11_MODULE(py2cpp, m) {
    m.doc() = "pybind11 example";

    pybind11::class_<Hello>(m, "Hello")
        .def(pybind11::init<>())  // Constructor binding; it must match a constructor of the C++ class, otherwise construction from Python fails
        .def( "say", &Hello::say );
}

/*
Calling from Python:
c = py2cpp.Hello()
c.say("hello")
*/

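5.2 STL containers

pybind11 supports automatic conversion of STL containers; when STL containers are involved, just include the additional header <pybind11/stl.h>. The automatic conversions include: std::vector<>/std::list<>/std::array<> to Python list; std::set<>/std::unordered_set<> to Python set; std::map<>/std::unordered_map<> to dict. Conversions for std::pair<> and std::tuple<> are already provided by <pybind11/pybind11.h>. Note that these automatic conversions copy the container contents on every call.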

#include <iostream>
#include <vector>
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>

class ContainerTest {
public:
    ContainerTest() {}
    // The Python list is converted (copied) into a std::vector<int> on each call.
    void Set(std::vector<int> v) {
        mv = v;
    }
private:
    std::vector<int> mv;
};

PYBIND11_MODULE( py2cpp, m ) {
    m.doc() = "pybind11 example";
    pybind11::class_<ContainerTest>(m, "CTest")
        .def( pybind11::init<>() )
        .def( "set", &ContainerTest::Set );
}

/*
Calling from Python:
c = py2cpp.CTest()
c.set([1,2,3])
*/

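5.3 Passing bytes and string types

pybind11 converts a C++ std::string to a Python 3 str by decoding it as UTF-8. If string-typed protobuf data, which is arbitrary binary, is passed from C++ to Python this way, the following error occurs: “UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 0: invalid start byte”.

Solution: pybind11 provides py::bytes, a binding type for non-text data: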

m.def("return_bytes",
    []() {
        std::string s("\xba\xd0\xba\xd0");  // Not valid UTF-8
        return py::bytes(s);  // Return the data without transcoding
    }
);
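On the Python side, the difference between the two types can be reproduced without any C++ at all. The following sketch uses the same four bytes as the example above; try_decode_utf8 is a helper name introduced only for this illustration.

```python
raw = b"\xba\xd0\xba\xd0"  # same bytes as the C++ example; not valid UTF-8


def try_decode_utf8(data):
    """Return the decoded text, or None if `data` is not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    print(try_decode_utf8(raw))          # None: decoding fails, as in the error above
    print(try_decode_utf8(b"pybind11"))  # pybind11
    print(len(raw), raw[0] == 0xBA)      # 4 True: a bytes object keeps every byte as-is
```

Data received as bytes can then be handed to whatever parser expects binary input, such as a protobuf message's ParseFromString method.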

5.4 Smart pointers

std::unique_ptr: pybind11 supports direct conversion:

std::unique_ptr<Example> create_example() { return std::unique_ptr<Example>(new Example()); }
m.def("create_example", &create_example);

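std::shared_ptr: take special care never to hand out a raw pointer to an object that is owned by a std::shared_ptr. The get_child function below triggers a memory access error (such as a segmentation fault) when called from Python, because pybind11 cannot tell that the Child is already managed by a shared_ptr and wraps the raw pointer in a second, independent one. The fix is to have get_child return std::shared_ptr<Child> instead.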

class Child { };

class Parent {
public:
   Parent() : child(std::make_shared<Child>()) { }
   Child *get_child() { return child.get(); }  /* Hint: ** DON'T DO THIS ** */
private:
    std::shared_ptr<Child> child;
};

PYBIND11_MODULE(example, m) {
    py::class_<Child, std::shared_ptr<Child>>(m, "Child");

    py::class_<Parent, std::shared_ptr<Parent>>(m, "Parent")
       .def(py::init<>())
       .def("get_child", &Parent::get_child);
}

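5.5 Converting between cv::Mat and numpy

When returning frame-extraction results to Python, C++ cv::Mat and Python numpy arrays have to be bound by hand, because pybind11 does not currently convert the cv::Mat data structure automatically. The conversion code is as follows: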

/*
 Python->C++ Mat
*/
cv::Mat numpy_uint8_3c_to_cv_mat(py::array_t<uint8_t>& input) {
    if (input.ndim() != 3)
        throw std::runtime_error("3-channel image must be 3 dims ");

    py::buffer_info buf = input.request();
    // The Mat wraps the numpy buffer without copying, so the array must outlive it.
    cv::Mat mat(buf.shape[0], buf.shape[1], CV_8UC3, (uint8_t*)buf.ptr);
    return mat;
}

/*
 C++ Mat ->numpy
*/
py::array_t<uint8_t> cv_mat_uint8_3c_to_numpy(cv::Mat& input) {
    py::array_t<uint8_t> dst = py::array_t<uint8_t>({ input.rows,input.cols,3}, input.data);
    return dst;
}

5.6 zero copy

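Cross-language calls generally incur performance overhead, especially when passing large blocks of data. pybind11 therefore also supports passing data by address through the buffer protocol, which avoids copying large blocks in memory and improves performance considerably.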

class Matrix {
public:
    Matrix(size_t rows, size_t cols) : m_rows(rows), m_cols(cols) {
        m_data = new float[rows*cols];
    }
    float *data() { return m_data; }
    size_t rows() const { return m_rows; }
    size_t cols() const { return m_cols; }
private:
    size_t m_rows, m_cols;
    float *m_data;
};

py::class_<Matrix>(m, "Matrix", py::buffer_protocol())
   .def_buffer([](Matrix &m) -> py::buffer_info {
        return py::buffer_info(
            m.data(),                               /* Pointer to buffer */
            sizeof(float),                          /* Size of one scalar */
            py::format_descriptor<float>::format(), /* Python struct-style format descriptor */
            2,                                      /* Number of dimensions */
            { m.rows(), m.cols() },                 /* Buffer dimensions */
            { sizeof(float) * m.cols(),             /* Strides (in bytes) for each index */
              sizeof(float) }
        );
    });
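What "passing the address" means on the Python side can be illustrated with the standard library alone. In this sketch an array.array stands in for the C++ Matrix (an assumption made purely for illustration, as is the helper name scale_in_place): a memoryview over it shares the same memory, so writes through the view are visible in the original object without any copy.

```python
from array import array


def scale_in_place(buffer_like, factor):
    """Multiply every float in a buffer-protocol object by `factor`, without copying."""
    view = memoryview(buffer_like)  # shares memory with the exporting object
    for i in range(len(view)):
        view[i] = view[i] * factor
    view.release()


if __name__ == "__main__":
    data = array("f", [1.0, 2.0, 3.0, 4.0])  # flat float buffer, similar in spirit to Matrix::m_data
    scale_in_place(data, 2.0)
    print(data.tolist())  # [2.0, 4.0, 6.0, 8.0]
```

An object bound with def_buffer as above exposes the same protocol, so consumers such as memoryview or numpy should be able to wrap it in a similar way.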

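6. Deployment and industry adoption

We have deployed the approach above in Ads multimedia AI services such as color extraction and high-performance GPU frame extraction, with very good speedups. Across the industry, most AI frameworks on the market, such as TensorFlow, PyTorch, Alibaba X-Deep Learning, and Baidu PaddlePaddle, use pybind11 to wrap their C++ code with Python interfaces, so its stability and performance have been widely validated.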

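7. Conclusion

With the AI field broadly focused on cutting costs and improving efficiency, making full use of existing resources and raising utilization is key. This article presented a convenient way to improve the performance and CPU utilization of Python algorithm services, one that has produced good results in production.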

8. Appendix

[1]https://docs./3/extending/index.html#extending-index

[2]https://github.com/tensorflow/community/blob/master/rfcs/20190208-pybind11.md#replace-swig-with-pybind11

[3]https://pybind11./en/stable/advanced/cast/index.html
