# Running MobilenetSSD on RK3399 with multi-threaded heterogeneous compute: the frame rate finally exceeds 10
Tags: RK3399, AI, Tengine

Author: xukejing | Published: 2019-08-06 | Reads: 4097
## 1 Test background

The RK3399 is a big.LITTLE SoC that integrates two Cortex-A72 cores and four Cortex-A53 cores, plus an on-chip Mali-T860 GPU with OpenCL support. Last year we tested MobilenetSSD on it with OpenCV's DNN module (see "Deep neural networks help you watch your pet wreck the house", 《深度神经网络帮您监视宠物拆家》). Because the algorithm was not optimized, the result was disappointing: only about 2 frames per second. Although that project won a prize in the MXCHIP/Alibaba Cloud contest, I was not satisfied and kept looking for a better approach.

Recently I picked up a new neural-network inference engine called Tengine, which is optimized for ARM embedded devices. It can schedule all of a platform's compute cores heterogeneously, so the hardware's full capability is used. Tengine also ships ready-made AI applications, including image detection, face recognition and speech recognition, so you can run AI workloads without AI expertise. It supports common convolutional networks such as SqueezeNet, MobileNet, AlexNet and ResNet, along with optimization strategies like layer fusion and 8-bit quantization, and its HCL kernel library, tuned for each CPU microarchitecture, squeezes the full performance out of ARM SoCs.

Tengine also supports INT8 computation. Compared with FP32, accuracy is nearly unchanged, yet INT8 brings a 2-3x performance gain and cuts memory use to one third. An FP32 model needs no modification: simply enable Tengine's quantization switch and it quantizes and computes at run time, and thanks to mixed-precision computation most models keep their accuracy, which is very practical.

In this article we run the Tengine-based MobilenetSSD example on an RK3399 platform, a Leez P710 board, and see what frame rate it can reach.

![LeezP710](https://cf03.ickimg.com/bbsimages/201908/db95578af190bcda7eeba0ea1e910933.jpg "LeezP710")

## 2 Environment setup and test

1. Install the dependencies:

```shell
sudo apt install libprotobuf-dev protobuf-compiler
sudo apt install libopencv-dev
sudo apt install libboost-all-dev libgoogle-glog-dev
sudo apt install scons
```

2. Clone Tengine with git.

![Clone Tengine](https://cf03.ickimg.com/bbsimages/201908/7acf9f608b5a6e9aa34b8816465cfa65.jpg "Clone Tengine")

3. For GPU acceleration, also clone the ARM Compute Library (ACL).

![Clone ACL](https://cf03.ickimg.com/bbsimages/201908/3b1d2cb04be838ac799d3083567565b8.jpg "Clone ACL")

4. Build ACL before Tengine, checking out the 18.05 release branch first.

![Build ACL](https://cf03.ickimg.com/bbsimages/201908/ac571d94d761c6fe377f019854145fe3.jpg "Build ACL")

5. Edit the makefile so the Tengine build is configured with ACL support.

![Edit the makefile](https://cf03.ickimg.com/bbsimages/201908/0e9188d3e83faaa6bd01dbafd4444131.jpg "Edit the makefile")

6. Build Tengine.

![Build Tengine](https://cf03.ickimg.com/bbsimages/201908/e51ed5e90c6a7f289f50485f1e1ab1fd.jpg "Build Tengine")

7. Run the two bench examples, which classify a cat photo with SqueezeNet and MobileNet. With the default FP32 settings they already finish in roughly 60 ms per frame, which is quite good.

![Bench examples](https://cf03.ickimg.com/bbsimages/201908/4c0529cc9758040ee1b5773b61ea28ea.jpg "Bench examples")

## 3 Multi-threaded heterogeneous CPU/GPU test

When we want the relative positions of cats, dogs and other objects within one image, we need the MobileNet SSD detection algorithm.
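Steps 2 through 6 of the setup are shown only as screenshots. The commands behind them can be sketched roughly as follows; the repository URLs, the scons options, and the makefile variables are assumptions based on the Tengine and ACL projects of that era, not read from the screenshots:

```shell
# Clone Tengine and the ARM Compute Library (repository URLs assumed)
git clone https://github.com/OAID/Tengine.git
git clone https://github.com/ARM-software/ComputeLibrary.git

# Build ACL first, on the 18.05 release branch (scons options assumed)
cd ComputeLibrary
git checkout v18.05
scons -j4 neon=1 opencl=1 embed_kernels=1 os=linux arch=arm64-v8a build=native
cd ..

# Enable ACL support in Tengine's build configuration, then build.
# In that era this meant editing makefile.config, e.g. uncommenting
# CONFIG_ACL_GPU and pointing ACL_ROOT at ../ComputeLibrary (assumed).
cd Tengine
make -j4
```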
The test below creates three threads, cpu_thread_a53, cpu_thread_a72 and gpu_thread. The GPU computes in FP16, while the CPUs compute in INT8.

The console output of the test:

![GPU result](https://cf03.ickimg.com/bbsimages/201908/5580dd635d5baef9612c061e61b0abf4.jpg "GPU result")

![CPU result](https://cf03.ickimg.com/bbsimages/201908/61fad15216f62181a6007c226c3bdc0b.jpg "CPU result")

As the screenshots show, the GPU takes about 200 ms per loop, the two A72 cores about 217 ms, and the four A53 cores about 314 ms, which adds up to 12.7692 frames per second: the frame rate finally exceeds 10.

The annotated output image below shows that detection accuracy is still quite good.

![Detection result](https://cf03.ickimg.com/bbsimages/201908/d027d2f769d584d81a562acbb176a619.jpg "Detection result")

Finally, here is the code used for the multi-threaded CPU/GPU test:

```cpp
#include <unistd.h>
#include <sys/time.h>

#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <iomanip>
#include <sstream>
#include <string>
#include <vector>
#include <memory>
#include <thread>
#include <mutex>
#include <atomic>

#include "opencv2/imgproc/imgproc.hpp"
#include "opencv2/highgui/highgui.hpp"
#include "tengine_c_api.h"
#include "cpu_device.h"

#define DEF_PROTO "models/MobileNetSSD_deploy.prototxt"
#define DEF_MODEL "models/MobileNetSSD_deploy.caffemodel"
#define DEF_IMAGE "tests/images/ssd_dog.jpg"

std::atomic<int> thread_done;
int thread_num = 0;
std::string image_file;
std::string cpu_2A72_save_name = "cpu_2A72";
std::string cpu_4A53_save_name = "cpu_4A53";
std::string gpu_save_name = "gpu";
int cpu_2A72_repeat_count = 120;
int gpu_repeat_count = 105;
int cpu_4A53_repeat_count = 95;
volatile int barrier = 1;

struct Box
{
    float x0;
    float y0;
    float x1;
    float y1;
    int class_idx;
    float score;
};

void get_input_data_ssd(std::string& image_file, float* input_data, int img_h, int img_w)
{
    cv::Mat img = cv::imread(image_file);
    if(img.empty())
    {
        std::cerr << "Failed to read image file " << image_file << ".\n";
        return;
    }
    cv::resize(img, img, cv::Size(img_w, img_h));
    img.convertTo(img, CV_32FC3);
    float* img_data = (float*)img.data;
    int hw = img_h * img_w;
    float mean[3] = {127.5, 127.5, 127.5};
    // HWC -> CHW, with mean subtraction and scaling
    for(int h = 0; h < img_h; h++)
    {
        for(int w = 0; w < img_w; w++)
        {
            for(int c = 0; c < 3; c++)
            {
                input_data[c * hw + h * img_w + w] = 0.007843 * (*img_data - mean[c]);
                img_data++;
            }
        }
    }
}

void post_process_ssd(std::string& image_file, float threshold, float* outdata, int num,
                      const std::string& save_name)
{
    const char* class_names[] = {"background", "aeroplane", "bicycle", "bird", "boat",
                                 "bottle", "bus", "car", "cat", "chair",
                                 "cow", "diningtable", "dog", "horse", "motorbike",
                                 "person", "pottedplant", "sheep", "sofa", "train",
                                 "tvmonitor"};
    cv::Mat img = cv::imread(image_file);
    int raw_h = img.size().height;
    int raw_w = img.size().width;
    std::vector<Box> boxes;
    int line_width = raw_w * 0.005;
    printf("detect result num: %d \n", num);
    for(int i = 0; i < num; i++)
    {
        if(outdata[1] >= threshold)
        {
            Box box;
            box.class_idx = outdata[0];
            box.score = outdata[1];
            box.x0 = outdata[2] * raw_w;
            box.y0 = outdata[3] * raw_h;
            box.x1 = outdata[4] * raw_w;
            box.y1 = outdata[5] * raw_h;
            boxes.push_back(box);
            printf("%s\t:%.0f%%\n", class_names[box.class_idx], box.score * 100);
            printf("BOX:( %g , %g ),( %g , %g )\n", box.x0, box.y0, box.x1, box.y1);
        }
        outdata += 6;
    }
    for(int i = 0; i < (int)boxes.size(); i++)
    {
        Box box = boxes[i];
        cv::rectangle(img, cv::Rect(box.x0, box.y0, (box.x1 - box.x0), (box.y1 - box.y0)),
                      cv::Scalar(255, 255, 0), line_width);
        std::ostringstream score_str;
        score_str << box.score;
        std::string label = std::string(class_names[box.class_idx]) + ": " + score_str.str();
        int baseLine = 0;
        cv::Size label_size = cv::getTextSize(label, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);
        cv::rectangle(img,
                      cv::Rect(cv::Point(box.x0, box.y0 - label_size.height),
                               cv::Size(label_size.width, label_size.height + baseLine)),
                      cv::Scalar(255, 255, 0), CV_FILLED);
        cv::putText(img, label, cv::Point(box.x0, box.y0), cv::FONT_HERSHEY_SIMPLEX, 0.5,
                    cv::Scalar(0, 0, 0));
    }
    cv::imwrite(save_name, img);
    std::cout << "======================================\n";
    std::cout << "[DETECTED IMAGE SAVED]:\t" << save_name << "\n";
    std::cout << "======================================\n";
}

void run_test(graph_t graph, const std::string& save_name, int repeat_count, float* avg_time)
{
    int img_h = 300;
    int img_w = 300;
    int img_size = img_h * img_w * 3;
    float* input_data = (float*)malloc(sizeof(float) * img_size);
    int node_idx = 0;
    int tensor_idx = 0;
    tensor_t input_tensor = get_graph_input_tensor(graph, node_idx, tensor_idx);
    if(input_tensor == nullptr)
    {
        printf("Get input node failed : node_idx: %d, tensor_idx: %d\n", node_idx, tensor_idx);
        return;
    }
    int dims[] = {1, 3, img_h, img_w};
    set_tensor_shape(input_tensor, dims, 4);
    int ret_prerun = prerun_graph(graph);
    if(ret_prerun < 0)
    {
        std::printf("prerun failed\n");
        return;
    }
    if(save_name == "gpu")
    {
        // warm up, then release the other threads
        get_input_data_ssd(image_file, input_data, img_h, img_w);
        set_tensor_buffer(input_tensor, input_data, img_size * 4);
        run_graph(graph, 1);
        barrier = 0;
    }
    else
    {
        // wait until the GPU thread has finished warming up
        while(barrier)
            ;
    }
    struct timeval t0, t1;
    float total_time = 0.f;
    for(int i = 0; i < repeat_count; i++)
    {
        get_input_data_ssd(image_file, input_data, img_h, img_w);
        gettimeofday(&t0, NULL);
        set_tensor_buffer(input_tensor, input_data, img_size * 4);
        run_graph(graph, 1);
        gettimeofday(&t1, NULL);
        float mytime = (float)((t1.tv_sec * 1000000 + t1.tv_usec) -
                               (t0.tv_sec * 1000000 + t0.tv_usec)) / 1000;
        total_time += mytime;
    }
    std::cout << "--------------------------------------\n";
    std::cout << save_name << ": repeat " << repeat_count << " times, avg "
              << total_time / repeat_count << " ms all: " << total_time << "ms\n";
    (*avg_time) = total_time / repeat_count;
    tensor_t out_tensor = get_graph_output_tensor(graph, 0, 0);    // "detection_out"
    int out_dim[4];
    get_tensor_shape(out_tensor, out_dim, 4);
    float* outdata = (float*)get_tensor_buffer(out_tensor);
    int num = out_dim[1];
    float show_threshold = 0.5;
    post_process_ssd(image_file, show_threshold, outdata, num, save_name + "_save.jpg");
    release_graph_tensor(out_tensor);
    release_graph_tensor(input_tensor);
    postrun_graph(graph);
    free(input_data);
    destroy_graph(graph);
}

void cpu_thread_a53(const char* pproto_file, const char* pmodel_file, float* avg_time)
{
    graph_t graph = create_graph(NULL, "caffe", pproto_file, pmodel_file);
    if(graph == nullptr)
    {
        thread_done++;
        return;
    }
    if(set_graph_device(graph, "a53") < 0)
    {
        std::cerr << "set device a53 failed\n";
    }
    run_test(graph, cpu_4A53_save_name, cpu_4A53_repeat_count, avg_time);
    thread_done++;
}

void cpu_thread_a72(const char* pproto_file, const char* pmodel_file, float* avg_time)
{
    graph_t graph = create_graph(NULL, "caffe", pproto_file, pmodel_file);
    if(graph == nullptr)
    {
        thread_done++;
        return;
    }
    if(set_graph_device(graph, "a72") < 0)
    {
        std::cerr << "set device a72 failed\n";
    }
    run_test(graph, cpu_2A72_save_name, cpu_2A72_repeat_count, avg_time);
    thread_done++;
}

void gpu_thread(const char* pproto_file, const char* pmodel_file, float* avg_time)
{
    graph_t graph = create_graph(NULL, "caffe", pproto_file, pmodel_file);
    if(graph == nullptr)
    {
        thread_done++;
        return;
    }
    set_graph_device(graph, "acl_opencl");
    run_test(graph, gpu_save_name, gpu_repeat_count, avg_time);
    thread_done++;
}

int main(int argc, char* argv[])
{
    const std::string root_path;
    std::string proto_file;
    std::string model_file;
    const char* pproto_file;
    const char* pmodel_file;
    int res;
    while((res = getopt(argc, argv, "p:m:i:hd:")) != -1)
    {
        switch(res)
        {
            case 'p':
                proto_file = optarg;
                break;
            case 'm':
                model_file = optarg;
                break;
            case 'i':
                image_file = optarg;
                break;
            case 'h':
                std::cout << "[Usage]: " << argv[0] << " [-h]\n"
                          << "   [-p proto_file] [-m model_file] [-i image_file]\n";
                return 0;
            default:
                break;
        }
    }
    if(proto_file.empty())
    {
        proto_file = root_path + DEF_PROTO;
        std::cout << "proto file not specified, using " << proto_file << " by default\n";
    }
    if(model_file.empty())
    {
        model_file = root_path + DEF_MODEL;
        std::cout << "model file not specified, using " << model_file << " by default\n";
    }
    if(image_file.empty())
    {
        image_file = root_path + DEF_IMAGE;
        std::cout << "image file not specified, using " << image_file << " by default\n";
    }

    /* do not let GPU run concat */
    setenv("GPU_CONCAT", "0", 1);
    /* using GPU fp16 */
    setenv("ACL_FP16", "1", 1);
    /* default CPU device using 0,1,2,3 */
    setenv("TENGINE_CPU_LIST", "2", 1);
    /* using fp32 or int8 */
    setenv("KERNEL_MODE", "2", 1);

    // init tengine
    init_tengine();
    if(request_tengine_version("0.9") < 0)
        return -1;

    // collect avg_time for each case
    float avg_times[3] = {0., 0., 0.};

    // thread 0 for cpu 2A72
    const struct cpu_info* p_info = get_predefined_cpu("rk3399");
    int a72_list[] = {4, 5};
    set_online_cpu((struct cpu_info*)p_info, a72_list, sizeof(a72_list) / sizeof(int));
    create_cpu_device("a72", p_info);

    // thread 2 for cpu 4A53
    const struct cpu_info* p_info1 = get_predefined_cpu("rk3399");
    int a53_list[] = {0, 1, 2, 3};
    set_online_cpu((struct cpu_info*)p_info1, a53_list, sizeof(a53_list) / sizeof(int));
    create_cpu_device("a53", p_info1);

#if 0
    if(load_model(model_name, "caffe", proto_file.c_str(), model_file.c_str()) < 0)
    {
        std::cout << "load model failed\n";
        return 1;
    }
    std::cout << "load model done!\n";
#endif

    pproto_file = proto_file.c_str();
    pmodel_file = model_file.c_str();
    thread_done = 0;
    std::thread* t0 = new std::thread(cpu_thread_a72, pproto_file, pmodel_file, &avg_times[0]);
    thread_num++;
    // thread 1 for gpu + 1 A53
    std::thread* t1 = new std::thread(gpu_thread, pproto_file, pmodel_file, &avg_times[1]);
    thread_num++;
    std::thread* t2 = new std::thread(cpu_thread_a53, pproto_file, pmodel_file, &avg_times[2]);
    thread_num++;
    t0->join();
    delete t0;
    t1->join();
    delete t1;
    t2->join();
    delete t2;
    std::cout << "thread_done: " << (int)thread_done << "\ntest done\n";
    std::cout << "=================================================\n";
    std::cout << " Using 3 thread, MSSD performance "
              << (1000. / avg_times[0] + 1000. / avg_times[1] + 1000. / avg_times[2])
              << " FPS \n";
    std::cout << "=================================================\n";
    release_tengine();
    return 0;
}
```
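The combined frame rate printed by main() is simply the sum of each thread's 1000/avg_ms. Plugging in the rounded per-loop times quoted above (200 ms, 217 ms, 314 ms) reproduces the reported figure to within measurement noise:

```shell
# Sum of per-thread frame rates: GPU + 2xA72 + 4xA53
awk 'BEGIN { printf "%.4f FPS\n", 1000/200 + 1000/217 + 1000/314 }'
# prints 12.7930 FPS (the article's 12.7692 used the exact measured averages)
```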
Original work. Reproduction without the rights holder's authorization is prohibited; see the site's reprint notice (转载须知) for details.