# Running MobilenetSSD on RK3399 with multi-threaded heterogeneous compute: the frame rate finally exceeds 10
Tags: RK3399, AI, Tengine

Author: xukejing | Published: 2019-08-06 | Reads: 4097
## 1 Test background

The RK3399 is a big.LITTLE SoC that integrates two Cortex-A72 cores and four Cortex-A53 cores, plus an on-chip Mali-T860 GPU with OpenCL support. Last year we tested MobilenetSSD on it with OpenCV's DNN module (see "Deep neural networks help you watch your pet wreck the house", 《深度神经网络帮您监视宠物拆家》). Because the algorithm was not optimized, the result was disappointing: only about 2 frames per second. Although that project won a prize in the MXCHIP/Alibaba Cloud contest, I was not satisfied and kept looking for a better approach.

Recently I picked up a new neural-network inference engine called Tengine, which is optimized for ARM embedded devices. It can schedule all of a platform's compute cores heterogeneously, so the hardware's full capability is used. Tengine also ships ready-made AI applications, including image detection, face recognition and speech recognition, so you can run AI workloads without AI expertise. It supports common convolutional networks such as SqueezeNet, MobileNet, AlexNet and ResNet, along with optimization strategies like layer fusion and 8-bit quantization, and its HCL kernel library, tuned for each CPU microarchitecture, squeezes the full performance out of ARM SoCs.

Tengine also supports INT8 computation. Compared with FP32, accuracy is nearly unchanged, yet INT8 brings a 2-3x performance gain and cuts memory use to one third. An FP32 model needs no modification: simply enable Tengine's quantization switch and it quantizes and computes at run time, and thanks to mixed-precision computation most models keep their accuracy, which is very practical.

In this article we run the Tengine-based MobilenetSSD example on an RK3399 platform, a Leez P710 board, and see what frame rate it can reach.

![LeezP710](https://cf03.ickimg.com/bbsimages/201908/db95578af190bcda7eeba0ea1e910933.jpg "LeezP710")

## 2 Environment setup and test

1. Install the dependencies:

```shell
sudo apt install libprotobuf-dev protobuf-compiler
sudo apt install libopencv-dev
sudo apt install libboost-all-dev libgoogle-glog-dev
sudo apt install scons
```

2. Clone Tengine with git.

![Clone Tengine](https://cf03.ickimg.com/bbsimages/201908/7acf9f608b5a6e9aa34b8816465cfa65.jpg "Clone Tengine")

3. For GPU acceleration, also clone the ARM Compute Library (ACL).

![Clone ACL](https://cf03.ickimg.com/bbsimages/201908/3b1d2cb04be838ac799d3083567565b8.jpg "Clone ACL")

4. Build ACL before Tengine, checking out the 18.05 release branch first.

![Build ACL](https://cf03.ickimg.com/bbsimages/201908/ac571d94d761c6fe377f019854145fe3.jpg "Build ACL")

5. Edit the makefile so the Tengine build is configured with ACL support.

![Edit the makefile](https://cf03.ickimg.com/bbsimages/201908/0e9188d3e83faaa6bd01dbafd4444131.jpg "Edit the makefile")

6. Build Tengine.

![Build Tengine](https://cf03.ickimg.com/bbsimages/201908/e51ed5e90c6a7f289f50485f1e1ab1fd.jpg "Build Tengine")

7. Run the two bench examples, which classify a cat photo with SqueezeNet and MobileNet. With the default FP32 settings they already finish in roughly 60 ms per frame, which is quite good.

![Bench examples](https://cf03.ickimg.com/bbsimages/201908/4c0529cc9758040ee1b5773b61ea28ea.jpg "Bench examples")

## 3 Multi-threaded heterogeneous CPU/GPU test

When we want the relative positions of cats, dogs and other objects within one image, we need the MobileNet SSD detection algorithm.
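Steps 2 through 6 of the setup are shown only as screenshots. The commands behind them can be sketched roughly as follows; the repository URLs, the scons options, and the makefile variables are assumptions based on the Tengine and ACL projects of that era, not read from the screenshots:

```shell
# Clone Tengine and the ARM Compute Library (repository URLs assumed)
git clone https://github.com/OAID/Tengine.git
git clone https://github.com/ARM-software/ComputeLibrary.git

# Build ACL first, on the 18.05 release branch (scons options assumed)
cd ComputeLibrary
git checkout v18.05
scons -j4 neon=1 opencl=1 embed_kernels=1 os=linux arch=arm64-v8a build=native
cd ..

# Enable ACL support in Tengine's build configuration, then build.
# In that era this meant editing makefile.config, e.g. uncommenting
# CONFIG_ACL_GPU and pointing ACL_ROOT at ../ComputeLibrary (assumed).
cd Tengine
make -j4
```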
The test below creates three threads, cpu_thread_a53, cpu_thread_a72 and gpu_thread. The GPU computes in FP16, while the CPUs compute in INT8.

The console output of the test:

![GPU result](https://cf03.ickimg.com/bbsimages/201908/5580dd635d5baef9612c061e61b0abf4.jpg "GPU result")

![CPU result](https://cf03.ickimg.com/bbsimages/201908/61fad15216f62181a6007c226c3bdc0b.jpg "CPU result")

As the screenshots show, the GPU takes about 200 ms per loop, the two A72 cores about 217 ms, and the four A53 cores about 314 ms, which adds up to 12.7692 frames per second: the frame rate finally exceeds 10.

The annotated output image below shows that detection accuracy is still quite good.

![Detection result](https://cf03.ickimg.com/bbsimages/201908/d027d2f769d584d81a562acbb176a619.jpg "Detection result")

Finally, here is the code used for the multi-threaded CPU/GPU test:

```cpp
#include <unistd.h>
#include <sys/time.h>

#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <iomanip>
#include <sstream>
#include <string>
#include <vector>
#include <memory>
#include <thread>
#include <mutex>
#include <atomic>

#include "opencv2/imgproc/imgproc.hpp"
#include "opencv2/highgui/highgui.hpp"
#include "tengine_c_api.h"
#include "cpu_device.h"

#define DEF_PROTO "models/MobileNetSSD_deploy.prototxt"
#define DEF_MODEL "models/MobileNetSSD_deploy.caffemodel"
#define DEF_IMAGE "tests/images/ssd_dog.jpg"

std::atomic<int> thread_done;
int thread_num = 0;
std::string image_file;
std::string cpu_2A72_save_name = "cpu_2A72";
std::string cpu_4A53_save_name = "cpu_4A53";
std::string gpu_save_name = "gpu";
int cpu_2A72_repeat_count = 120;
int gpu_repeat_count = 105;
int cpu_4A53_repeat_count = 95;
volatile int barrier = 1;

struct Box
{
    float x0;
    float y0;
    float x1;
    float y1;
    int class_idx;
    float score;
};

void get_input_data_ssd(std::string& image_file, float* input_data, int img_h, int img_w)
{
    cv::Mat img = cv::imread(image_file);
    if(img.empty())
    {
        std::cerr << "Failed to read image file " << image_file << ".\n";
        return;
    }
    cv::resize(img, img, cv::Size(img_w, img_h));
    img.convertTo(img, CV_32FC3);
    float* img_data = (float*)img.data;
    int hw = img_h * img_w;
    float mean[3] = {127.5, 127.5, 127.5};
    // HWC -> CHW, with mean subtraction and scaling
    for(int h = 0; h < img_h; h++)
    {
        for(int w = 0; w < img_w; w++)
        {
            for(int c = 0; c < 3; c++)
            {
                input_data[c * hw + h * img_w + w] = 0.007843 * (*img_data - mean[c]);
                img_data++;
            }
        }
    }
}

void post_process_ssd(std::string& image_file, float threshold, float* outdata, int num,
                      const std::string& save_name)
{
    const char* class_names[] = {"background", "aeroplane", "bicycle", "bird", "boat",
                                 "bottle", "bus", "car", "cat", "chair",
                                 "cow", "diningtable", "dog", "horse", "motorbike",
                                 "person", "pottedplant", "sheep", "sofa", "train",
                                 "tvmonitor"};
    cv::Mat img = cv::imread(image_file);
    int raw_h = img.size().height;
    int raw_w = img.size().width;
    std::vector<Box> boxes;
    int line_width = raw_w * 0.005;
    printf("detect result num: %d \n", num);
    for(int i = 0; i < num; i++)
    {
        if(outdata[1] >= threshold)
        {
            Box box;
            box.class_idx = outdata[0];
            box.score = outdata[1];
            box.x0 = outdata[2] * raw_w;
            box.y0 = outdata[3] * raw_h;
            box.x1 = outdata[4] * raw_w;
            box.y1 = outdata[5] * raw_h;
            boxes.push_back(box);
            printf("%s\t:%.0f%%\n", class_names[box.class_idx], box.score * 100);
            printf("BOX:( %g , %g ),( %g , %g )\n", box.x0, box.y0, box.x1, box.y1);
        }
        outdata += 6;
    }
    for(int i = 0; i < (int)boxes.size(); i++)
    {
        Box box = boxes[i];
        cv::rectangle(img, cv::Rect(box.x0, box.y0, (box.x1 - box.x0), (box.y1 - box.y0)),
                      cv::Scalar(255, 255, 0), line_width);
        std::ostringstream score_str;
        score_str << box.score;
        std::string label = std::string(class_names[box.class_idx]) + ": " + score_str.str();
        int baseLine = 0;
        cv::Size label_size = cv::getTextSize(label, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);
        cv::rectangle(img,
                      cv::Rect(cv::Point(box.x0, box.y0 - label_size.height),
                               cv::Size(label_size.width, label_size.height + baseLine)),
                      cv::Scalar(255, 255, 0), CV_FILLED);
        cv::putText(img, label, cv::Point(box.x0, box.y0), cv::FONT_HERSHEY_SIMPLEX, 0.5,
                    cv::Scalar(0, 0, 0));
    }
    cv::imwrite(save_name, img);
    std::cout << "======================================\n";
    std::cout << "[DETECTED IMAGE SAVED]:\t" << save_name << "\n";
    std::cout << "======================================\n";
}

void run_test(graph_t graph, const std::string& save_name, int repeat_count, float* avg_time)
{
    int img_h = 300;
    int img_w = 300;
    int img_size = img_h * img_w * 3;
    float* input_data = (float*)malloc(sizeof(float) * img_size);
    int node_idx = 0;
    int tensor_idx = 0;
    tensor_t input_tensor = get_graph_input_tensor(graph, node_idx, tensor_idx);
    if(input_tensor == nullptr)
    {
        printf("Get input node failed : node_idx: %d, tensor_idx: %d\n", node_idx, tensor_idx);
        return;
    }
    int dims[] = {1, 3, img_h, img_w};
    set_tensor_shape(input_tensor, dims, 4);
    int ret_prerun = prerun_graph(graph);
    if(ret_prerun < 0)
    {
        std::printf("prerun failed\n");
        return;
    }
    if(save_name == "gpu")
    {
        // warm up, then release the other threads
        get_input_data_ssd(image_file, input_data, img_h, img_w);
        set_tensor_buffer(input_tensor, input_data, img_size * 4);
        run_graph(graph, 1);
        barrier = 0;
    }
    else
    {
        // wait until the GPU thread has finished warming up
        while(barrier)
            ;
    }
    struct timeval t0, t1;
    float total_time = 0.f;
    for(int i = 0; i < repeat_count; i++)
    {
        get_input_data_ssd(image_file, input_data, img_h, img_w);
        gettimeofday(&t0, NULL);
        set_tensor_buffer(input_tensor, input_data, img_size * 4);
        run_graph(graph, 1);
        gettimeofday(&t1, NULL);
        float mytime = (float)((t1.tv_sec * 1000000 + t1.tv_usec) -
                               (t0.tv_sec * 1000000 + t0.tv_usec)) / 1000;
        total_time += mytime;
    }
    std::cout << "--------------------------------------\n";
    std::cout << save_name << ": repeat " << repeat_count << " times, avg "
              << total_time / repeat_count << " ms all: " << total_time << "ms\n";
    (*avg_time) = total_time / repeat_count;
    tensor_t out_tensor = get_graph_output_tensor(graph, 0, 0);    // "detection_out"
    int out_dim[4];
    get_tensor_shape(out_tensor, out_dim, 4);
    float* outdata = (float*)get_tensor_buffer(out_tensor);
    int num = out_dim[1];
    float show_threshold = 0.5;
    post_process_ssd(image_file, show_threshold, outdata, num, save_name + "_save.jpg");
    release_graph_tensor(out_tensor);
    release_graph_tensor(input_tensor);
    postrun_graph(graph);
    free(input_data);
    destroy_graph(graph);
}

void cpu_thread_a53(const char* pproto_file, const char* pmodel_file, float* avg_time)
{
    graph_t graph = create_graph(NULL, "caffe", pproto_file, pmodel_file);
    if(graph == nullptr)
    {
        thread_done++;
        return;
    }
    if(set_graph_device(graph, "a53") < 0)
    {
        std::cerr << "set device a53 failed\n";
    }
    run_test(graph, cpu_4A53_save_name, cpu_4A53_repeat_count, avg_time);
    thread_done++;
}

void cpu_thread_a72(const char* pproto_file, const char* pmodel_file, float* avg_time)
{
    graph_t graph = create_graph(NULL, "caffe", pproto_file, pmodel_file);
    if(graph == nullptr)
    {
        thread_done++;
        return;
    }
    if(set_graph_device(graph, "a72") < 0)
    {
        std::cerr << "set device a72 failed\n";
    }
    run_test(graph, cpu_2A72_save_name, cpu_2A72_repeat_count, avg_time);
    thread_done++;
}

void gpu_thread(const char* pproto_file, const char* pmodel_file, float* avg_time)
{
    graph_t graph = create_graph(NULL, "caffe", pproto_file, pmodel_file);
    if(graph == nullptr)
    {
        thread_done++;
        return;
    }
    set_graph_device(graph, "acl_opencl");
    run_test(graph, gpu_save_name, gpu_repeat_count, avg_time);
    thread_done++;
}

int main(int argc, char* argv[])
{
    const std::string root_path;
    std::string proto_file;
    std::string model_file;
    const char* pproto_file;
    const char* pmodel_file;
    int res;
    while((res = getopt(argc, argv, "p:m:i:hd:")) != -1)
    {
        switch(res)
        {
            case 'p':
                proto_file = optarg;
                break;
            case 'm':
                model_file = optarg;
                break;
            case 'i':
                image_file = optarg;
                break;
            case 'h':
                std::cout << "[Usage]: " << argv[0] << " [-h]\n"
                          << "   [-p proto_file] [-m model_file] [-i image_file]\n";
                return 0;
            default:
                break;
        }
    }
    if(proto_file.empty())
    {
        proto_file = root_path + DEF_PROTO;
        std::cout << "proto file not specified, using " << proto_file << " by default\n";
    }
    if(model_file.empty())
    {
        model_file = root_path + DEF_MODEL;
        std::cout << "model file not specified, using " << model_file << " by default\n";
    }
    if(image_file.empty())
    {
        image_file = root_path + DEF_IMAGE;
        std::cout << "image file not specified, using " << image_file << " by default\n";
    }

    /* do not let GPU run concat */
    setenv("GPU_CONCAT", "0", 1);
    /* using GPU fp16 */
    setenv("ACL_FP16", "1", 1);
    /* default CPU device using 0,1,2,3 */
    setenv("TENGINE_CPU_LIST", "2", 1);
    /* using fp32 or int8 */
    setenv("KERNEL_MODE", "2", 1);

    // init tengine
    init_tengine();
    if(request_tengine_version("0.9") < 0)
        return -1;

    // collect avg_time for each case
    float avg_times[3] = {0., 0., 0.};

    // thread 0 for cpu 2A72
    const struct cpu_info* p_info = get_predefined_cpu("rk3399");
    int a72_list[] = {4, 5};
    set_online_cpu((struct cpu_info*)p_info, a72_list, sizeof(a72_list) / sizeof(int));
    create_cpu_device("a72", p_info);

    // thread 2 for cpu 4A53
    const struct cpu_info* p_info1 = get_predefined_cpu("rk3399");
    int a53_list[] = {0, 1, 2, 3};
    set_online_cpu((struct cpu_info*)p_info1, a53_list, sizeof(a53_list) / sizeof(int));
    create_cpu_device("a53", p_info1);

#if 0
    if(load_model(model_name, "caffe", proto_file.c_str(), model_file.c_str()) < 0)
    {
        std::cout << "load model failed\n";
        return 1;
    }
    std::cout << "load model done!\n";
#endif

    pproto_file = proto_file.c_str();
    pmodel_file = model_file.c_str();
    thread_done = 0;
    std::thread* t0 = new std::thread(cpu_thread_a72, pproto_file, pmodel_file, &avg_times[0]);
    thread_num++;
    // thread 1 for gpu + 1 A53
    std::thread* t1 = new std::thread(gpu_thread, pproto_file, pmodel_file, &avg_times[1]);
    thread_num++;
    std::thread* t2 = new std::thread(cpu_thread_a53, pproto_file, pmodel_file, &avg_times[2]);
    thread_num++;
    t0->join();
    delete t0;
    t1->join();
    delete t1;
    t2->join();
    delete t2;
    std::cout << "thread_done: " << (int)thread_done << "\ntest done\n";
    std::cout << "=================================================\n";
    std::cout << " Using 3 thread, MSSD performance "
              << (1000. / avg_times[0] + 1000. / avg_times[1] + 1000. / avg_times[2])
              << " FPS \n";
    std::cout << "=================================================\n";
    release_tengine();
    return 0;
}
```
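The combined frame rate printed by main() is simply the sum of each thread's 1000/avg_ms. Plugging in the rounded per-loop times quoted above (200 ms, 217 ms, 314 ms) reproduces the reported figure to within measurement noise:

```shell
# Sum of per-thread frame rates: GPU + 2xA72 + 4xA53
awk 'BEGIN { printf "%.4f FPS\n", 1000/200 + 1000/217 + 1000/314 }'
# prints 12.7930 FPS (the article's 12.7692 used the exact measured averages)
```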
Original work. Reproduction without the rights holder's authorization is prohibited; see the site's reprint notice (转载须知) for details.