Accelerating Models with TVM: Optimizing Inference
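TVM is an open-source deep learning compiler that targets CPUs, GPUs, and other specialized accelerators. Its goal is to let us optimize and run our own models on any hardware. Where deep learning frameworks focus on productivity when building models, TVM focuses on the performance and efficiency of models on hardware.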
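This article only briefly introduces TVM's compilation flow and how to auto-tune your own model. For more depth, see the official TVM resources:

Docs: https://tvm.apache.org/docs/
Source: https://github.com/apache/tvm

Compilation Flow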
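The TVM document Design and Architecture[1] describes an example compilation flow, the logical architecture components, device and target implementations, and more. The flow is illustrated by the diagram in that document.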
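At a high level, the flow consists of the following steps:

Import: the frontend component ingests a model into an IRModule, which is a collection of functions that internally represent the model (IR).

Transformation: the compiler transforms an IRModule into another IRModule that is functionally equivalent or approximately equivalent (for example, in the case of quantization). Most transformations are independent of the target (backend). TVM also allows the target to affect the configuration of the transformation pipeline.

Target Translation: the compiler translates (code generation) the IRModule into an executable format for the target. The result is encapsulated as a runtime.Module, which can be exported, loaded, and executed in the target runtime environment.

Runtime Execution: the user loads a runtime.Module and runs the compiled functions in a supported runtime environment.

Tuning a Model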
The TVM User Tutorial[2] starts with how to compile and optimize a model, then works down step by step to lower-level components such as TE, TensorIR, and Relay.
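This section only covers how to auto-tune a model with AutoTVM, as a practical walk through how TVM compiles, tunes, and runs a model. The original tutorial is Compiling and Optimizing a Model with the Python Interface (AutoTVM)[3].

Preparing TVM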
First, install TVM. See Installing TVM[4] or the note "TVM 安装" (TVM installation)[5].
After that, the model can be tuned through the TVM Python API. Start by importing the following dependencies:

    import onnx
    from tvm.contrib.download import download_testdata
    from PIL import Image
    import numpy as np
    import tvm.relay as relay
    import tvm
    from tvm.contrib import graph_executor

Preparing and Loading the Model
Download the pretrained ResNet-50 v2 ONNX model and load it:

    model_url = "".join(
        [
            "https://github.com/onnx/models/raw/",
            "main/vision/classification/resnet/model/",
            "resnet50-v2-7.onnx",
        ]
    )

    model_path = download_testdata(model_url, "resnet50-v2-7.onnx", module="onnx")
    onnx_model = onnx.load(model_path)

Preparing and Preprocessing the Image
Download a test image and preprocess it into 224x224 NCHW format:

    img_url = "https://s3.amazonaws.com/model-server/inputs/kitten.jpg"
    img_path = download_testdata(img_url, "imagenet_cat.png", module="data")

    # Resize it to 224x224
    resized_image = Image.open(img_path).resize((224, 224))
    img_data = np.asarray(resized_image).astype("float32")

    # Our input image is in HWC layout while ONNX expects CHW input, so convert the array
    img_data = np.transpose(img_data, (2, 0, 1))

    # Normalize according to the ImageNet input specification
    imagenet_mean = np.array([0.485, 0.456, 0.406]).reshape((3, 1, 1))
    imagenet_stddev = np.array([0.229, 0.224, 0.225]).reshape((3, 1, 1))
    norm_img_data = (img_data / 255 - imagenet_mean) / imagenet_stddev

    # Add the batch dimension, as we are expecting 4-dimensional input: NCHW.
    img_data = np.expand_dims(norm_img_data, axis=0)

Compiling the Model with TVM Relay
TVM imports the ONNX model into Relay and builds the TVM graph module:

    # Ask for the target on the command line; default to "llvm" (CPU)
    target = input("target [llvm]: ")
    if not target:
        target = "llvm"
    # target = "llvm -mcpu=core-avx2"
    # target = "llvm -mcpu=skylake-avx512"

    # The input name may vary across model types. You can use a tool
    # like Netron to check input names
    input_name = "data"
    shape_dict = {input_name: img_data.shape}

    # Import the ONNX model into Relay
    mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

    # Build the Relay module for the chosen target
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)

    # Create the graph executor module on the target device
    dev = tvm.device(str(target), 0)
    module = graph_executor.GraphModule(lib["default"](dev))
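Here target is the target hardware platform. llvm means the CPU; specifying the CPU architecture and instruction set is recommended, since it allows better-optimized code. The following commands show the CPU:

    $ llc --version | grep CPU
      Host CPU: skylake
    $ lscpu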
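As a small illustration of turning the `Host CPU` line printed by llc into a target string, here is a minimal sketch. The helper name `target_from_llc_output` is hypothetical and not part of TVM, and using the reported CPU name directly as the `-mcpu` value is an assumption that should be checked against your LLVM version.

```python
import re


def target_from_llc_output(llc_output: str) -> str:
    """Build an LLVM target string from the text printed by `llc --version`.

    Falls back to plain "llvm" when no "Host CPU" line is found.
    (Hypothetical helper for illustration; not a TVM API.)
    """
    match = re.search(r"Host CPU:\s*(\S+)", llc_output)
    if match is None:
        return "llvm"
    return "llvm -mcpu=" + match.group(1)


# Sample text in the same shape as the llc output shown in this article.
sample = "  Host CPU: skylake\n"
print(target_from_llc_output(sample))
```

The resulting string can be typed at the `target [llvm]:` prompt of the script, in the same way as the commented-out `-mcpu` examples in the build code.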
You can also look up the product specifications on the vendor's website (for example, Intel Products[6]).

Running the Model with the TVM Runtime
Run the model with the TVM Runtime to make a prediction:

    dtype = "float32"
    module.set_input(input_name, img_data)
    module.run()
    output_shape = (1, 1000)
    tvm_output = module.get_output(0, tvm.nd.empty(output_shape)).numpy()

Collecting Performance Data Before Tuning
Collect performance data before tuning:

    import timeit

    timing_number = 10
    timing_repeat = 10
    # Per-run latency in milliseconds, one sample per repeat
    unoptimized = (
        np.array(timeit.Timer(lambda: module.run()).repeat(repeat=timing_repeat, number=timing_number))
        * 1000
        / timing_number
    )
    unoptimized = {
        "mean": np.mean(unoptimized),
        "median": np.median(unoptimized),
        "std": np.std(unoptimized),
    }

    print(unoptimized)
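The same measurement is needed again after tuning, so it can be convenient to wrap it in a helper. The following is a minimal standard-library sketch; the name `benchmark_ms` is an assumption made for illustration, and the stand-in workload replaces `module.run()` so the snippet runs without TVM.

```python
import statistics
import timeit


def benchmark_ms(fn, number=10, repeat=10):
    """Time `fn` and return per-call latency statistics in milliseconds."""
    # Each entry is the total time in seconds for `number` calls.
    totals = timeit.Timer(fn).repeat(repeat=repeat, number=number)
    per_call_ms = [t * 1000 / number for t in totals]
    return {
        "mean": statistics.mean(per_call_ms),
        "median": statistics.median(per_call_ms),
        # Population standard deviation, matching np.std's default.
        "std": statistics.pstdev(per_call_ms),
    }


# Stand-in workload; with TVM this would be `lambda: module.run()`.
stats = benchmark_ms(lambda: sum(range(1000)), number=10, repeat=5)
print(stats)
```

Calling the helper once before tuning and once after gives two dictionaries with the same keys, which makes the comparison direct.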
These baseline numbers are compared with the tuned model's performance later.

Postprocessing the Output to Get the Prediction
Postprocess the output tensor into human-readable classification results:

    from scipy.special import softmax

    # Download a list of labels
    labels_url = "https://s3.amazonaws.com/onnx-model-zoo/synset.txt"
    labels_path = download_testdata(labels_url, "synset.txt", module="data")

    with open(labels_path, "r") as f:
        labels = [l.rstrip() for l in f]

    # Open the output and read the output tensor
    scores = softmax(tvm_output)
    scores = np.squeeze(scores)
    ranks = np.argsort(scores)[::-1]
    # Print the five most likely classes
    for rank in ranks[0:5]:
        print("class='%s' with probability=%f" % (labels[rank], scores[rank]))

Tuning the Model and Collecting Tuning Data
On the target hardware platform, auto-tune with AutoTVM to collect tuning data:

    import tvm.auto_scheduler as auto_scheduler
    from tvm.autotvm.tuner import XGBTuner
    from tvm import autotvm

    number = 10
    repeat = 1
    min_repeat_ms = 0  # since we're tuning on a CPU, can be set to 0
    timeout = 10  # in seconds

    # create a TVM runner
    runner = autotvm.LocalRunner(
        number=number,
        repeat=repeat,
        timeout=timeout,
        min_repeat_ms=min_repeat_ms,
        enable_cpu_cache_flush=True,
    )

    tuning_option = {
        "tuner": "xgb",
        "trials": 10,
        "early_stopping": 100,
        "measure_option": autotvm.measure_option(
            builder=autotvm.LocalBuilder(build_func="default"), runner=runner
        ),
        "tuning_records": "resnet-50-v2-autotuning.json",
    }

    # begin by extracting the tasks from the onnx model
    tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)

    # Tune the extracted tasks sequentially.
    for i, task in enumerate(tasks):
        prefix = "[Task %2d/%2d] " % (i + 1, len(tasks))
        tuner_obj = XGBTuner(task, loss_type="rank")
        tuner_obj.tune(
            n_trial=min(tuning_option["trials"], len(task.config_space)),
            early_stopping=tuning_option["early_stopping"],
            measure_option=tuning_option["measure_option"],
            callbacks=[
                autotvm.callback.progress_bar(tuning_option["trials"], prefix=prefix),
                autotvm.callback.log_to_file(tuning_option["tuning_records"]),
            ],
        )
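To make the roles of `n_trial` and `early_stopping` concrete, here is a toy sketch of a tuning loop. It is a plain random search, not AutoTVM's XGBTuner, and the tile-size search space and cost function are made up for illustration. It only shows how the trial budget is capped by the size of the configuration space and how a run of trials without improvement ends the search early.

```python
import random


def toy_tune(config_space, measure, n_trial, early_stopping, seed=0):
    """Random-search stand-in for a tuner.

    Returns (best_config, best_cost, trials_run).
    """
    rng = random.Random(seed)
    # Never try more configurations than the space contains.
    candidates = rng.sample(config_space, min(n_trial, len(config_space)))
    best_config, best_cost = None, float("inf")
    since_improvement = 0
    trials_run = 0
    for config in candidates:
        cost = measure(config)
        trials_run += 1
        if cost < best_cost:
            best_config, best_cost = config, cost
            since_improvement = 0
        else:
            since_improvement += 1
        # Stop once this many consecutive trials brought no improvement.
        if since_improvement >= early_stopping:
            break
    return best_config, best_cost, trials_run


def fake_cost(tile):
    """Made-up cost model: pretend a tile size of 16 is the fastest."""
    return abs(tile - 16) + 1.0


# Hypothetical search space of tile sizes.
tile_sizes = [1, 2, 4, 8, 16, 32, 64, 128]

best, cost, ran = toy_tune(tile_sizes, fake_cost, n_trial=10, early_stopping=100)
print(best, cost, ran)
```

With the settings used in this article (10 trials and an early-stopping window of 100), the early-stopping condition cannot trigger within the trial budget, so every task runs its full 10 trials unless its configuration space is smaller.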
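The tuning_option dictionary selects the XGBoost Grid algorithm to guide the search, and the measurement results are logged to the file named by tuning_records.

Recompiling the Model with the Tuning Data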
Recompile an optimized model based on the tuning data:

    # Apply the best configurations found during tuning while building
    with autotvm.apply_history_best(tuning_option["tuning_records"]):
        with tvm.transform.PassContext(opt_level=3, config={}):
            lib = relay.build(mod, target=target, params=params)

    dev = tvm.device(str(target), 0)
    module = graph_executor.GraphModule(lib["default"](dev))

    # Verify that the optimized model runs and produces the same results
    dtype = "float32"
    module.set_input(input_name, img_data)
    module.run()
    output_shape = (1, 1000)
    tvm_output = module.get_output(0, tvm.nd.empty(output_shape)).numpy()

    scores = softmax(tvm_output)
    scores = np.squeeze(scores)
    ranks = np.argsort(scores)[::-1]
    for rank in ranks[0:5]:
        print("class='%s' with probability=%f" % (labels[rank], scores[rank]))

Comparing the Tuned and Untuned Models
Collect performance data after tuning and compare it with the data collected before:

    import timeit

    timing_number = 10
    timing_repeat = 10
    # Per-run latency in milliseconds, one sample per repeat
    optimized = (
        np.array(timeit.Timer(lambda: module.run()).repeat(repeat=timing_repeat, number=timing_number))
        * 1000
        / timing_number
    )
    optimized = {"mean": np.mean(optimized), "median": np.median(optimized), "std": np.std(optimized)}

    print("optimized: %s" % (optimized))
    print("unoptimized: %s" % (unoptimized))
The output of the whole tuning run is as follows:

    $ time python autotvm_tune.py
    # TVM 编译运行模型
    ## Downloading and Loading the ONNX Model
    ## Downloading, Preprocessing, and Loading the Test Image
    ## Compile the Model With Relay
    target [llvm]: llvm -mcpu=core-avx2
    One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details.
    ## Execute on the TVM Runtime
    ## Collect Basic Performance Data
    {'mean': 44.97057118016528, 'median': 42.52320024970686, 'std': 6.870915251002107}
    ## Postprocess the output
    class='n02123045 tabby, tabby cat' with probability=0.621104
    class='n02123159 tiger cat' with probability=0.356378
    class='n02124075 Egyptian cat' with probability=0.019712
    class='n02129604 tiger, Panthera tigris' with probability=0.001215
    class='n04040759 radiator' with probability=0.000262
    # AutoTVM 调优模型 [Y/n]
    ## Tune the model
    [Task  1/25]  Current/Best:  156.96/ 353.76 GFLOPS | Progress: (10/10) | 4.78 s Done.
    [Task  2/25]  Current/Best:   54.66/ 241.25 GFLOPS | Progress: (10/10) | 2.88 s Done.
    [Task  3/25]  Current/Best:  116.71/ 241.30 GFLOPS | Progress: (10/10) | 3.48 s Done.
    [Task  4/25]  Current/Best:  119.92/ 184.18 GFLOPS | Progress: (10/10) | 3.48 s Done.
    [Task  5/25]  Current/Best:   48.92/ 158.38 GFLOPS | Progress: (10/10) | 3.13 s Done.
    [Task  6/25]  Current/Best:  156.89/ 230.95 GFLOPS | Progress: (10/10) | 2.82 s Done.
    [Task  7/25]  Current/Best:   92.33/ 241.99 GFLOPS | Progress: (10/10) | 2.40 s Done.
    [Task  8/25]  Current/Best:   50.04/ 331.82 GFLOPS | Progress: (10/10) | 2.64 s Done.
    [Task  9/25]  Current/Best:  188.47/ 409.93 GFLOPS | Progress: (10/10) | 4.44 s Done.
    [Task 10/25]  Current/Best:   44.81/ 181.67 GFLOPS | Progress: (10/10) | 2.32 s Done.
    [Task 11/25]  Current/Best:   83.74/ 312.66 GFLOPS | Progress: (10/10) | 2.74 s Done.
    [Task 12/25]  Current/Best:   96.48/ 294.40 GFLOPS | Progress: (10/10) | 2.82 s Done.
    [Task 13/25]  Current/Best:  123.74/ 354.34 GFLOPS | Progress: (10/10) | 2.62 s Done.
    [Task 14/25]  Current/Best:   23.76/ 178.71 GFLOPS | Progress: (10/10) | 2.90 s Done.
    [Task 15/25]  Current/Best:  119.18/ 534.63 GFLOPS | Progress: (10/10) | 2.49 s Done.
    [Task 16/25]  Current/Best:  101.24/ 172.92 GFLOPS | Progress: (10/10) | 2.49 s Done.
    [Task 17/25]  Current/Best:  309.85/ 309.85 GFLOPS | Progress: (10/10) | 2.69 s Done.
    [Task 18/25]  Current/Best:   54.45/ 368.31 GFLOPS | Progress: (10/10) | 2.46 s Done.
    [Task 19/25]  Current/Best:   78.69/ 162.43 GFLOPS | Progress: (10/10) | 3.29 s Done.
    [Task 20/25]  Current/Best:   40.78/ 317.50 GFLOPS | Progress: (10/10) | 4.52 s Done.
    [Task 21/25]  Current/Best:  169.03/ 296.36 GFLOPS | Progress: (10/10) | 3.95 s Done.
    [Task 22/25]  Current/Best:   90.96/ 210.43 GFLOPS | Progress: (10/10) | 2.28 s Done.
    [Task 23/25]  Current/Best:   48.93/ 217.36 GFLOPS | Progress: (10/10) | 2.87 s Done.
    [Task 25/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s Done.
    [Task 25/25]  Current/Best:   25.50/  33.86 GFLOPS | Progress: (10/10) | 9.28 s Done.
    ## Compiling an Optimized Model with Tuning Data
    class='n02123045 tabby, tabby cat' with probability=0.621104
    class='n02123159 tiger cat' with probability=0.356378
    class='n02124075 Egyptian cat' with probability=0.019712
    class='n02129604 tiger, Panthera tigris' with probability=0.001215
    class='n04040759 radiator' with probability=0.000262
    ## Comparing the Tuned and Untuned Models
    optimized: {'mean': 34.736288779822644, 'median': 34.547542000655085, 'std': 0.5144378649382363}
    unoptimized: {'mean': 44.97057118016528, 'median': 42.52320024970686, 'std': 6.870915251002107}

    real    3m23.904s
    user    5m2.900s
    sys     5m37.099s
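The two mean latencies reported in this run can be turned into a speedup figure with a few lines of arithmetic. This is a minimal sketch; the helper name `compare_latency` is hypothetical, and the two input values are copied from the output above.

```python
def compare_latency(unoptimized_ms, optimized_ms):
    """Return (speedup factor, percentage reduction in latency)."""
    speedup = unoptimized_ms / optimized_ms
    reduction_pct = (1 - optimized_ms / unoptimized_ms) * 100
    return speedup, reduction_pct


# Mean latencies in milliseconds, taken from the run shown in this article.
unoptimized_mean = 44.97057118016528
optimized_mean = 34.736288779822644

speedup, reduction = compare_latency(unoptimized_mean, optimized_mean)
print("speedup: %.2fx, mean latency reduced by %.1f%%" % (speedup, reduction))
```

Keep in mind that this run used only 10 trials per task, so the figure describes this particular short tuning session rather than what AutoTVM can reach with a larger budget.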
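Comparing the performance data shows that the tuned model runs faster and more steadily: mean latency drops from about 45.0 ms to about 34.7 ms, and the standard deviation drops from about 6.9 ms to about 0.5 ms.

References

Notes: start-ai-compiler[7]

Materials:
2020 / The Deep Learning Compiler: A Comprehensive Survey[8]
[Chinese translation] 深度学习编译器综述[9]
2018 / TVM: An Automated End-to-End Optimizing Compiler for Deep Learning[10]
[Chinese translation] TVM: 一个自动的端到端深度学习优化编译器[11]

Footnotes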
[1] Design and Architecture: https://tvm.apache.org/docs/arch/index.html
[2] User Tutorial: https://tvm.apache.org/docs/tutorial/index.html
[3] Compiling and Optimizing a Model with the Python Interface (AutoTVM): https://tvm.apache.org/docs/tutorial/autotvm_relay_x86.html
[4] Installing TVM: https://tvm.apache.org/docs/tutorial/install.html
[5] "TVM 安装" (TVM installation): https://github.com/ikuokuo/start-ai-compiler/blob/main/docs/tvm/tvm_install.md
[6] Intel Products: https://www.intel.com/content/www/us/en/products/overview.html
[7] start-ai-compiler: https://github.com/ikuokuo/start-ai-compiler#%E7%AC%94%E8%AE%B0
[8] 2020 / The Deep Learning Compiler: A Comprehensive Survey: https://arxiv.org/abs/2002.03794
[9] [Chinese translation] 深度学习编译器综述: https://www.jianshu.com/p/ed372af7ef09
[10] 2018 / TVM: An Automated End-to-End Optimizing Compiler for Deep Learning: https://www.usenix.org/conference/osdi18/presentation/chen
[11] [Chinese translation] TVM: 一个自动的端到端深度学习优化编译器: https://zhuanlan.zhihu.com/p/426994569