Accelerating Models with TVM: Optimizing Inference

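TVM is an open-source deep learning compiler that targets CPUs, GPUs, and other specialized accelerators. Its goal is to let us optimize and run our own models on any hardware. Whereas deep learning frameworks focus on model productivity, TVM focuses on a model's performance and efficiency on the hardware.

This article gives only a brief introduction to TVM's compilation flow and to auto-tuning your own model. For more depth, see the official TVM resources:

Docs: https://tvm.apache.org/docs/
Source: https://github.com/apache/tvm

Compilation Flow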
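The TVM document Design and Architecture[1] walks through an example compilation flow, the logical architecture components, device/target implementations, and more. That document also illustrates the flow with a diagram.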
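At a high level, the flow consists of the following steps:

Import: The frontend component ingests a model into an IRModule, which is a collection of functions that internally represent (IR) the model.
Transformation: The compiler transforms an IRModule into another IRModule that is functionally equivalent or approximately equivalent (e.g., in the case of quantization). Most transformations are independent of the target (backend). TVM also allows the target to affect the configuration of the transformation pipeline.
Target Translation: The compiler translates (code-generates) the IRModule into an executable format for the target. The result is encapsulated as a runtime.Module that can be exported, loaded, and executed in the target runtime environment.
Runtime Execution: The user loads a runtime.Module and runs the compiled functions in a supported runtime environment.

Tuning a Model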
The TVM User Tutorial[2] starts with how to compile and optimize a model and gradually works down to lower-level components such as TE, TensorIR, and Relay.
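Here we cover only how to auto-tune a model with AutoTVM, to get a practical feel for how TVM compiles, tunes, and runs a model. The original tutorial is Compiling and Optimizing a Model with the Python Interface (AutoTVM)[3].

Preparing TVM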
First, install TVM. See the document Installing TVM[4] or the note "TVM Installation"[5].
After that, the model can be tuned through the TVM Python API. We start by importing the following dependencies:

    import onnx
    from tvm.contrib.download import download_testdata
    from PIL import Image
    import numpy as np
    import tvm.relay as relay
    import tvm
    from tvm.contrib import graph_executor

Preparing and Loading the Model
Fetch the pre-trained ResNet-50 v2 ONNX model and load it:

    model_url = "".join(
        [
            "https://github.com/onnx/models/raw/",
            "main/vision/classification/resnet/model/",
            "resnet50-v2-7.onnx",
        ]
    )

    model_path = download_testdata(model_url, "resnet50-v2-7.onnx", module="onnx")
    onnx_model = onnx.load(model_path)

Preparing and Preprocessing the Image
Fetch a test image and preprocess it into the 224x224 NCHW format:

    img_url = "https://s3.amazonaws.com/model-server/inputs/kitten.jpg"
    img_path = download_testdata(img_url, "imagenet_cat.png", module="data")

    # Resize it to 224x224
    resized_image = Image.open(img_path).resize((224, 224))
    img_data = np.asarray(resized_image).astype("float32")

    # Our input image is in HWC layout while ONNX expects CHW input, so convert the array
    img_data = np.transpose(img_data, (2, 0, 1))

    # Normalize according to the ImageNet input specification
    imagenet_mean = np.array([0.485, 0.456, 0.406]).reshape((3, 1, 1))
    imagenet_stddev = np.array([0.229, 0.224, 0.225]).reshape((3, 1, 1))
    norm_img_data = (img_data / 255 - imagenet_mean) / imagenet_stddev

    # Add the batch dimension, as we are expecting 4-dimensional input: NCHW.
    img_data = np.expand_dims(norm_img_data, axis=0)

Compiling the Model with TVM Relay
TVM imports the ONNX model into Relay and builds a TVM graph module:

    target = input("target [llvm]: ")
    if not target:
        target = "llvm"
        # target = "llvm -mcpu=core-avx2"
        # target = "llvm -mcpu=skylake-avx512"

    # The input name may vary across model types. You can use a tool
    # like Netron to check input names
    input_name = "data"
    shape_dict = {input_name: img_data.shape}

    mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)

    dev = tvm.device(str(target), 0)
    module = graph_executor.GraphModule(lib["default"](dev))
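Here target is the target hardware platform. llvm means running on the CPU; it is recommended to also specify the CPU architecture (instruction set), which can improve performance. The following commands show the CPU:

    $ llc --version | grep CPU
      Host CPU: skylake
    $ lscpu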
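As a small illustration of turning that `llc` output into a target string, here is a minimal sketch. The helper names `host_cpu_from_llc` and `llvm_target` are hypothetical (they are not part of TVM), and it assumes that the CPU name reported by `llc` is the one you want to pass to `-mcpu`.

```python
import re


def host_cpu_from_llc(llc_output):
    """Return the CPU name from `llc --version` output, or None if absent."""
    match = re.search(r"Host CPU:\s*(\S+)", llc_output)
    return match.group(1) if match else None


def llvm_target(cpu=None):
    """Build a target string; fall back to plain "llvm" when no CPU is known."""
    return f"llvm -mcpu={cpu}" if cpu else "llvm"


# Sample text copied from the command output shown above
sample = "  Host CPU: skylake\n"
print(llvm_target(host_cpu_from_llc(sample)))
```

The resulting string could be typed at the `target [llvm]:` prompt of the script above. If `llc` reports a generic or unknown CPU, plain `llvm` remains the safe choice.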
Alternatively, look up the product specifications directly on the vendor's website (e.g., Intel Products[6]).

Running the Model with the TVM Runtime
Run the model with the TVM Runtime to make a prediction:

    dtype = "float32"
    module.set_input(input_name, img_data)
    module.run()
    output_shape = (1, 1000)
    tvm_output = module.get_output(0, tvm.nd.empty(output_shape)).numpy()

Collecting Performance Data Before Tuning
Collect the performance data before tuning (per-run latency in milliseconds):

    import timeit

    timing_number = 10
    timing_repeat = 10
    unoptimized = (
        np.array(timeit.Timer(lambda: module.run()).repeat(repeat=timing_repeat, number=timing_number))
        * 1000
        / timing_number
    )
    unoptimized = {
        "mean": np.mean(unoptimized),
        "median": np.median(unoptimized),
        "std": np.std(unoptimized),
    }

    print(unoptimized)
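The same measurement pattern is used again after tuning, so it can be wrapped in a helper. The following is a minimal sketch using only the standard library; `summarize_latency` is a hypothetical name, and `statistics.pstdev` stands in for `np.std` (both compute the population standard deviation by default).

```python
import statistics
import timeit


def summarize_latency(fn, number=10, repeat=10):
    """Time `fn` and return per-call latency statistics in milliseconds."""
    # Each entry of `totals` is the total time (seconds) for `number` calls
    totals = timeit.Timer(fn).repeat(repeat=repeat, number=number)
    per_call_ms = [t * 1000 / number for t in totals]
    return {
        "mean": statistics.fmean(per_call_ms),
        "median": statistics.median(per_call_ms),
        "std": statistics.pstdev(per_call_ms),
    }


# Stand-in workload; with the article's code one would pass module.run instead
stats = summarize_latency(lambda: sum(range(1000)))
print(stats)
```

Dividing by `number` converts each repeat's total into an average per-call latency, which is why the reported statistics describe single inference runs rather than batches of ten.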
These numbers are used later for comparison with the tuned performance.

Postprocessing the Output to Get the Prediction
Postprocess the raw output into a human-readable classification result:

    from scipy.special import softmax

    # Download a list of labels
    labels_url = "https://s3.amazonaws.com/onnx-model-zoo/synset.txt"
    labels_path = download_testdata(labels_url, "synset.txt", module="data")

    with open(labels_path, "r") as f:
        labels = [l.rstrip() for l in f]

    # Open the output and read the output tensor
    scores = softmax(tvm_output)
    scores = np.squeeze(scores)
    ranks = np.argsort(scores)[::-1]
    for rank in ranks[0:5]:
        print("class='%s' with probability=%f" % (labels[rank], scores[rank]))

Tuning the Model and Collecting Tuning Data
On the target hardware platform, auto-tune the model with AutoTVM and collect the tuning data:

    import tvm.auto_scheduler as auto_scheduler
    from tvm.autotvm.tuner import XGBTuner
    from tvm import autotvm

    number = 10
    repeat = 1
    min_repeat_ms = 0  # since we're tuning on a CPU, can be set to 0
    timeout = 10  # in seconds

    # create a TVM runner
    runner = autotvm.LocalRunner(
        number=number,
        repeat=repeat,
        timeout=timeout,
        min_repeat_ms=min_repeat_ms,
        enable_cpu_cache_flush=True,
    )

    tuning_option = {
        "tuner": "xgb",
        "trials": 10,
        "early_stopping": 100,
        "measure_option": autotvm.measure_option(
            builder=autotvm.LocalBuilder(build_func="default"), runner=runner
        ),
        "tuning_records": "resnet-50-v2-autotuning.json",
    }

    # begin by extracting the tasks from the onnx model
    tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)

    # Tune the extracted tasks sequentially.
    for i, task in enumerate(tasks):
        prefix = "[Task %2d/%2d] " % (i + 1, len(tasks))
        tuner_obj = XGBTuner(task, loss_type="rank")
        tuner_obj.tune(
            n_trial=min(tuning_option["trials"], len(task.config_space)),
            early_stopping=tuning_option["early_stopping"],
            measure_option=tuning_option["measure_option"],
            callbacks=[
                autotvm.callback.progress_bar(tuning_option["trials"], prefix=prefix),
                autotvm.callback.log_to_file(tuning_option["tuning_records"]),
            ],
        )
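For intuition about what `n_trial` and `early_stopping` control in the loop above, here is a toy sketch of a tuning loop. It is not AutoTVM's implementation: it uses plain random search instead of an XGBoost cost model, and the search space, the `toy_tune` function, and the `fake_cost` function are all hypothetical.

```python
import itertools
import random


def toy_tune(config_space, measure, n_trial, early_stopping):
    """Random-search stand-in for a tuner: try configs, keep the cheapest."""
    # Never request more trials than there are configurations
    n_trial = min(n_trial, len(config_space))
    candidates = random.sample(config_space, n_trial)
    best_cfg, best_cost, since_best = None, float("inf"), 0
    for cfg in candidates:
        cost = measure(cfg)
        if cost < best_cost:
            best_cfg, best_cost, since_best = cfg, cost, 0
        else:
            since_best += 1
            # Stop once this many trials in a row bring no improvement
            if since_best >= early_stopping:
                break
    return best_cfg, best_cost


# Hypothetical search space: (tile size, unroll flag) pairs
space = list(itertools.product([1, 2, 4, 8, 16], [False, True]))


def fake_cost(cfg):
    """Made-up cost: cheapest at tile size 8 with unrolling enabled."""
    tile, unroll = cfg
    return abs(tile - 8) + (0 if unroll else 1)


random.seed(0)
best_cfg, best_cost = toy_tune(space, fake_cost, n_trial=10, early_stopping=100)
print(best_cfg, best_cost)
```

With only 10 trials per task, as in the article's settings, an early-stopping threshold of 100 can never trigger, so every task simply runs its full trial budget.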
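The tuning_option above selects the XGBoost-based tuner ("tuner": "xgb", instantiated in the loop as XGBTuner) to guide the search, and the tuning data is recorded in the tuning_records file.

Recompiling the Model with Tuning Data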
Recompile the model into an optimized one based on the tuning data:

    with autotvm.apply_history_best(tuning_option["tuning_records"]):
        with tvm.transform.PassContext(opt_level=3, config={}):
            lib = relay.build(mod, target=target, params=params)

    dev = tvm.device(str(target), 0)
    module = graph_executor.GraphModule(lib["default"](dev))

    # Verify that the optimized model runs and produces the same results

    dtype = "float32"
    module.set_input(input_name, img_data)
    module.run()
    output_shape = (1, 1000)
    tvm_output = module.get_output(0, tvm.nd.empty(output_shape)).numpy()

    scores = softmax(tvm_output)
    scores = np.squeeze(scores)
    ranks = np.argsort(scores)[::-1]
    for rank in ranks[0:5]:
        print("class='%s' with probability=%f" % (labels[rank], scores[rank]))

Comparing the Tuned and Untuned Models
Collect the performance data after tuning and compare it with the data from before:

    import timeit

    timing_number = 10
    timing_repeat = 10
    optimized = (
        np.array(timeit.Timer(lambda: module.run()).repeat(repeat=timing_repeat, number=timing_number))
        * 1000
        / timing_number
    )
    optimized = {"mean": np.mean(optimized), "median": np.median(optimized), "std": np.std(optimized)}

    print("optimized: %s" % (optimized))
    print("unoptimized: %s" % (unoptimized))
The output of the whole tuning run is as follows:

    $ time python autotvm_tune.py
    # Compile and run the model with TVM
    ## Downloading and Loading the ONNX Model
    ## Downloading, Preprocessing, and Loading the Test Image
    ## Compile the Model With Relay
    target [llvm]: llvm -mcpu=core-avx2
    One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details.
    ## Execute on the TVM Runtime
    ## Collect Basic Performance Data
    {'mean': 44.97057118016528, 'median': 42.52320024970686, 'std': 6.870915251002107}
    ## Postprocess the output
    class='n02123045 tabby, tabby cat' with probability=0.621104
    class='n02123159 tiger cat' with probability=0.356378
    class='n02124075 Egyptian cat' with probability=0.019712
    class='n02129604 tiger, Panthera tigris' with probability=0.001215
    class='n04040759 radiator' with probability=0.000262
    # Tune the model with AutoTVM [Y/n]
    ## Tune the model
    [Task  1/25]  Current/Best:  156.96/ 353.76 GFLOPS | Progress: (10/10) | 4.78 s Done.
    [Task  2/25]  Current/Best:   54.66/ 241.25 GFLOPS | Progress: (10/10) | 2.88 s Done.
    [Task  3/25]  Current/Best:  116.71/ 241.30 GFLOPS | Progress: (10/10) | 3.48 s Done.
    [Task  4/25]  Current/Best:  119.92/ 184.18 GFLOPS | Progress: (10/10) | 3.48 s Done.
    [Task  5/25]  Current/Best:   48.92/ 158.38 GFLOPS | Progress: (10/10) | 3.13 s Done.
    [Task  6/25]  Current/Best:  156.89/ 230.95 GFLOPS | Progress: (10/10) | 2.82 s Done.
    [Task  7/25]  Current/Best:   92.33/ 241.99 GFLOPS | Progress: (10/10) | 2.40 s Done.
    [Task  8/25]  Current/Best:   50.04/ 331.82 GFLOPS | Progress: (10/10) | 2.64 s Done.
    [Task  9/25]  Current/Best:  188.47/ 409.93 GFLOPS | Progress: (10/10) | 4.44 s Done.
    [Task 10/25]  Current/Best:   44.81/ 181.67 GFLOPS | Progress: (10/10) | 2.32 s Done.
    [Task 11/25]  Current/Best:   83.74/ 312.66 GFLOPS | Progress: (10/10) | 2.74 s Done.
    [Task 12/25]  Current/Best:   96.48/ 294.40 GFLOPS | Progress: (10/10) | 2.82 s Done.
    [Task 13/25]  Current/Best:  123.74/ 354.34 GFLOPS | Progress: (10/10) | 2.62 s Done.
    [Task 14/25]  Current/Best:   23.76/ 178.71 GFLOPS | Progress: (10/10) | 2.90 s Done.
    [Task 15/25]  Current/Best:  119.18/ 534.63 GFLOPS | Progress: (10/10) | 2.49 s Done.
    [Task 16/25]  Current/Best:  101.24/ 172.92 GFLOPS | Progress: (10/10) | 2.49 s Done.
    [Task 17/25]  Current/Best:  309.85/ 309.85 GFLOPS | Progress: (10/10) | 2.69 s Done.
    [Task 18/25]  Current/Best:   54.45/ 368.31 GFLOPS | Progress: (10/10) | 2.46 s Done.
    [Task 19/25]  Current/Best:   78.69/ 162.43 GFLOPS | Progress: (10/10) | 3.29 s Done.
    [Task 20/25]  Current/Best:   40.78/ 317.50 GFLOPS | Progress: (10/10) | 4.52 s Done.
    [Task 21/25]  Current/Best:  169.03/ 296.36 GFLOPS | Progress: (10/10) | 3.95 s Done.
    [Task 22/25]  Current/Best:   90.96/ 210.43 GFLOPS | Progress: (10/10) | 2.28 s Done.
    [Task 23/25]  Current/Best:   48.93/ 217.36 GFLOPS | Progress: (10/10) | 2.87 s Done.
    [Task 25/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s Done.
    [Task 25/25]  Current/Best:   25.50/  33.86 GFLOPS | Progress: (10/10) | 9.28 s Done.
    ## Compiling an Optimized Model with Tuning Data
    class='n02123045 tabby, tabby cat' with probability=0.621104
    class='n02123159 tiger cat' with probability=0.356378
    class='n02124075 Egyptian cat' with probability=0.019712
    class='n02129604 tiger, Panthera tigris' with probability=0.001215
    class='n04040759 radiator' with probability=0.000262
    ## Comparing the Tuned and Untuned Models
    optimized: {'mean': 34.736288779822644, 'median': 34.547542000655085, 'std': 0.5144378649382363}
    unoptimized: {'mean': 44.97057118016528, 'median': 42.52320024970686, 'std': 6.870915251002107}

    real    3m23.904s
    user    5m2.900s
    sys     5m37.099s
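To put numbers on the comparison, the two reported dictionaries can be reduced to a few ratios. This is a small sketch that only reuses the values printed in the run above; the variable names `speedup`, `latency_cut`, and `std_ratio` are introduced here for illustration.

```python
# Values copied from the run output above (milliseconds per inference)
unoptimized = {"mean": 44.97057118016528, "median": 42.52320024970686, "std": 6.870915251002107}
optimized = {"mean": 34.736288779822644, "median": 34.547542000655085, "std": 0.5144378649382363}

# How many times faster the tuned model is, based on mean latency
speedup = unoptimized["mean"] / optimized["mean"]
# Fraction of mean latency removed by tuning
latency_cut = 1 - optimized["mean"] / unoptimized["mean"]
# How much the run-to-run spread shrank
std_ratio = unoptimized["std"] / optimized["std"]

print(f"speedup: {speedup:.2f}x")
print(f"mean latency reduced by {latency_cut:.1%}")
print(f"std shrank by a factor of {std_ratio:.1f}")
```

For this particular run, that works out to roughly a 1.29x speedup on the mean, with a standard deviation more than ten times smaller. With only 10 trials per task and a single run on one machine, these figures are indicative rather than a benchmark.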
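Comparing the performance data shows that the tuned model runs faster and more consistently.

References

Notes: start-ai-compiler[7]
Materials:
2020 / The Deep Learning Compiler: A Comprehensive Survey[8]
[Chinese translation] The Deep Learning Compiler: A Comprehensive Survey[9]
2018 / TVM: An Automated End-to-End Optimizing Compiler for Deep Learning[10]
[Chinese translation] TVM: An Automated End-to-End Optimizing Compiler for Deep Learning[11]

Footnotes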
[1] Design and Architecture: https://tvm.apache.org/docs/arch/index.html
[2] User Tutorial: https://tvm.apache.org/docs/tutorial/index.html
[3] Compiling and Optimizing a Model with the Python Interface (AutoTVM): https://tvm.apache.org/docs/tutorial/autotvm_relay_x86.html
[4] Installing TVM: https://tvm.apache.org/docs/tutorial/install.html
[5] "TVM Installation": https://github.com/ikuokuo/start-ai-compiler/blob/main/docs/tvm/tvm_install.md
[6] Intel Products: https://www.intel.com/content/www/us/en/products/overview.html
[7] start-ai-compiler: https://github.com/ikuokuo/start-ai-compiler#%E7%AC%94%E8%AE%B0
[8] 2020 / The Deep Learning Compiler: A Comprehensive Survey: https://arxiv.org/abs/2002.03794
[9] [Chinese translation] The Deep Learning Compiler: A Comprehensive Survey: https://www.jianshu.com/p/ed372af7ef09
[10] 2018 / TVM: An Automated End-to-End Optimizing Compiler for Deep Learning: https://www.usenix.org/conference/osdi18/presentation/chen
[11] [Chinese translation] TVM: An Automated End-to-End Optimizing Compiler for Deep Learning: https://zhuanlan.zhihu.com/p/426994569
