Following in Kaiming He's footsteps: a complete walkthrough of reproducing the SlowFast action recognition model

Authors: 付辉辉, 周钰臣
Editor: 极市平台 (Jishi Platform)

Preface
In recent years there has been more and more research on deep-learning-based human action recognition. The SlowFast model, with its slow and fast pathways, performs very well on action recognition datasets. This article covers data preparation for SlowFast, how to train the model, and how to run SlowFast inference with ONNX. It focuses in particular on SlowFast inference with TensorRT, on person tracking with yolov5 and deepsort, and on C++ deployment.

1. Data preparation

1.1 Trimming the videos
Prepare several groups of videos. IN_DATA_DIR is the directory holding the original videos and OUT_DATA_DIR is the directory for the trimmed output videos. This step makes sure all videos have the same length.

```bash
IN_DATA_DIR="/project/train/src_repo/data/video"
OUT_DATA_DIR="/project/train/src_repo/data/splitvideo"
str="_"

if [[ ! -d "${OUT_DATA_DIR}" ]]; then
  echo "${OUT_DATA_DIR} doesn't exist. Creating it.";
  mkdir -p ${OUT_DATA_DIR}
fi

for video in $(ls -A1 -U ${IN_DATA_DIR}/*)
do
  for i in {0..10}
  do
    index=$(expr $i \* 10)
    out_name="${OUT_DATA_DIR}/${i}${str}${video##*/}"
    if [ ! -f "${out_name}" ]; then
      ffmpeg -ss ${index} -t 80 -i "${video}" "${out_name}"
    fi
  done
done
```

1.2 Extracting key frames
Key frames are extracted at one frame per second. IN_DATA_DIR is the directory of the videos produced in step 1, and OUT_DATA_DIR is the directory where the extracted key frames are stored.

```bash
# Cut into images, one frame per second
IN_DATA_DIR="/project/train/src_repo/data/splitvideo/"
OUT_DATA_DIR="/project/train/src_repo/data/splitimages/"

if [[ ! -d "${OUT_DATA_DIR}" ]]; then
  echo "${OUT_DATA_DIR} doesn't exist. Creating it.";
  mkdir -p ${OUT_DATA_DIR}
fi

for video in $(ls -A1 -U ${IN_DATA_DIR}/*)
do
  video_name=${video##*/}

  if [[ $video_name = *".webm" ]]; then
    video_name=${video_name::-5}
  else
    video_name=${video_name::-4}
  fi

  out_video_dir=${OUT_DATA_DIR}/${video_name}/
  mkdir -p "${out_video_dir}"

  out_name="${out_video_dir}/${video_name}_%06d.jpg"

  ffmpeg -i "${video}" -r 1 -q:v 1 "${out_name}"
done
```

1.3 Splitting the videos
Split the videos produced in step 1 into frames with ffmpeg, at 30 frames per second. IN_DATA_DIR is the directory holding the videos and OUT_DATA_DIR is the directory for the results.

```bash
IN_DATA_DIR="/project/train/src_repo/video"
OUT_DATA_DIR="/project/train/src_repo/spiltvideo"

if [[ ! -d "${OUT_DATA_DIR}" ]]; then
  echo "${OUT_DATA_DIR} doesn't exist. Creating it.";
  mkdir -p ${OUT_DATA_DIR}
fi

for video in $(ls -A1 -U ${IN_DATA_DIR}/*)
do
  out_name="${OUT_DATA_DIR}/${video##*/}"
  if [ ! -f "${out_name}" ]; then
    ffmpeg -ss 0 -t 100 -i "${video}" "${out_name}"
  fi
done
```

1.4 Directory layout

```
ava  # top-level folder holding the video information
—person_box_67091280_iou90  # second-level folder holding the detection files
——ava_detection_train_boxes_and_labels_include_negative_v2.2.csv  # detection results used for training
——ava_detection_val_boxes_and_labels.csv  # detection results used for validation
—ava_action_list_v2.2_for_activitynet_2019.pbtxt  # label definitions
—ava_val_excluded_timestamps_v2.2.csv  # frames that contain no person; they are dropped during training
—ava_train_v2.2.csv  # training data: key-frame annotations
—ava_val_v2.2.csv  # validation data: key-frame annotations

frame_lists  # top-level folder holding the paths of the images generated in 1.3
—train.csv
—val.csv

frames  # top-level folder holding the images generated in 1.3
—A
——A_000001.jpg
——A_0000012.jpg
…
——A_000090.jpg
—B
——B_000001.jpg
——B_0000012.jpg
…
——B_000090.jpg
```

2. Environment setup

2.1 Python packages

```bash
pip install iopath
pip install fvcore
pip install simplejson
pip install pytorchvideo
```

2.2 Installing detectron2

```python
!python -m pip install pyyaml==5.1
import sys, os, distutils.core
# Note: This is a faster way to install detectron2 in Colab, but it does not include all functionalities.
# See https://detectron2.readthedocs.io/tutorials/install.html for full installation instructions
!git clone 'https://github.com/facebookresearch/detectron2'
dist = distutils.core.run_setup("./detectron2/setup.py")
!python -m pip install {' '.join([f"'{x}'" for x in dist.install_requires])}
sys.path.insert(0, os.path.abspath('./detectron2'))
```

3. Training SlowFast

3.1 Training

```bash
python tools/run_net.py --cfg configs/AVA/SLOWFAST_32x2_R50_SHORT.yaml
```
SLOWFAST_32x2_R50_SHORT.yaml:

```yaml
TRAIN:
  ENABLE: True
  DATASET: ava
  BATCH_SIZE: 8 #64
  EVAL_PERIOD: 5
  CHECKPOINT_PERIOD: 1
  AUTO_RESUME: True
  CHECKPOINT_FILE_PATH: "/content/SLOWFAST_32x2_R101_50_50.pkl"  # path to the pretrained model
  CHECKPOINT_TYPE: pytorch
DATA:
  NUM_FRAMES: 32
  SAMPLING_RATE: 2
  TRAIN_JITTER_SCALES: [256, 320]
  TRAIN_CROP_SIZE: 224
  TEST_CROP_SIZE: 224
  INPUT_CHANNEL_NUM: [3, 3]
  PATH_TO_DATA_DIR: "/content/ava"
DETECTION:
  ENABLE: True
  ALIGNED: True
AVA:
  FRAME_DIR: "/content/ava/frames"   # directories generated in the data preparation step
  FRAME_LIST_DIR: "/content/ava/frame_lists"
  ANNOTATION_DIR: "/content/ava/annotations"
  DETECTION_SCORE_THRESH: 0.5
  FULL_TEST_ON_VAL: True
  TRAIN_PREDICT_BOX_LISTS: [
    "ava_train_v2.2.csv",
    "person_box_67091280_iou90/ava_detection_train_boxes_and_labels_include_negative_v2.2.csv",
  ]
  TEST_PREDICT_BOX_LISTS: [
    "person_box_67091280_iou90/ava_detection_val_boxes_and_labels.csv"]
SLOWFAST:
  ALPHA: 4
  BETA_INV: 8
  FUSION_CONV_CHANNEL_RATIO: 2
  FUSION_KERNEL_SZ: 7
RESNET:
  ZERO_INIT_FINAL_BN: True
  WIDTH_PER_GROUP: 64
  NUM_GROUPS: 1
  DEPTH: 50
  TRANS_FUNC: bottleneck_transform
  STRIDE_1X1: False
  NUM_BLOCK_TEMP_KERNEL: [[3, 3], [4, 4], [6, 6], [3, 3]]
  SPATIAL_DILATIONS: [[1, 1], [1, 1], [1, 1], [2, 2]]
  SPATIAL_STRIDES: [[1, 1], [2, 2], [2, 2], [1, 1]]
NONLOCAL:
  LOCATION: [[[], []], [[], []], [[], []], [[], []]]
  GROUP: [[1, 1], [1, 1], [1, 1], [1, 1]]
  INSTANTIATION: dot_product
  POOL: [[[1, 2, 2], [1, 2, 2]], [[1, 2, 2], [1, 2, 2]], [[1, 2, 2], [1, 2, 2]], [[1, 2, 2], [1, 2, 2]]]
BN:
  USE_PRECISE_STATS: False
  NUM_BATCHES_PRECISE: 20
SOLVER:
  BASE_LR: 0.1
  LR_POLICY: steps_with_relative_lrs
  STEPS: [0, 10, 15, 20]
  LRS: [1, 0.1, 0.01, 0.001]
  MAX_EPOCH: 20
  MOMENTUM: 0.9
  WEIGHT_DECAY: 1e-7
  WARMUP_EPOCHS: 5.0
  WARMUP_START_LR: 0.000125
  OPTIMIZING_METHOD: sgd
MODEL:
  NUM_CLASSES: 1
  ARCH: slowfast
  MODEL_NAME: SlowFast
  LOSS_FUNC: bce
  DROPOUT_RATE: 0.5
  HEAD_ACT: sigmoid
TEST:
  ENABLE: False
  DATASET: ava
  BATCH_SIZE: 8
DATA_LOADER:
  NUM_WORKERS: 0
  PIN_MEMORY: True
NUM_GPUS: 1
NUM_SHARDS: 1
RNG_SEED: 0
OUTPUT_DIR: .
```

3.2 Common errors during training
1. In slowfast/datasets/ava_helper.py, change AVA_VALID_FRAMES to match the length of your own videos.
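A minimal sketch of the kind of change meant here; the replacement range is only an example and the exact values depend on how your own key frames are numbered:

```python
# slowfast/datasets/ava_helper.py (excerpt)
# Upstream, the constant covers the annotated second range of the original AVA movies,
# roughly: AVA_VALID_FRAMES = range(902, 1799)
# For custom clips whose key frames are numbered 1..N seconds
# (N = 80 with the trimming script above), widen/replace it accordingly:
AVA_VALID_FRAMES = range(1, 81)  # example values only
```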
2. pytorchvideo.layers.distributed import error:

```
from pytorchvideo.layers.distributed import ( # noqa
ImportError: cannot import name 'cat_all_gather' from 'pytorchvideo.layers.distributed'
(/site-packages/pytorchvideo/layers/distributed.py)
```

3. pytorchvideo.losses import error:

```
File "SlowFast/slowfast/models/losses.py", line 11, in
from pytorchvideo.losses.soft_target_cross_entropy import (
ModuleNotFoundError: No module named 'pytorchvideo.losses'
```

Errors 2 and 3 can be resolved by following reference link 1.
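Both errors come from the pip release of pytorchvideo lagging behind what the SlowFast code expects; a common workaround (and likely what reference link 1 walks through) is to reinstall pytorchvideo from source so that `cat_all_gather` and `pytorchvideo.losses` are available, e.g. `pip uninstall -y pytorchvideo` followed by `pip install "git+https://github.com/facebookresearch/pytorchvideo.git"`.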
4. SlowFast inference

Option 1: run inference with the official script:

```bash
python tools/run_net.py --cfg demo/AVA/SLOWFAST_32x2_R101_50_50.yaml
```
Option 2: because of the detectron2 installation issues, and a series of deployment problems later on, you can instead run inference with yolov5 plus slowfast.
First, let's walk through how slowfast inference works.
Step 1: keep reading frames until 64 consecutive frames have been collected, checking that reads still succeed:

```python
while was_read:
    frames = []
    seq_length = 64
    while was_read and len(frames) < seq_length:
        was_read, frame = cap.read()
        frames.append(frame)
```
Step 2: run person detection with yolov5.
1. yolov5 inference code; change the sys.path.insert path and the weights path `weights` to your own:

```python
import argparse
import os
import platform
import shutil
import time
from pathlib import Path
import sys
import json
sys.path.insert(1, '/content/drive/MyDrive/yolov5/')
import cv2
import torch
import torch.backends.cudnn as cudnn
import numpy as np
from numpy import random
from models.common import DetectMultiBackend
from utils.augmentations import letterbox
from utils.general import check_img_size, non_max_suppression, scale_coords, set_logging
from utils.torch_utils import select_device

# ####### parameter settings
conf_thres = 0.6
iou_thres = 0.5
#######
imgsz = 640
weights = "/content/yolov5l.pt"
device = '0'
stride = 32
names = ["person"]

def init():
    # Initialize
    global imgsz, device, stride
    set_logging()
    device = select_device('0')
    half = device.type != 'cpu'  # half precision only supported on CUDA
    model = DetectMultiBackend(weights, device=device, dnn=False)
    stride, pt, jit, engine = model.stride, model.pt, model.jit, model.engine
    imgsz = check_img_size(imgsz, s=stride)  # check img_size
    model.half()  # to FP16
    model.eval()
    return model

def process_image(model, input_image=None, args=None, **kwargs):
    img0 = input_image
    img = letterbox(img0, new_shape=imgsz, stride=stride, auto=True)[0]
    img = img.transpose((2, 0, 1))[::-1]  # HWC to CHW, BGR to RGB
    img = np.ascontiguousarray(img)

    img = torch.from_numpy(img).to(device)
    img = img.half()
    img /= 255.0  # 0 - 255 to 0.0 - 1.0
    if len(img.shape) == 3:
        img = img[None]
    pred = model(img, augment=False, val=True)[0]
    pred = non_max_suppression(pred, conf_thres, iou_thres, agnostic=False)
    result = []
    for i, det in enumerate(pred):  # detections per image
        gn = torch.tensor(img0.shape)[[1, 0, 1, 0]]  # normalization gain whwh
        if det is not None and len(det):
            # Rescale boxes from img_size to im0 size
            det[:, :4] = scale_coords(img.shape[2:], det[:, :4], img0.shape).round()
            for *xyxy, conf, cls in det:
                if cls == 0:
                    result.append([float(xyxy[0]), float(xyxy[1]), float(xyxy[2]), float(xyxy[3])])
    if len(result) == 0:
        return None
    return torch.from_numpy(np.array(result))
```
2. bbox preprocessing:

```python
def scale_boxes(size, boxes, height, width):
    """
    Scale the short side of the box to size.
    Args:
        size (int): size to scale the image.
        boxes (ndarray): bounding boxes to perform scale. The dimension is
        `num boxes` x 4.
        height (int): the height of the image.
        width (int): the width of the image.
    Returns:
        boxes (ndarray): scaled bounding boxes.
    """
    if (width <= height and width == size) or (
        height <= width and height == size
    ):
        return boxes

    new_width = size
    new_height = size
    if width < height:
        new_height = int(math.floor((float(height) / width) * size))
        boxes *= float(new_height) / height
    else:
        new_width = int(math.floor((float(width) / height) * size))
        boxes *= float(new_width) / width
    return boxes
```
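To make the scaling concrete, here is a small, hypothetical usage sketch. It assumes the `scale_boxes` defined above is in scope and that the source frames are 1920x1080, which matches the hard-coded `scale_boxes(256, bboxes, 1080, 1920)` call used later in this article:

```python
import math  # required by scale_boxes
import numpy as np

# A single person box (x1, y1, x2, y2) detected in a 1920x1080 frame.
boxes = np.array([[100.0, 200.0, 500.0, 800.0]])

# The short side (1080) is rescaled to 256, so every coordinate is
# multiplied by roughly 256/1080 ~= 0.237.
print(scale_boxes(256, boxes, 1080, 1920))
# -> approximately [[23.7, 47.4, 118.5, 189.6]]
```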
Step 3: image preprocessing
1. Resize the image:

```python
def scale(size, image):
    """
    Scale the short side of the image to size.
    Args:
        size (int): size to scale the image.
        image (array): image to perform short side scale. Dimension is
            `height` x `width` x `channel`.
    Returns:
        (ndarray): the scaled image with dimension of
            `height` x `width` x `channel`.
    """
    height = image.shape[0]
    width = image.shape[1]
    # print(height, width)
    if (width <= height and width == size) or (
        height <= width and height == size
    ):
        return image
    new_width = size
    new_height = size
    if width < height:
        new_height = int(math.floor((float(height) / width) * size))
    else:
        new_width = int(math.floor((float(width) / height) * size))
    img = cv2.resize(
        image, (new_width, new_height), interpolation=cv2.INTER_LINEAR
    )
    # print(new_width, new_height)
    return img.astype(np.float32)
```
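A quick sanity check of where the fixed 256x455 input size used later in this article (in the ONNX export and the C++ preprocessing) comes from. This sketch assumes the `scale` function above is in scope and that the original frames are 1920x1080:

```python
import math

import cv2
import numpy as np

# A dummy 1080p frame: scaling the short side to 256 yields a 256x455 image,
# which is exactly the fixed input resolution used in sections 5-7.
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
print(scale(256, frame).shape)  # (256, 455, 3)
```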
2. Normalization:

```python
def tensor_normalize(tensor, mean, std, func=None):
    """
    Normalize a given tensor by subtracting the mean and dividing by the std.
    Args:
        tensor (tensor): tensor to normalize.
        mean (tensor or list): mean value to subtract.
        std (tensor or list): std to divide by.
    """
    if tensor.dtype == torch.uint8:
        tensor = tensor.float()
        tensor = tensor / 255.0
    if type(mean) == list:
        mean = torch.tensor(mean)
    if type(std) == list:
        std = torch.tensor(std)
    if func is not None:
        tensor = func(tensor)
    tensor = tensor - mean
    tensor = tensor / std
    return tensor
```
3. Build the slow and fast pathway inputs.
The main idea is to take 32 of the 64 buffered frames as the fast pathway input, then take 8 of those 32 frames as the slow pathway input, and convert the layout from T x H x W x C to C x T x H x W. The fast_pathway tensor therefore has shape (b, 3, 32, h, w) and the slow_pathway tensor has shape (b, 3, 8, h, w).

```python
def process_cv2_inputs(frames):
    """
    Normalize and prepare inputs as a list of tensors. Each tensor
    correspond to a unique pathway.
    Args:
        frames (list of array): list of input images (correspond to one clip) in range [0, 255].
        cfg (CfgNode): configs. Details can be found in
            slowfast/config/defaults.py
    """
    inputs = torch.from_numpy(np.array(frames)).float() / 255
    inputs = tensor_normalize(inputs, [0.45, 0.45, 0.45], [0.225, 0.225, 0.225])
    # T H W C -> C T H W.
    inputs = inputs.permute(3, 0, 1, 2)
    # Sample frames for num_frames specified.
    index = torch.linspace(0, inputs.shape[1] - 1, 32).long()
    print(index)
    inputs = torch.index_select(inputs, 1, index)
    fast_pathway = inputs
    slow_pathway = torch.index_select(
        inputs,
        1,
        torch.linspace(
            0, inputs.shape[1] - 1, inputs.shape[1] // 4
        ).long(),
    )
    frame_list = [slow_pathway, fast_pathway]
    print(np.shape(frame_list[0]))
    inputs = [inp.unsqueeze(0) for inp in frame_list]
    return inputs
```

5. SlowFast ONNX inference

5.1 Exporting the ONNX file

```python
import os
import sys
from collections import OrderedDict
import torch
import argparse

work_root = os.path.split(os.path.realpath(__file__))[0]
from slowfast.config.defaults import get_cfg
import slowfast.utils.checkpoint as cu
from slowfast.models import build_model


def parser_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--cfg",
        dest="cfg_file",
        type=str,
        default=os.path.join(
            work_root, "/content/drive/MyDrive/SlowFast/demo/AVA/SLOWFAST_32x2_R101_50_50.yaml"),
        help="Path to the config file",
    )
    parser.add_argument(
        "--half",
        type=bool,
        default=False,
        help="use half mode",
    )
    parser.add_argument(
        "--checkpoint",
        type=str,
        default=os.path.join(work_root,
                             "/content/SLOWFAST_32x2_R101_50_50.pkl"),
        help="test model file path",
    )
    parser.add_argument(
        "--save",
        type=str,
        default=os.path.join(work_root, "/content/SLOWFAST_head.onnx"),
        help="save model file path",
    )
    return parser.parse_args()


def main():
    args = parser_args()
    print(args)
    cfg_file = args.cfg_file
    checkpoint_file = args.checkpoint
    save_checkpoint_file = args.save
    half_flag = args.half
    cfg = get_cfg()
    cfg.merge_from_file(cfg_file)
    cfg.TEST.CHECKPOINT_FILE_PATH = checkpoint_file
    print(cfg.DATA)
    print("export pytorch model to onnx!")
    device = "cuda:0"
    with torch.no_grad():
        model = build_model(cfg)
        model = model.to(device)
        model.eval()
        cu.load_test_checkpoint(cfg, model)
        if half_flag:
            model.half()
        fast_pathway = torch.randn(1, 3, 32, 256, 455)
        slow_pathway = torch.randn(1, 3, 8, 256, 455)
        bbox = torch.randn(32, 5).to(device)
        fast_pathway = fast_pathway.to(device)
        slow_pathway = slow_pathway.to(device)
        inputs = [slow_pathway, fast_pathway]
        for p in model.parameters():
            p.requires_grad = False
        torch.onnx.export(model, (inputs, bbox), save_checkpoint_file,
                          input_names=["slow_pathway", "fast_pathway", "bbox"],
                          output_names=["output"], opset_version=12)
        onnx_check()


def onnx_check():
    import onnx
    args = parser_args()
    print(args)
    onnx_model_path = args.save
    model = onnx.load(onnx_model_path)
    onnx.checker.check_model(model)


if __name__ == "__main__":
    main()
```

5.2 ONNX inference

```python
import torch
import math
import onnxruntime
from torchvision.ops import roi_align
import argparse
import os
import platform
import shutil
import time
from pathlib import Path
import sys
import json
sys.path.insert(1, '/content/drive/MyDrive/yolov5/')
import cv2
import torch.backends.cudnn as cudnn
import numpy as np
from numpy import random
from models.common import DetectMultiBackend
from utils.augmentations import letterbox
from utils.general import check_img_size, non_max_suppression, scale_coords, set_logging
from utils.torch_utils import select_device

# ####### parameter settings
conf_thres = 0.6
iou_thres = 0.5
#######
imgsz = 640
weights = "/content/yolov5l.pt"
device = '0'
stride = 32
names = ["person"]

# init() and the preprocessing helpers scale(), tensor_normalize(),
# scale_boxes() and process_cv2_inputs() are identical to the versions shown
# in section 4 above and are omitted here.

def process_image(model, input_image=None, args=None, **kwargs):
    # Same as in section 4, except that the detection list is padded with
    # all-zero boxes up to 32 entries so the bbox input always has shape (32, 4).
    img0 = input_image
    img = letterbox(img0, new_shape=imgsz, stride=stride, auto=True)[0]
    img = img.transpose((2, 0, 1))[::-1]  # HWC to CHW, BGR to RGB
    img = np.ascontiguousarray(img)

    img = torch.from_numpy(img).to(device)
    img = img.half()
    img /= 255.0  # 0 - 255 to 0.0 - 1.0
    if len(img.shape) == 3:
        img = img[None]
    pred = model(img, augment=False, val=True)[0]
    pred = non_max_suppression(pred, conf_thres, iou_thres, agnostic=False)
    result = []
    for i, det in enumerate(pred):  # detections per image
        gn = torch.tensor(img0.shape)[[1, 0, 1, 0]]  # normalization gain whwh
        if det is not None and len(det):
            # Rescale boxes from img_size to im0 size
            det[:, :4] = scale_coords(img.shape[2:], det[:, :4], img0.shape).round()
            for *xyxy, conf, cls in det:
                if cls == 0:
                    result.append([float(xyxy[0]), float(xyxy[1]), float(xyxy[2]), float(xyxy[3])])
    if len(result) == 0:
        return None
    for i in range(32 - len(result)):
        result.append([float(0), float(0), float(0), float(0)])
    return torch.from_numpy(np.array(result))

# load the models
yolov5 = init()
slowfast = onnxruntime.InferenceSession("/content/SLOWFAST_32x2_R101_50_50.onnx")

# load the data and run inference
cap = cv2.VideoCapture("/content/atm_125.mp4")
was_read = True
while was_read:
    frames = []
    seq_length = 64
    while was_read and len(frames) < seq_length:
        was_read, frame = cap.read()
        frames.append(frame)

    bboxes = process_image(yolov5, frames[64 // 2])
    if bboxes is not None:
        frames = [cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) for frame in frames]
        frames = [scale(256, frame) for frame in frames]
        inputs = process_cv2_inputs(frames)
        if bboxes is not None:
            bboxes = scale_boxes(256, bboxes, 1080, 1920)
            index_pad = torch.full(
                size=(bboxes.shape[0], 1),
                fill_value=float(0),
                device=bboxes.device,
            )
            # Pad frame index for each box.
            bboxes = torch.cat([index_pad, bboxes], axis=1)
        for i in range(len(inputs)):
            inputs[i] = inputs[i].numpy()
        if bboxes is not None:
            # onnxruntime expects numpy float32 inputs
            outputs = slowfast.run(None, {"slow_pathway": inputs[0],
                                          "fast_pathway": inputs[1],
                                          "bbox": bboxes.numpy().astype(np.float32)})
            for i in range(80):
                if outputs[0][0][i] > 0.3:
                    print(i)
            print(np.shape(outputs[0]))
    else:
        print("No person detected")
```
6. SlowFast Python TensorRT inference

6.1 Exporting to TensorRT

This is where the main contribution of this article comes in.
Our first attempt was to export the onnx model directly to TensorRT. The export failed, and the reason turned out to be that roi_align was not yet implemented in the TensorRT version used here (roi_align is slated for the next TensorRT release).
Looking at the exported onnx graph, you can see that roi_align is only used in the head part of the network.
We therefore take the approach shown in the figure below: pull the roi_align module out so it is not accelerated by TensorRT, and split slowfast into two networks, where the backbone extracts features and the head performs the action classification.
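The article does not show the onnx-to-engine conversion step itself; the code below simply loads `SLOWFAST_32x2_R101_50_50.engine` and `SLOWFAST_head.engine`. Assuming the two split onnx files are named `body.onnx` and `head.onnx` as in section 7, one common way to build the engines is TensorRT's bundled `trtexec` tool, e.g. `trtexec --onnx=body.onnx --saveEngine=SLOWFAST_32x2_R101_50_50.engine` and `trtexec --onnx=head.onnx --saveEngine=SLOWFAST_head.engine`; any other TensorRT engine-building workflow works just as well.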
6.2 TensorRT inference code

```python
import ctypes
import os
import numpy as np
import cv2
import random
import tensorrt as trt
import pycuda.autoinit
import pycuda.driver as cuda
import threading
import time


class TrtInference():
    _batch_size = 1

    def __init__(self, model_path=None, cuda_ctx=None):
        self._model_path = model_path
        if self._model_path is None:
            print("please set trt model path!")
            exit()
        self.cuda_ctx = cuda_ctx
        if self.cuda_ctx is None:
            self.cuda_ctx = cuda.Device(0).make_context()
        if self.cuda_ctx:
            self.cuda_ctx.push()
        self.trt_logger = trt.Logger(trt.Logger.INFO)
        self._load_plugins()
        self.engine = self._load_engine()
        try:
            self.context = self.engine.create_execution_context()
            self.stream = cuda.Stream()
            for index, binding in enumerate(self.engine):
                if self.engine.binding_is_input(binding):
                    batch_shape = list(self.engine.get_binding_shape(binding)).copy()
                    batch_shape[0] = self._batch_size
                    self.context.set_binding_shape(index, batch_shape)
            self.host_inputs, self.host_outputs, self.cuda_inputs, self.cuda_outputs, self.bindings = self._allocate_buffers()
        except Exception as e:
            raise RuntimeError("fail to allocate CUDA resources") from e
        finally:
            if self.cuda_ctx:
                self.cuda_ctx.pop()

    def _load_plugins(self):
        pass

    def _load_engine(self):
        with open(self._model_path, "rb") as f, trt.Runtime(self.trt_logger) as runtime:
            return runtime.deserialize_cuda_engine(f.read())

    def _allocate_buffers(self):
        host_inputs, host_outputs, cuda_inputs, cuda_outputs, bindings = \
            [], [], [], [], []
        for index, binding in enumerate(self.engine):
            size = trt.volume(self.context.get_binding_shape(index)) * \
                self.engine.max_batch_size
            host_mem = cuda.pagelocked_empty(size, np.float32)
            cuda_mem = cuda.mem_alloc(host_mem.nbytes)
            bindings.append(int(cuda_mem))
            if self.engine.binding_is_input(binding):
                host_inputs.append(host_mem)
                cuda_inputs.append(cuda_mem)
            else:
                host_outputs.append(host_mem)
                cuda_outputs.append(cuda_mem)
        return host_inputs, host_outputs, cuda_inputs, cuda_outputs, bindings

    def destroy(self):
        """Free CUDA memories and context."""
        del self.cuda_outputs
        del self.cuda_inputs
        del self.stream
        if self.cuda_ctx:
            self.cuda_ctx.pop()
            del self.cuda_ctx

    def inference(self, inputs):
        np.copyto(self.host_inputs[0], inputs[0].ravel())
        np.copyto(self.host_inputs[1], inputs[1].ravel())
        if self.cuda_ctx:
            self.cuda_ctx.push()
        cuda.memcpy_htod_async(
            self.cuda_inputs[0], self.host_inputs[0], self.stream)
        cuda.memcpy_htod_async(
            self.cuda_inputs[1], self.host_inputs[1], self.stream)
        self.context.execute_async(
            batch_size=1,
            bindings=self.bindings,
            stream_handle=self.stream.handle)
        cuda.memcpy_dtoh_async(
            self.host_outputs[0], self.cuda_outputs[0], self.stream)
        cuda.memcpy_dtoh_async(
            self.host_outputs[1], self.cuda_outputs[1], self.stream)
        self.stream.synchronize()
        if self.cuda_ctx:
            self.cuda_ctx.pop()
        output = [self.host_outputs[0], self.host_outputs[1]]
        return output


class TrtInference_head(TrtInference):
    # Engine loading and buffer management are identical to TrtInference;
    # the head engine has a single output binding, so inference() only
    # copies one output back to the host.
    def inference(self, inputs):
        np.copyto(self.host_inputs[0], inputs[0].ravel())
        np.copyto(self.host_inputs[1], inputs[1].ravel())
        if self.cuda_ctx:
            self.cuda_ctx.push()
        cuda.memcpy_htod_async(
            self.cuda_inputs[0], self.host_inputs[0], self.stream)
        cuda.memcpy_htod_async(
            self.cuda_inputs[1], self.host_inputs[1], self.stream)
        self.context.execute_async(
            batch_size=1,
            bindings=self.bindings,
            stream_handle=self.stream.handle)
        cuda.memcpy_dtoh_async(
            self.host_outputs[0], self.cuda_outputs[0], self.stream)
        self.stream.synchronize()
        if self.cuda_ctx:
            self.cuda_ctx.pop()
        output = self.host_outputs[0]
        return output


import torch
import math
from torchvision.ops import roi_align
import sys
sys.path.insert(1, '/content/drive/MyDrive/yolov5/')
from models.common import DetectMultiBackend
from utils.augmentations import letterbox
from utils.general import check_img_size, non_max_suppression, scale_coords, set_logging
from utils.torch_utils import select_device

# ####### parameter settings
conf_thres = 0.89
iou_thres = 0.5
#######
imgsz = 640
weights = "/content/yolov5l.pt"
device = '0'
stride = 32
names = ["person"]

# init(), process_image() (with zero-padding to 32 boxes), scale(),
# tensor_normalize(), scale_boxes() and process_cv2_inputs() are identical
# to the versions in section 5.2 and are omitted here.

# load the models
yolov5 = init()
slowfast = TrtInference("/content/SLOWFAST_32x2_R101_50_50.engine", None)
head = TrtInference_head("/content/SLOWFAST_head.engine", None)

# load the data and run inference
cap = cv2.VideoCapture("/content/atm_125.mp4")
was_read = True
while was_read:
    frames = []
    seq_length = 64
    while was_read and len(frames) < seq_length:
        was_read, frame = cap.read()
        frames.append(frame)

    bboxes = process_image(yolov5, frames[64 // 2])
    if bboxes is not None:
        frames = [cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) for frame in frames]
        frames = [scale(256, frame) for frame in frames]
        inputs = process_cv2_inputs(frames)
        print(bboxes)
        if bboxes is not None:
            bboxes = scale_boxes(256, bboxes, 1080, 1920)
            index_pad = torch.full(
                size=(bboxes.shape[0], 1),
                fill_value=float(0),
                device=bboxes.device,
            )
            # Pad frame index for each box.
            bboxes = torch.cat([index_pad, bboxes], axis=1)
        for i in range(len(inputs)):
            inputs[i] = inputs[i].numpy()
        if bboxes is not None:
            outputs = slowfast.inference(inputs)
            outputs[0] = outputs[0].reshape(1, 2048, 16, 29)
            outputs[1] = outputs[1].reshape(1, 256, 16, 29)
            outputs[0] = torch.from_numpy(outputs[0])
            outputs[1] = torch.from_numpy(outputs[1])
            outputs[0] = roi_align(outputs[0], bboxes.to(dtype=outputs[0].dtype), 7, 1.0/16, 0, True)
            outputs[1] = roi_align(outputs[1], bboxes.to(dtype=outputs[1].dtype), 7, 1.0/16, 0, True)
            outputs[0] = outputs[0].numpy()
            outputs[1] = outputs[1].numpy()
            prd = head.inference(outputs)
            prd = prd.reshape(32, 80)
            for i in range(80):
                if prd[0][i] > 0.3:
                    print(i)
    else:
        print("No person detected")
```
As the code above shows, slow_pathway and fast_pathway are first passed through the slowfast backbone model, the backbone outputs are reshaped to the dimensions that roi_align expects, and the reshaped features together with the bboxes and the corresponding parameters are fed into roi_align to produce the inputs required by the head model.

7. SlowFast C++ TensorRT deployment

7.1 yolov5 object detection in C++
This article does not cover yolov5 itself; I simply use the platform's own yolov5 TensorRT code: https://github.com/ExtremeMart/ev_sdk_demo4.0_pedestrian_intrusion_yolov5

7.2 deepsort object tracking in C++
This article uses the following deepsort code as a reference: https://github.com/RichardoMrMu/deepsort-tensorrt
Since this part is not the focus of this article, you only need to know how to call the code: write the CMakeLists file, and then use deepsort in your code as follows:

```cpp
#include "deepsort.h"

/**
  DeepSortBox holds the yolov5 detections.
  Each DeepSortBox entry has the structure
  {
    x1,
    y1,
    x2,
    y2,
    score,
    label,
    trackID
  }
  img is the original image.
  The final results are written back into DeepSortBox.
*/
DS->sort(img, DeepSortBox);
```

7.3 slowfast action recognition in C++
Runtime environment:

TensorRT 8.4
OpenCV 4.1.1
cuDNN 8.0
CUDA 11.1

Files needed:

body.onnx
head.onnx
SlowFast inference flowchart (figure)
We implement the TensorRT inference code by following the inference flowchart above.
Use an onnx visualization tool to inspect the inputs and outputs of body.onnx,
and the inputs and outputs of head.onnx.
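The graph screenshots are not reproduced here. If you do not want to use a graphical viewer such as Netron, the binding names and shapes can also be printed with onnxruntime; a small sketch (the file paths are placeholders):

```python
import onnxruntime

# Print the input/output names and shapes of the two exported models so the
# C++ code below can bind them by the correct names.
for path in ("body.onnx", "head.onnx"):
    sess = onnxruntime.InferenceSession(path)
    print(path)
    for inp in sess.get_inputs():
        print("  input :", inp.name, inp.shape, inp.type)
    for out in sess.get_outputs():
        print("  output:", out.name, out.shape, out.type)
```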
Step 1: model loading
Load body.onnx and head.onnx with TensorRT and allocate the TensorRT inference workspace. The code is as follows:

```cpp
void loadheadOnnx(const std::string strModelName)
{
    Logger gLogger;
    // Build the network following the TensorRT pipeline
    IBuilder* builder = createInferBuilder(gLogger);
    builder->setMaxBatchSize(1);
    const auto explicitBatch = 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    INetworkDefinition* network = builder->createNetworkV2(explicitBatch);
    nvonnxparser::IParser* parser = nvonnxparser::createParser(*network, gLogger);
    parser->parseFromFile(strModelName.c_str(), static_cast<int>(ILogger::Severity::kWARNING));
    IBuilderConfig* config = builder->createBuilderConfig();
    config->setMaxWorkspaceSize(1ULL << 30);
    m_CudaheadEngine = builder->buildEngineWithConfig(*network, *config);

    std::string strTrtName = strModelName;
    size_t sep_pos = strTrtName.find_last_of(".");
    strTrtName = strTrtName.substr(0, sep_pos) + ".trt";
    IHostMemory* gieModelStream = m_CudaheadEngine->serialize();
    std::string serialize_str;
    std::ofstream serialize_output_stream;
    serialize_str.resize(gieModelStream->size());
    memcpy((void*)serialize_str.data(), gieModelStream->data(), gieModelStream->size());
    serialize_output_stream.open(strTrtName.c_str());
    serialize_output_stream << serialize_str;
    serialize_output_stream.close();
    m_CudaheadContext = m_CudaheadEngine->createExecutionContext();
    parser->destroy();
    network->destroy();
    config->destroy();
    builder->destroy();
}
```
Step 2: allocate memory for the input and output data
The inputs of body.onnx are slow_pathway and fast_pathway with dimensions (B, C, T, H, W): slow_pathway has T = 8 and its output is (B, 2048, 16, 29); fast_pathway has T = 32 and its output is (B, 256, 16, 29). The head takes (32, 2048, 7, 7) and (32, 256, 7, 7) as inputs and outputs (32, 80). The code is as follows:

```cpp
    slow_pathway_InputIndex = m_CudaslowfastEngine->getBindingIndex(slow_pathway_NAME);
    fast_pathway_InputIndex = m_CudaslowfastEngine->getBindingIndex(fast_pathway_NAME);
    slow_pathway_OutputIndex = m_CudaslowfastEngine->getBindingIndex(slow_pathway_OUTPUT);
    fast_pathway_OutputIndex = m_CudaslowfastEngine->getBindingIndex(fast_pathway_OUTPUT);

    dims_i = m_CudaslowfastEngine->getBindingDimensions(slow_pathway_InputIndex);
    SDKLOG(INFO) << "slow_pathway dims " << dims_i.d[0] << " " << dims_i.d[1] << " " << dims_i.d[2] << " " << dims_i.d[3] << " " << dims_i.d[4];
    size = dims_i.d[0] * dims_i.d[1] * dims_i.d[2] * dims_i.d[3] * dims_i.d[4];
    cudaMalloc(&slowfast_ArrayDevMemory[slow_pathway_InputIndex], size * sizeof(float));
    slowfast_ArrayHostMemory[slow_pathway_InputIndex] = malloc(size * sizeof(float));
    slowfast_ArraySize[slow_pathway_InputIndex] = size * sizeof(float);

    dims_i = m_CudaslowfastEngine->getBindingDimensions(fast_pathway_InputIndex);
    SDKLOG(INFO) << "fast_pathway dims " << dims_i.d[0] << " " << dims_i.d[1] << " " << dims_i.d[2] << " " << dims_i.d[3] << " " << dims_i.d[4];
    size = dims_i.d[0] * dims_i.d[1] * dims_i.d[2] * dims_i.d[3] * dims_i.d[4];
    cudaMalloc(&slowfast_ArrayDevMemory[fast_pathway_InputIndex], size * sizeof(float));
    slowfast_ArrayHostMemory[fast_pathway_InputIndex] = malloc(size * sizeof(float));
    slowfast_ArraySize[fast_pathway_InputIndex] = size * sizeof(float);

    dims_i = m_CudaslowfastEngine->getBindingDimensions(slow_pathway_OutputIndex);
    SDKLOG(INFO) << "slow_out dims " << dims_i.d[0] << " " << dims_i.d[1] << " " << dims_i.d[2] << " " << dims_i.d[3];
    size = dims_i.d[0] * dims_i.d[1] * dims_i.d[2] * dims_i.d[3];
    cudaMalloc(&slowfast_ArrayDevMemory[slow_pathway_OutputIndex], size * sizeof(float));
    slowfast_ArrayHostMemory[slow_pathway_OutputIndex] = malloc(size * sizeof(float));
    slowfast_ArraySize[slow_pathway_OutputIndex] = size * sizeof(float);

    dims_i = m_CudaslowfastEngine->getBindingDimensions(fast_pathway_OutputIndex);
    SDKLOG(INFO) << "fast_out dims " << dims_i.d[0] << " " << dims_i.d[1] << " " << dims_i.d[2] << " " << dims_i.d[3];
    size = dims_i.d[0] * dims_i.d[1] * dims_i.d[2] * dims_i.d[3];
    cudaMalloc(&slowfast_ArrayDevMemory[fast_pathway_OutputIndex], size * sizeof(float));
    slowfast_ArrayHostMemory[fast_pathway_OutputIndex] = malloc(size * sizeof(float));
    slowfast_ArraySize[fast_pathway_OutputIndex] = size * sizeof(float);

    size = 32 * 2048 * 7 * 7;
    cudaMalloc(&ROIAlign_ArrayDevMemory[0], size * sizeof(float));
    ROIAlign_ArrayHostMemory[0] = malloc(size * sizeof(float));
    ROIAlign_ArraySize[0] = size * sizeof(float);

    size = 32 * 256 * 7 * 7;
    cudaMalloc(&ROIAlign_ArrayDevMemory[1], size * sizeof(float));
    ROIAlign_ArrayHostMemory[1] = malloc(size * sizeof(float));
    ROIAlign_ArraySize[1] = size * sizeof(float);

    size = 32 * 80;
    cudaMalloc(&ROIAlign_ArrayDevMemory[2], size * sizeof(float));
    ROIAlign_ArrayHostMemory[2] = malloc(size * sizeof(float));
    ROIAlign_ArraySize[2] = size * sizeof(float);

    size = 32 * 5;
    boxes_data = malloc(size * sizeof(float));
    dims_i = m_CudaheadEngine->getBindingDimensions(0);
```
Step 3: input data preprocessing
First, because I did not use dynamic shapes when exporting the onnx files, the input image size is fixed at 256x455 (which is 1080x1920 scaled down while keeping the aspect ratio). The slowfast model expects RGB, so each image has to be converted from BGR to RGB and then resized to 256x455. The code is as follows:

```cpp
        cv::Mat framesimg = img.clone();
        cv::cvtColor(framesimg, framesimg, cv::COLOR_BGR2RGB);
        int height = framesimg.rows;
        int width = framesimg.cols;
        // Preprocess the image: scale the short side to 256
        int size = 256;
        int new_width = width;
        int new_height = height;
        if ((width <= height && width == size) || (height <= width && height == size)) {
            // already the right size, nothing to do
        }
        else {
            new_width = size;
            new_height = size;
            if (width < height) {
                new_height = int(floor((float(height) / width) * size));
            }
            else {
                new_width = int(floor((float(width) / height) * size));
            }
            cv::resize(framesimg, framesimg, cv::Size(new_width, new_height));
        }
        // Fast pathway: all 32 sampled frames are written into the buffer
        float* data = (float*)slowfast_ArrayHostMemory[fast_pathway_InputIndex];
        for (size_t c = 0; c < 3; c++)
        {
            for (size_t h = 0; h < new_height; h++)
            {
                for (size_t w = 0; w < new_width; w++)
                {
                    float v = ((float)framesimg.at<cv::Vec3b>(h, w)[c]) / 255.0f;
                    v -= 0.45;
                    v /= 0.225;
                    data[c*32*256*455 + fast_index * new_width * new_height + h * new_width + w] = v;
                }
            }
        }
        fast_index++;
        // Slow pathway: only 8 of the 64 frames are kept
        if (frames==0 || frames==8 || frames==16 || frames==26 || frames==34 || frames==44 || frames==52 || frames==63) {
            data = (float*)slowfast_ArrayHostMemory[slow_pathway_InputIndex];
            for (size_t c = 0; c < 3; c++)
            {
                for (size_t h = 0; h < new_height; h++)
                {
                    for (size_t w = 0; w < new_width; w++)
                    {
                        float v = ((float)framesimg.at<cv::Vec3b>(h, w)[c]) / 255.0f;
                        v -= 0.45;
                        v /= 0.225;
                        data[c*8*256*455 + slow_index * new_width * new_height + h * new_width + w] = v;
                    }
                }
            }
            slow_index++;
        }
```
Step 4: roi_align implementation
As described in the previous section, roi_align is not implemented in the TensorRT version used here. torchvision.ops does provide roi_align, so the Python inference code can simply call it, but the C++ code has to implement roi_align itself. The details are not covered here; roughly speaking, roi_align is a crop-and-resize: it extracts the features corresponding to each bbox from the feature map and resizes them to 7x7. Note that this simple single-sample bilinear version follows the older roi_align formulation, so its output may differ slightly from torchvision.ops.roi_align with aligned=True used in the Python pipeline. The code is as follows:

```cpp
void ROIAlignForwardCpu(const float* bottom_data, const float spatial_scale, const int num_rois,
                        const int height, const int width, const int channels,
                        const int aligned_height, const int aligned_width, const float* bottom_rois,
                        float* top_data)
{
    const int output_size = num_rois * aligned_height * aligned_width * channels;

    int idx = 0;
    for (idx = 0; idx < output_size; ++idx)
    {
        int pw = idx % aligned_width;
        int ph = (idx / aligned_width) % aligned_height;
        int c = (idx / aligned_width / aligned_height) % channels;
        int n = idx / aligned_width / aligned_height / channels;

        float roi_batch_ind = 0;
        float roi_start_w = bottom_rois[n * 5 + 1] * spatial_scale;
        float roi_start_h = bottom_rois[n * 5 + 2] * spatial_scale;
        float roi_end_w = bottom_rois[n * 5 + 3] * spatial_scale;
        float roi_end_h = bottom_rois[n * 5 + 4] * spatial_scale;

        float roi_width = fmaxf(roi_end_w - roi_start_w + 1., 0.);
        float roi_height = fmaxf(roi_end_h - roi_start_h + 1., 0.);
        float bin_size_h = roi_height / (aligned_height - 1.);
        float bin_size_w = roi_width / (aligned_width - 1.);

        float h = (float)(ph) * bin_size_h + roi_start_h;
        float w = (float)(pw) * bin_size_w + roi_start_w;

        int hstart = fminf(floor(h), height - 2);
        int wstart = fminf(floor(w), width - 2);

        int img_start = roi_batch_ind * channels * height * width;

        if (h < 0 || h >= height || w < 0 || w >= width)
        {
            top_data[idx] = 0.;
        }
        else
        {
            float h_ratio = h - (float)(hstart);
            float w_ratio = w - (float)(wstart);
            int upleft = img_start + (c * height + hstart) * width + wstart;

            int upright = upleft + 1;
            int downleft = upleft + width;
            int downright = downleft + 1;

            top_data[idx] = bottom_data[upleft] * (1. - h_ratio) * (1. - w_ratio)
                + bottom_data[upright] * (1. - h_ratio) * w_ratio
                + bottom_data[downleft] * h_ratio * (1. - w_ratio)
                + bottom_data[downright] * h_ratio * w_ratio;
        }
    }
}
```
Step 5: inference
First run the body model on the data prepared in Step 3, then use the roi_align function from Step 4 to extract the features corresponding to each bbox from the body outputs, and finally run the head model on the extracted features to obtain the output. The code is as follows:

```cpp
    cudaMemcpyAsync(slowfast_ArrayDevMemory[slow_pathway_InputIndex], slowfast_ArrayHostMemory[slow_pathway_InputIndex], slowfast_ArraySize[slow_pathway_InputIndex], cudaMemcpyHostToDevice, m_CudaStream);
    cudaMemcpyAsync(slowfast_ArrayDevMemory[fast_pathway_InputIndex], slowfast_ArrayHostMemory[fast_pathway_InputIndex], slowfast_ArraySize[fast_pathway_InputIndex], cudaMemcpyHostToDevice, m_CudaStream);
    m_CudaslowfastContext->enqueueV2(slowfast_ArrayDevMemory, m_CudaStream, nullptr);
    cudaMemcpyAsync(slowfast_ArrayHostMemory[slow_pathway_OutputIndex], slowfast_ArrayDevMemory[slow_pathway_OutputIndex], slowfast_ArraySize[slow_pathway_OutputIndex], cudaMemcpyDeviceToHost, m_CudaStream);
    cudaMemcpyAsync(slowfast_ArrayHostMemory[fast_pathway_OutputIndex], slowfast_ArrayDevMemory[fast_pathway_OutputIndex], slowfast_ArraySize[fast_pathway_OutputIndex], cudaMemcpyDeviceToHost, m_CudaStream);
    cudaStreamSynchronize(m_CudaStream);

    data = (float*)slowfast_ArrayHostMemory[fast_pathway_OutputIndex];
    ROIAlignForwardCpu((float*)slowfast_ArrayHostMemory[slow_pathway_OutputIndex], 0.0625, 32, 16, 29, 2048, 7, 7, (float*)boxes_data,
                       (float*)ROIAlign_ArrayHostMemory[0]);
    ROIAlignForwardCpu((float*)slowfast_ArrayHostMemory[fast_pathway_OutputIndex], 0.0625, 32, 16, 29, 256, 7, 7, (float*)boxes_data,
                       (float*)ROIAlign_ArrayHostMemory[1]);
    data = (float*)ROIAlign_ArrayHostMemory[0];
    cudaMemcpyAsync(ROIAlign_ArrayDevMemory[0], ROIAlign_ArrayHostMemory[0], ROIAlign_ArraySize[0], cudaMemcpyHostToDevice, m_CudaStream);
    cudaMemcpyAsync(ROIAlign_ArrayDevMemory[1], ROIAlign_ArrayHostMemory[1], ROIAlign_ArraySize[1], cudaMemcpyHostToDevice, m_CudaStream);
    m_CudaheadContext->enqueueV2(ROIAlign_ArrayDevMemory, m_CudaStream, nullptr);
    cudaMemcpyAsync(ROIAlign_ArrayHostMemory[2], ROIAlign_ArrayDevMemory[2], ROIAlign_ArraySize[2], cudaMemcpyDeviceToHost, m_CudaStream);
    cudaStreamSynchronize(m_CudaStream);
```

References

1. https://blog.csdn.net/y459541195/article/details/126278476
2. https://blog.csdn.net/WhiffeYF/article/details/115581800
3. https://github.com/facebookresearch/SlowFast
