小说阅读网,魔天记忘语小说,盗墓笔记全集

本文將介紹基于OpenVINO的異步推理隊列類 AyncInferQueue，啟動多個(＞2)推理請求(infer request)，幫助讀者在硬件投入不變的情況下，進一步提升 AI 推理程序的吞吐量(Throughput)。

在閱讀本文前，請讀者先了解使用 start_async() 和 wait() 方法實現基于2個推理請求的異步推理實現方式。該異步推理實現方式相對于同步推理方式，極大提升了 AI 推理程序的吞吐量，但從任務管理器中可以看到，AI 推理硬件的利用率還有很大的提升空間。

這意味著，AI 推理硬件還有潛力可挖，可以通過進一步提高推理請求個數來提升 AI 推理硬件的利用率，從而提高 AI 推理程序的吞吐量。

1.1

推理請求(InferRequest)和流(stream)

OpenVINO 運行時(Runtime)用推理請求(infer request)來抽象在指定計算設備上運行已編譯模型(Compi led_Model)。從編寫程序的角度看，推理請求是一個類，封裝了支持推理請求以同步或異步方式運行的屬性和方法。

推理請求(InferRequest)類的詳細定義參考：

https://github.com/openvinotoolkit/openvino/blob/master/src/inference/include/openvino/runtime/infer_request.hpp#L34

推理請求的個數，由開發(fā)者定義；但計算設備能并行處理的推理請求個數，由硬件本身的處理單元(Processing Unit)決定。超過計算硬件并行處理數量的推理請求，會被計算硬件用隊列儲存起來，當計算硬件空閑后，隊列中的推理請求將被依次取出并執(zhí)行。

OpenVINO用流(stream)來抽象計算設備能并行處理推理請求的能力，通過屬性：“NUM_STREAMS”，可以獲取延遲優(yōu)先或吞吐量優(yōu)先模式下的計算硬件支持的最優(yōu)streams數量，如下表所示。

:上述數據在蝰蛇峽谷上測得，CPU=i7-12700H, 集成顯卡=Iris Xe, 獨立顯卡=A770m

: GPU設備沒有INFERENCE_NUM_THREADS屬性

上述數據測試的源代碼如下，歡迎各位讀者在自己的硬件平臺上測試：

from openvino.runtime import Core, get_version
core = Core()
print(get_version())
print(core.available_devices)
device = device = ['GPU.0', 'GPU.1', 'CPU', 'AUTO', 'AUTO:GPU,-CPU'][0]
cfgs = {}
cfgs['PERFORMANCE_HINT'] = ['THROUGHPUT', 'LATENCY', 'CUMULATIVE_THROUGHPUT'][0]
net = core.compile_model("model.onnx",device,cfgs)
# Get Supported properties
supported_properties = net.get_property('SUPPORTED_PROPERTIES')
print(f'Support properties for {device}:', supported_properties)
opt_nireq = net.get_property('OPTIMAL_NUMBER_OF_INFER_REQUESTS')
print(f'OPTIMAL_NUMBER_OF_INFER_REQUESTS for {device}:', opt_nireq)
nstreams = net.get_property('NUM_STREAMS')
print(f'nstreams for {device}:', nstreams)
performance_hint_num_requests = net.get_property('PERFORMANCE_HINT_NUM_REQUESTS')
print(f'performance_hint_num_requests for {device}:', performance_hint_num_requests)
if device == "CPU":
  # INFERENCE_NUM_THREADS
  inference_num_threads = net.get_property('INFERENCE_NUM_THREADS')
  print(f'inference_num_threads for {device}:', inference_num_threads)
else:
  gpu_queue_priority = net.get_property('GPU_QUEUE_PRIORITY')
print(f'GPU queue priority for {device}:', gpu_queue_priority)

向右滑動查看完整代碼

1.1.1

CPU 的流與推理請求

對于 CPU 來說，一個流(stream)只能服務一個推理請求。通過屬性ov::range_for_streams，可以查到 CPU 支持的流數量的范圍；流的數量無需開發(fā)者使用代碼顯示設置，OpenVINO 運行時會根據延遲優(yōu)先或吞吐量優(yōu)先來自動設置。

1.1.2

GPU 的流與推理請求

對于 GPU 來說，一個流(stream)可以同時服務兩個推理請求。通過屬性 ov::range_for_streams，可以查到 GPU 支持的流數量的范圍：[1, 2]；流的數量無需開發(fā)者使用代碼顯示設置，OpenVINO 運行時會根據延遲優(yōu)先或吞吐量優(yōu)先來自動設置。

參考代碼:

https://www.jianshu.com/p/1748444e6a50

1.2

AsyncInferQueue類

OpenVINO運行時(Runtime)提供 AsyncInferQueue 類來抽象并管理異步推理請求池，其常用方法和屬性有：

__init__(self, compiled_model, jobs = 0)：創(chuàng)建AsyncInferQueue對象

set_callback(func_name)：為推理請求池中所有的推理請求設置統一的回調函數

start_async(inputs, userdata = None)：異步啟動推理請求

wait_all()：等待所有的推理請求執(zhí)行完畢

1.2.1

基于AsyncInferQueue類的

異步推理范例程序

基于 AsyncInferQueue 類 YOLOv5 模型的異步推理范例程序的核心代碼部分如下所示：

完整范例代碼請下載：yolov5_async_infer_queue.py

https://gitee.com/ppov-nuc/yolov5_infer/blob/main/yolov5_async_infer_queue.py

運行代碼前，請參考運行環(huán)境搭建流程。

...
def preprocess(frame):
  # Preprocess the frame
  letterbox_im, _, _= letterbox(frame, auto=False) # preprocess frame by letterbox
  im = letterbox_im.transpose((2, 0, 1))[::-1] # HWC to CHW, BGR to RGB
  im = np.float32(im) / 255.0  # 0 - 255 to 0.0 - 1.0
  blob = im[None] # expand for batch dim
  return blob, letterbox_im.shape[:-1], frame.shape[:-1]
def postprocess(ireq: InferRequest, user_data: tuple):
  result = ireq.results[ireq.model_outputs[0]]
  dets = non_max_suppression(torch.tensor(result))[0].numpy()
  bboxes, scores, class_ids= dets[:,:4], dets[:,4], dets[:,5]
  # rescale the coordinates
  bboxes = scale_coords(user_data[1], bboxes, user_data[2]).astype(int)
  print(user_data[0],"	"+f"{ireq.latency:.3f}"+"	", class_ids)
  return 
# Step1：Initialize OpenVINO Runtime Core
core = Core()
# Step2: Build compiled model
device = device = ['GPU.0', 'GPU.1', 'CPU', 'AUTO', 'AUTO:GPU,-CPU'][0]
cfgs = {}
cfgs['PERFORMANCE_HINT'] = ['THROUGHPUT', 'LATENCY', 'CUMULATIVE_THROUGHPUT'][0]
net = core.compile_model("yolov5s.xml",device,cfgs)
output_node = net.outputs[0]
b,n,input_h,input_w = net.inputs[0].shape
# Step3: Initialize InferQueue
ireqs = AsyncInferQueue(net)
print('Number of infer requests in InferQueue:', len(ireqs))
# Step3.1: Set unified callback on all InferRequests from queue's pool
ireqs.set_callback(postprocess)
# Step4: Read the images
image_folder = "./data/images/"
image_files= os.listdir(image_folder)
print(image_files)
frames = []
for image_file in image_files:
  frame = cv2.imread(os.path.join(image_folder, image_file))
  frames.append(frame)
# 4.1 Warm up
for id, _ in enumerate(ireqs):
  # Preprocess the frame
  start = perf_counter()
  blob, letterbox_shape, frame_shape = preprocess(frames[id % 4])
  end = perf_counter()
  print(f"Preprocess {id}: {(end-start):.4f}.")
  # Run asynchronous inference using the next available InferRequest from the pool
  ireqs.start_async({0:blob},(id, letterbox_shape, frame_shape))
ireqs.wait_all()
# Step5: Benchmark the Async Infer
start = perf_counter()
in_fly = set()
latencies = []
niter = 16
for i in range(niter):
  # Preprocess the frame
  blob, letterbox_shape, frame_shape = preprocess(frames[i % 4]) 
  idle_id = ireqs.get_idle_request_id()
  if idle_id in in_fly:
    latencies.append(ireqs[idle_id].latency)
  else:
    in_fly.add(idle_id)
  # Run asynchronous inference using the next available InferRequest from the pool 
  ireqs.start_async({0:blob},(i, letterbox_shape, frame_shape) )
ireqs.wait_all()

向右滑動查看完整代碼

運行結果如下所示，與基于單個推理請求的start_async()+wait()實現方式相比，基于 AsyncInferQueue 類的 YOLOv5 模型的異步推理程序的吞吐量明顯得到提升。