OpenVINO™ Runtime Python API Advanced Inference#
Warning
All methods described here are highly dependent on the specific hardware and software setup. Consider conducting your own experiments with various models and different input/output sizes. The methods presented here are not universal; they may or may not apply to your specific pipeline. Consider all trade-offs and avoid premature optimizations.
Direct Inference with CompiledModel#
The CompiledModel class provides the __call__ method that runs a single synchronous inference using the given model. In addition to compact code, all subsequent calls to CompiledModel.__call__ result in less overhead, as the object reuses the already created InferRequest.
# Calling CompiledModel creates and saves InferRequest object
results_0 = compiled_model({"input_0": data_0, "input_1": data_1})
# Second call reuses previously created InferRequest object
results_1 = compiled_model({"input_0": data_2, "input_1": data_3})
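For context, a minimal sketch of the equivalent explicit flow; CompiledModel.__call__ does essentially this on your behalf. The model file name, device, input names, and input shapes below are hypothetical placeholders:
import numpy as np
import openvino as ov

core = ov.Core()
# "model.xml" and "CPU" are placeholders for your own model and device
compiled_model = core.compile_model("model.xml", "CPU")

# Hypothetical inputs; shapes and names depend on the model
data_0 = np.zeros((1, 3, 224, 224), dtype=np.float32)
data_1 = np.zeros((1, 3, 224, 224), dtype=np.float32)

# Create the InferRequest once and reuse it for every call
request = compiled_model.create_infer_request()
results_0 = request.infer({"input_0": data_0, "input_1": data_1})
results_1 = request.infer({"input_0": data_0, "input_1": data_1})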
Hiding Latency with Asynchronous Calls#
Asynchronous calls make it possible to hide latency and optimize the overall runtime of a codebase. For example, InferRequest.start_async releases the GIL and provides a non-blocking call. It is beneficial to process other tasks while waiting for compute-intensive inference to finish.
Example usage:
import time

time_in_sec = 2  # arbitrary duration, assumed for illustration

# Long-running function
def run(time_in_sec):
    time.sleep(time_in_sec)

# No latency hiding: the code waits for inference, then runs the function
results = request.infer({"input_0": data_0, "input_1": data_1})[0]
run(time_in_sec)

# Hiding latency: the function runs while inference is in flight
request.start_async({"input_0": data_0, "input_1": data_1})
run(time_in_sec)
request.wait()
results = request.get_output_tensor(0).data  # Gather data from InferRequest
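To verify the effect, both flows can be timed. A quick check, continuing from the snippet above:
start = time.perf_counter()
results = request.infer({"input_0": data_0, "input_1": data_1})[0]
run(time_in_sec)
print(f"No latency hiding: {time.perf_counter() - start:.2f} s")  # ~inference + sleep

start = time.perf_counter()
request.start_async({"input_0": data_0, "input_1": data_1})
run(time_in_sec)  # time.sleep releases the GIL, so it overlaps with inference
request.wait()
print(f"Latency hiding: {time.perf_counter() - start:.2f} s")  # ~max(inference, sleep)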
Note
It is up to the user/developer to optimize the flow in a codebase to benefit from potential parallelization.
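Beyond a single request, the same idea scales to many inputs with AsyncInferQueue, which manages a pool of InferRequest objects. A minimal sketch, assuming compiled_model from above and a hypothetical list of input pairs called batches:
import openvino as ov

# Pool of 4 InferRequest objects; the optimal number is device-specific
infer_queue = ov.AsyncInferQueue(compiled_model, 4)
results = {}

def on_done(request, job_id):
    # Copy the output, as the request's buffers are reused for later jobs
    results[job_id] = request.get_output_tensor(0).data.copy()

infer_queue.set_callback(on_done)
for i, (d0, d1) in enumerate(batches):  # `batches` is hypothetical
    infer_queue.start_async({"input_0": d0, "input_1": d1}, userdata=i)
infer_queue.wait_all()  # Block until every queued inference has finished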
Postponed Return with Asynchronous Calls#
Postponed Return is a practice to reduce the overhead of returning OVDict from synchronous calls. Postponed Return is useful in the following cases:

- only a part of the output data is required. For example, only one specific output is significant in a given pipeline step, and all outputs are large and expensive to copy.
- data is not required immediately. For example, it can be extracted later inside the pipeline as a part of latency hiding.
- data return is not required at all. For example, models are being chained with the pure Tensor interface (see the sketch after the code below).
# Standard approach
results = request.infer({"input_0": data_0, "input_1": data_1})[0]
# "Postponed Return" approach
request.start_async({"input_0": data_0, "input_1": data_1})
request.wait()
results = request.get_output_tensor(0).data # Gather data "on demand" from InferRequest
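As an illustration of the last case, one model's output Tensor can feed another model directly, with no OVDict built and no copy into a NumPy array in between. The two compiled models and their input names below are hypothetical:
request_a = compiled_model_a.create_infer_request()
request_b = compiled_model_b.create_infer_request()

request_a.start_async({"input_0": data_0, "input_1": data_1})
request_a.wait()

# Pass the output Tensor as-is; no results dictionary is returned
intermediate = request_a.get_output_tensor(0)
request_b.start_async({"input_0": intermediate})
request_b.wait()
final = request_b.get_output_tensor(0).data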