Performance optimizations¶
Jaeger provides several optional inference backends and precision modes. This page explains how to use them, ordered from least effort to most effort.
All commands below assume Jaeger is installed with GPU support (pip install "jaeger-bio[gpu]" or equivalent; the quotes keep shells such as zsh from expanding the brackets).
Quick comparison¶
| Optimization | Effort | Speedup | Model size | Best for |
|---|---|---|---|---|
| Mixed precision (--precision) | One flag | 1–1.3× | No change | Any GPU with FP16/BF16 support |
| XLA JIT (--xla) | One flag | 1.5–3× after warmup | No change | Large datasets with repeated shapes |
| ONNX Runtime (--onnx) | One conversion | 1.5–2× | No change | Reliable cross-platform GPU inference |
| TFLite quantization (--quantized) | One conversion | Similar | ~3.5× smaller | Edge / mobile / low-storage deployments |
| ONNX INT8 (--onnx --int8) | Conversion + quantization | 1–1.5× | ~2.5× smaller | Smallest GPU-deployable model |
| TensorRT (TF-TRT) | Custom TF build | 2–5× | No change | Maximum GPU performance in specialized containers |
1. Mixed precision¶
Effort: lowest — add one flag to jaeger predict.
Run inference with FP16 or BF16 instead of FP32. This reduces memory bandwidth and can speed up math-bound layers on modern NVIDIA GPUs (Compute Capability ≥ 7.0 for FP16, ≥ 8.0 for BF16).
# FP16 (widely supported)
jaeger predict -i contigs.fasta -o output_dir --precision fp16
# BF16 (Ampere/Ada and newer)
jaeger predict -i contigs.fasta -o output_dir --precision bf16
When to use¶
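You want a quick, low-risk speedup with no preprocessing (reduced precision can change scores slightly, so spot-check results if exact reproducibility matters).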
Your GPU supports FP16/BF16 tensor cores.
You are not memory-limited (model size stays the same).
Caveats¶
Very small batches may not see a speedup because the overhead of casting can dominate.
Some older GPUs (pre-Volta) have reduced FP16 throughput.
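The compute-capability thresholds above can be turned into a small pre-flight check. The sketch below is a hypothetical helper, not part of Jaeger: it maps a compute capability string (for example as reported by nvidia-smi on recent drivers) to the arguments for --precision. Preferring BF16 wherever it is available is an assumption of this sketch; FP16 also works on those GPUs.

```python
def precision_args(compute_capability: str) -> list[str]:
    """Map a CUDA compute capability such as "8.6" to jaeger predict arguments.

    Hypothetical helper: returns an empty list (keep the default FP32 path)
    when the GPU is older than the thresholds quoted in the text.
    """
    major, minor = (int(part) for part in compute_capability.strip().split("."))
    if (major, minor) >= (8, 0):  # BF16 needs Compute Capability >= 8.0
        return ["--precision", "bf16"]
    if (major, minor) >= (7, 0):  # FP16 needs Compute Capability >= 7.0
        return ["--precision", "fp16"]
    return []  # pre-Volta: omit the flag


if __name__ == "__main__":
    base = ["jaeger", "predict", "-i", "contigs.fasta", "-o", "output_dir"]
    print(" ".join(base + precision_args("8.6")))
```

The returned list can be appended to a command passed to subprocess.run, which keeps the precision choice out of hand-edited shell scripts.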
2. XLA JIT compilation¶
Effort: low — add one flag to jaeger predict.
XLA (Accelerated Linear Algebra) JIT-compiles the TensorFlow graph for each unique input shape. After the first compilation, repeated shapes run significantly faster.
jaeger predict -i contigs.fasta -o output_dir --xla
You can combine XLA with mixed precision:
jaeger predict -i contigs.fasta -o output_dir --xla --precision fp16
When to use¶
Large datasets where many windows have the same length.
Benchmarking / repeated inference on the same file.
Caveats¶
The first batch for each unique shape is slow (~10–30 s compilation).
For small or highly variable-length datasets, compilation overhead can exceed the speedup.
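Whether compilation overhead exceeds the speedup can be estimated before a run. The following is a rough cost model, not a measurement: it assumes one compilation per unique input shape, a fixed per-batch time, and a constant speedup once compiled. The default values (2× speedup, 20 s per compilation) are illustrative picks from the ranges quoted on this page.

```python
def xla_total_time(n_batches, unique_shapes, batch_time_s, speedup, compile_s):
    """Estimated wall time with XLA: one compile per unique shape, then faster batches."""
    return unique_shapes * compile_s + n_batches * batch_time_s / speedup


def xla_pays_off(n_batches, unique_shapes, batch_time_s, speedup=2.0, compile_s=20.0):
    """True when the estimated XLA run beats the uncompiled baseline."""
    baseline = n_batches * batch_time_s
    with_xla = xla_total_time(n_batches, unique_shapes, batch_time_s, speedup, compile_s)
    return with_xla < baseline


if __name__ == "__main__":
    # Many batches, few shapes: about 500 s baseline vs 4*20 + 250 = 330 s with XLA.
    print(xla_pays_off(n_batches=10_000, unique_shapes=4, batch_time_s=0.05))
    # Few batches, many shapes: about 10 s baseline vs 10*20 + 5 = 205 s with XLA.
    print(xla_pays_off(n_batches=200, unique_shapes=10, batch_time_s=0.05))
```

Replace batch_time_s with a timing from a short run without --xla to get an estimate for your own data.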
3. ONNX Runtime¶
Effort: medium — convert the model once, then run with --onnx.
ONNX Runtime decouples Jaeger from TensorFlow’s execution stack and supports multiple execution providers (TensorRT, CUDA, CPU) without requiring TensorFlow to be built with those backends.
3.1 Install dependencies¶
pip install onnxruntime-gpu tf2onnx sympy onnx
Important: ONNX Runtime 1.26 requires TensorRT 10. If you have TensorRT 11 installed, downgrade:
pip install tensorrt==10.16.1.11
3.2 Convert the model¶
jaeger utils convert-graph \
-m jaeger_57341_1.5M_fragment \
-o ./optimized \
--mode onnx
This creates:
optimized/
└── jaeger_57341_1.5M_fragment_onnx/
├── jaeger_57341_1.5M_fragment.onnx
├── jaeger_57341_1.5M_fragment_classes.yaml
└── jaeger_57341_1.5M_fragment_project.yaml
3.3 Run inference¶
jaeger predict -i contigs.fasta -o output_dir \
-m jaeger_57341_1.5M_fragment \
--onnx
Provider selection¶
ONNXEngine automatically picks the best available provider:
TensorrtExecutionProvider (NVIDIA GPUs)
CUDAExecutionProvider
CPUExecutionProvider
The first inference with TensorRT is slow because it builds the TensorRT engine; subsequent runs with the same shape reuse the cached engine.
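The fallback behaviour described above can be illustrated with a few lines of Python. This is a sketch of priority-based selection under the assumption that the providers are tried in the order listed; it is not the actual ONNXEngine code, and PREFERRED_PROVIDERS and both function names are hypothetical.

```python
PREFERRED_PROVIDERS = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]


def ordered_providers(available):
    """Keep only the providers that are available, in order of preference."""
    return [name for name in PREFERRED_PROVIDERS if name in available]


def pick_provider(available):
    """Return the most preferred available provider, or fail loudly."""
    candidates = ordered_providers(available)
    if not candidates:
        raise RuntimeError("no supported ONNX Runtime execution provider found")
    return candidates[0]


if __name__ == "__main__":
    # A machine with CUDA but no TensorRT libraries falls back to CUDA.
    print(pick_provider(["CUDAExecutionProvider", "CPUExecutionProvider"]))
```

With onnxruntime installed, the list from ort.get_available_providers() can be passed to ordered_providers, and the result handed to ort.InferenceSession(model_path, providers=...) when running an exported model outside Jaeger.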
When to use¶
You want a 1.5–2× speedup on NVIDIA GPUs without custom TensorFlow builds.
You need a portable model that can also run on CPU.
4. TFLite quantization¶
Effort: medium — quantize the model once, then run with --quantized.
TFLite is primarily useful for reducing model size for edge or mobile deployment. Speed on desktop GPUs is usually comparable to the original SavedModel.
4.1 Quantize the model¶
# Dynamic range quantization (recommended)
jaeger utils quantize \
-m jaeger_57341_1.5M_fragment \
-o ./quantized \
--mode dynamic
# Float16 weights
jaeger utils quantize \
-m jaeger_57341_1.5M_fragment \
-o ./quantized \
--mode float16
# Full INT8 (experimental)
jaeger utils quantize \
-m jaeger_57341_1.5M_fragment \
-o ./quantized \
--mode full_int8
This creates:
quantized/
└── jaeger_57341_1.5M_fragment_dynamic/
├── jaeger_57341_1.5M_fragment_dynamic.tflite
└── ... metadata ...
4.2 Run inference¶
jaeger predict -i contigs.fasta -o output_dir \
-m jaeger_57341_1.5M_fragment \
--quantized dynamic
When to use¶
Model size matters more than speed (e.g., ~6 MB → ~1.6 MB with dynamic).
Edge devices, mobile apps, or low-bandwidth deployments.
Caveats¶
full_int8 is experimental and can reduce accuracy on real data; dynamic is recommended.
Desktop GPU speed is usually not faster than the original model.
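Because the main benefit here is model size, it is worth measuring the reduction on disk rather than relying on the approximate figures. The helper below is a minimal sketch using only the standard library; the paths you pass in are your own, since the location of the original model is not specified on this page.

```python
from pathlib import Path


def dir_size_mb(path):
    """Total size in MB (10^6 bytes) of a file, or of all files under a directory."""
    root = Path(path)
    files = [root] if root.is_file() else [p for p in root.rglob("*") if p.is_file()]
    return sum(p.stat().st_size for p in files) / 1e6


def size_ratio(original, quantized):
    """How many times smaller the quantized model is than the original."""
    quantized_mb = dir_size_mb(quantized)
    if quantized_mb == 0:
        raise ValueError(f"no files found under {quantized}")
    return dir_size_mb(original) / quantized_mb
```

For the sizes quoted above, 6 MB / 1.6 MB gives a ratio of about 3.75.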
5. ONNX INT8 quantization¶
Effort: higher — convert to ONNX and then apply static INT8 quantization. Gives the smallest GPU-runnable model.
5.1 Install dependencies¶
Same as ONNX Runtime:
pip install onnxruntime-gpu tf2onnx sympy onnx
5.2 Convert and quantize¶
jaeger utils convert-graph \
-m jaeger_57341_1.5M_fragment \
-o ./optimized \
--mode onnx \
--int8
This creates:
optimized/
└── jaeger_57341_1.5M_fragment_onnx_int8/
├── jaeger_57341_1.5M_fragment_int8.onnx # quantized
├── jaeger_57341_1.5M_fragment.onnx # original FP32
└── ... metadata ...
5.3 Run inference¶
jaeger predict -i contigs.fasta -o output_dir \
-m jaeger_57341_1.5M_fragment \
--onnx --int8
When to use¶
You need the smallest possible model that still runs on GPU (~6 MB → ~2.4 MB).
You want faster-than-SavedModel inference without a custom TF build.
Caveats¶
INT8 ONNX models use the CUDA execution provider, not TensorRT, because TensorRT’s ONNX parser has strict requirements for quantized subgraphs.
Calibration is performed with synthetic one-hot codon tensors. Accuracy is usually very close to FP32, but you should validate on your target data.
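One simple way to validate on target data is to run the same input through the FP32 and INT8 models and compare the predicted class per contig. The sketch below assumes you have already extracted two equally ordered lists of labels from the two runs; how to read them from Jaeger's output files is not covered here, and the function names and labels are hypothetical.

```python
def agreement(reference_labels, candidate_labels):
    """Fraction of items on which the two runs predict the same class."""
    if len(reference_labels) != len(candidate_labels):
        raise ValueError("both runs must cover the same contigs in the same order")
    if not reference_labels:
        raise ValueError("no predictions to compare")
    matches = sum(r == c for r, c in zip(reference_labels, candidate_labels))
    return matches / len(reference_labels)


def disagreements(ids, reference_labels, candidate_labels):
    """IDs of the items whose predicted class differs between the two runs."""
    return [i for i, r, c in zip(ids, reference_labels, candidate_labels) if r != c]


if __name__ == "__main__":
    fp32 = ["class_a", "class_b", "class_a", "class_a"]
    int8 = ["class_a", "class_b", "class_b", "class_a"]
    print(agreement(fp32, int8))
    print(disagreements(["c1", "c2", "c3", "c4"], fp32, int8))
```

Inspecting the disagreeing contigs directly is usually more informative than the aggregate rate, because quantization errors tend to concentrate on borderline scores.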
6. TensorRT (TF-TRT)¶
Effort: highest — requires TensorFlow built with TensorRT support.
Standard pip-installed TensorFlow does not include TensorRT. This path is only practical if you use NVIDIA’s NGC containers or build TensorFlow from source.
6.1 Use an NGC container¶
docker run --gpus all -it nvcr.io/nvidia/tensorflow:24.10-tf2-py3
Inside the container, install Jaeger and run:
jaeger utils convert-graph \
-m jaeger_57341_1.5M_fragment \
-o ./optimized \
--mode tensorrt
6.2 Easier alternative¶
For most users, ONNX Runtime with TensorRT is the better path: it does not require a custom TensorFlow build and gives most of the speedup.
When to use¶
Maximum GPU performance is required.
You already run workloads in NVIDIA NGC containers.
Environment configuration¶
For ONNX Runtime + TensorRT¶
If you see errors like:
libnvinfer.so.10: cannot open shared object file
libcudnn.so.9: cannot open shared object file
Install the matching pip packages:
# Required for ONNX Runtime 1.26
pip install tensorrt==10.16.1.11
pip install nvidia-cudnn-cu12 # provides libcudnn.so.9
Jaeger’s ONNXEngine automatically preloads these libraries via ctypes, so you usually do not need to set LD_LIBRARY_PATH.
If you still have issues, you can export the library paths manually:
SITE_PACKAGES="$(python -c 'import site; print(site.getsitepackages()[0])')"
export LD_LIBRARY_PATH="${SITE_PACKAGES}/tensorrt_libs:${SITE_PACKAGES}/nvidia/cudnn/lib:${LD_LIBRARY_PATH}"
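To see what preloading via ctypes amounts to, and to check whether the pip-installed libraries are where the export above expects them, a standalone script can help. This is an illustrative sketch of the general technique, not Jaeger's implementation; the function names are hypothetical, and the directory and library names are taken from the error messages and paths shown above.

```python
import ctypes
import site
from pathlib import Path

LIB_SUBDIRS = ("tensorrt_libs", "nvidia/cudnn/lib")
LIB_NAMES = ("libnvinfer.so.10", "libcudnn.so.9")


def candidate_lib_dirs(site_packages=None):
    """Directories under site-packages where the pip wheels place the shared libraries."""
    base = Path(site_packages or site.getsitepackages()[0])
    return [base / sub for sub in LIB_SUBDIRS]


def preload(lib_names=LIB_NAMES, site_packages=None):
    """Load each library that exists on disk; return the paths that loaded successfully."""
    loaded = []
    for directory in candidate_lib_dirs(site_packages):
        for name in lib_names:
            candidate = directory / name
            if not candidate.is_file():
                continue
            try:
                # RTLD_GLOBAL makes the symbols visible to libraries loaded later.
                ctypes.CDLL(str(candidate), mode=ctypes.RTLD_GLOBAL)
            except OSError:
                continue  # present but not loadable, e.g. a missing dependency
            loaded.append(candidate)
    return loaded


if __name__ == "__main__":
    for directory in candidate_lib_dirs():
        print(directory, "exists" if directory.is_dir() else "missing")
    print("preloaded:", [str(p) for p in preload()])
```

If a directory is reported as missing, the corresponding pip package is probably not installed in the active environment.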
Verify providers¶
python -c "import onnxruntime as ort; print(ort.get_available_providers())"
Expected output on a working NVIDIA system:
['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
Choosing an optimization¶
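| Situation | Recommended option |
|---|---|
| Just want a quick speedup | --precision fp16 (or bf16) |
| Large dataset, repeated shapes | --xla |
| Reliable GPU speedup without custom builds | Convert to ONNX, then --onnx |
| Need smallest model on GPU | Convert to ONNX INT8, then --onnx --int8 |
| Need smallest model for edge/mobile | jaeger utils quantize, then --quantized dynamic |
| Maximum performance in NGC containers | jaeger utils convert-graph --mode tensorrt |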