Performance optimizations

Jaeger provides several optional inference backends and precision modes. This page explains how to use them, ordered from least effort to most effort.

All commands below assume Jaeger is installed with GPU support (pip install jaeger-bio[gpu] or equivalent).

Quick comparison

| Optimization | Effort | Speedup | Model size | Best for |
| --- | --- | --- | --- | --- |
| Mixed precision | One flag | 1–1.3× | No change | Any GPU with FP16/BF16 support |
| XLA JIT | One flag | 1.5–3× after warmup | No change | Large datasets with repeated shapes |
| ONNX Runtime | One conversion | 1.5–2× | No change | Reliable cross-platform GPU inference |
| TFLite quantization | One conversion | ~1× (similar to the original model) | ~3.5× smaller | Edge / mobile / low-storage deployments |
| ONNX INT8 | Conversion + quantization | 1–1.5× | ~2.5× smaller | Smallest GPU-deployable model |
| TensorRT (TF-TRT) | Custom TF build | 2–5× | No change | Maximum GPU performance in specialized containers |

1. Mixed precision

Effort: lowest — add one flag to jaeger predict.

Run inference in FP16 or BF16 instead of FP32. This reduces memory traffic and can speed up compute-bound layers on NVIDIA GPUs with tensor cores (Compute Capability ≥ 7.0 for FP16, ≥ 8.0 for BF16).

# FP16 (widely supported)
jaeger predict -i contigs.fasta -o output_dir --precision fp16

# BF16 (Ampere/Ada and newer)
jaeger predict -i contigs.fasta -o output_dir --precision bf16

When to use

You want a quick, low-risk speedup with no preprocessing.
Your GPU supports FP16/BF16 tensor cores.
You do not need a smaller model (the model size stays the same).

Caveats

Very small batches may not see a speedup because the overhead of casting can dominate.
Some older GPUs (pre-Volta) have reduced FP16 throughput.
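The compute-capability thresholds in this section can be turned into a small check. The sketch below is illustrative only: `supported_precisions` is a hypothetical helper, not part of Jaeger, and it assumes you already know your GPU's compute capability as a string such as "8.6".

```python
def supported_precisions(compute_capability):
    """Return the --precision values whose tensor-core thresholds are met.

    Thresholds are the ones quoted in this section:
    FP16 needs Compute Capability >= 7.0, BF16 needs >= 8.0.
    An empty list means the GPU is pre-Volta, so omit --precision.
    """
    major, minor = (int(part) for part in compute_capability.strip().split("."))
    capability = (major, minor)

    options = []
    if capability >= (7, 0):  # Volta and newer
        options.append("fp16")
    if capability >= (8, 0):  # Ampere and newer
        options.append("bf16")
    return options


if __name__ == "__main__":
    print(supported_precisions("8.6"))  # ['fp16', 'bf16']
```

On recent NVIDIA drivers, `nvidia-smi --query-gpu=compute_cap --format=csv,noheader` should print the value to pass in; older drivers may not support that query field.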

2. XLA JIT compilation

Effort: low — add one flag to jaeger predict.

XLA (Accelerated Linear Algebra) JIT-compiles the TensorFlow graph for each unique input shape. After the first compilation, repeated shapes run significantly faster.

jaeger predict -i contigs.fasta -o output_dir --xla

You can combine XLA with mixed precision:

jaeger predict -i contigs.fasta -o output_dir --xla --precision fp16

When to use

Large datasets where many windows have the same length.
Benchmarking / repeated inference on the same file.

Caveats

The first batch for each unique shape is slow (~10–30 s compilation).
For small or highly variable-length datasets, compilation overhead can exceed the speedup.
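Whether compilation overhead pays off can be estimated with simple arithmetic. The sketch below assumes a deliberately simple cost model (one compilation per unique shape, constant per-batch time, constant speedup); the function name and the example numbers are illustrative, not measurements from Jaeger.

```python
import math


def xla_break_even_batches(compile_s, batch_s, speedup, unique_shapes=1):
    """Return the smallest batch count at which XLA is faster overall.

    Cost model (an assumption, not a measurement):
      baseline time = n * batch_s
      XLA time      = unique_shapes * compile_s + n * batch_s / speedup
    """
    if speedup <= 1.0:
        raise ValueError("XLA never pays off unless speedup > 1")
    saved_per_batch = batch_s * (1.0 - 1.0 / speedup)  # seconds saved per batch
    total_compile = unique_shapes * compile_s          # one compilation per shape
    return math.floor(total_compile / saved_per_batch) + 1


if __name__ == "__main__":
    # Illustrative inputs: 30 s compilation, 0.5 s per uncompiled batch, 2x speedup.
    print(xla_break_even_batches(compile_s=30, batch_s=0.5, speedup=2.0))  # 121
```

With these inputs XLA needs 121 batches of the same shape before it comes out ahead; every additional unique shape adds another compilation to the total.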

3. ONNX Runtime

Effort: medium — convert the model once, then run with --onnx.

ONNX Runtime decouples Jaeger from TensorFlow’s execution stack and supports multiple execution providers (TensorRT, CUDA, CPU) without requiring TensorFlow to be built with those backends.

3.1 Install dependencies

pip install onnxruntime-gpu tf2onnx sympy onnx

Important: ONNX Runtime 1.26 requires TensorRT 10. If you have TensorRT 11 installed, downgrade:

pip install tensorrt==10.16.1.11

3.2 Convert the model

jaeger utils convert-graph \
  -m jaeger_57341_1.5M_fragment \
  -o ./optimized \
  --mode onnx

This creates:

optimized/
└── jaeger_57341_1.5M_fragment_onnx/
    ├── jaeger_57341_1.5M_fragment.onnx
    ├── jaeger_57341_1.5M_fragment_classes.yaml
    └── jaeger_57341_1.5M_fragment_project.yaml

3.3 Run inference

jaeger predict -i contigs.fasta -o output_dir \
  -m jaeger_57341_1.5M_fragment \
  --onnx

Provider selection

ONNXEngine automatically picks the best available provider, trying them in this order:

TensorrtExecutionProvider (NVIDIA GPUs)
CUDAExecutionProvider
CPUExecutionProvider

The first inference with TensorRT is slow because it builds the TensorRT engine; subsequent runs with the same shape reuse the cached engine.
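A priority-based choice like this can be sketched in a few lines. This is not Jaeger's actual ONNXEngine code, which may differ; `order_providers` is a hypothetical helper, while `ort.get_available_providers()` is the standard ONNX Runtime call for listing providers.

```python
# Preference order taken from the list in this section.
PREFERRED = (
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
)


def order_providers(available):
    """Keep only the preferred providers that are available, best first."""
    return [name for name in PREFERRED if name in available]


try:
    import onnxruntime as ort
    available = ort.get_available_providers()
except ImportError:
    # Assumption for this sketch: without onnxruntime, pretend only CPU exists.
    available = ["CPUExecutionProvider"]

print(order_providers(available))
```

The resulting list is in the form that `onnxruntime.InferenceSession(model_path, providers=...)` accepts, so ONNX Runtime falls back to the next entry when a provider cannot be initialized.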

When to use

You want 1.5–2× speedup on NVIDIA GPUs without custom TensorFlow builds.
You need a portable model that can also run on CPU.

4. TFLite quantization

Effort: medium — quantize the model once, then run with --quantized.

TFLite is primarily useful for reducing model size for edge or mobile deployment. Speed on desktop GPUs is usually comparable to the original SavedModel.

4.1 Quantize the model

# Dynamic range quantization (recommended)
jaeger utils quantize \
  -m jaeger_57341_1.5M_fragment \
  -o ./quantized \
  --mode dynamic

# Float16 weights
jaeger utils quantize \
  -m jaeger_57341_1.5M_fragment \
  -o ./quantized \
  --mode float16

# Full INT8 (experimental)
jaeger utils quantize \
  -m jaeger_57341_1.5M_fragment \
  -o ./quantized \
  --mode full_int8

This creates:

quantized/
└── jaeger_57341_1.5M_fragment_dynamic/
    ├── jaeger_57341_1.5M_fragment_dynamic.tflite
    └── ... metadata ...

4.2 Run inference

jaeger predict -i contigs.fasta -o output_dir \
  -m jaeger_57341_1.5M_fragment \
  --quantized dynamic

When to use

Model size matters more than speed (e.g., ~6 MB → ~1.6 MB with dynamic).
Edge devices, mobile apps, or low-bandwidth deployments.

Caveats

full_int8 is experimental and can reduce accuracy on real data; dynamic is recommended.
Desktop GPU speed is usually not faster than the original model.
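Since size is the main benefit here, it is worth measuring the reduction on your own files rather than relying on the approximate figures quoted in this section. The helper names below are hypothetical, and the two paths are whatever original and quantized model locations you have on disk; the script makes no assumption about where Jaeger stores its models.

```python
import sys
from pathlib import Path


def size_mb(path):
    """Total size in MB of a file, or of all files under a directory."""
    target = Path(path)
    if target.is_file():
        files = [target]
    else:
        files = [entry for entry in target.rglob("*") if entry.is_file()]
    return sum(entry.stat().st_size for entry in files) / 1e6


def reduction_factor(original, quantized):
    """How many times smaller the quantized model is than the original."""
    return size_mb(original) / size_mb(quantized)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python compare_sizes.py <original_model> <quantized_model>
    original, quantized = sys.argv[1], sys.argv[2]
    print(f"{size_mb(original):.2f} MB -> {size_mb(quantized):.2f} MB "
          f"({reduction_factor(original, quantized):.2f}x smaller)")
```

For the figures quoted in this section, 6 MB down to 1.6 MB corresponds to a factor of 3.75.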

5. ONNX INT8 quantization

Effort: higher — convert to ONNX and then apply static INT8 quantization. Gives the smallest GPU-runnable model.

5.1 Install dependencies

Same as for ONNX Runtime (section 3.1):

pip install onnxruntime-gpu tf2onnx sympy onnx

5.2 Convert and quantize

jaeger utils convert-graph \
  -m jaeger_57341_1.5M_fragment \
  -o ./optimized \
  --mode onnx \
  --int8

This creates:

optimized/
└── jaeger_57341_1.5M_fragment_onnx_int8/
    ├── jaeger_57341_1.5M_fragment_int8.onnx   # quantized
    ├── jaeger_57341_1.5M_fragment.onnx        # original FP32
    └── ... metadata ...

5.3 Run inference

jaeger predict -i contigs.fasta -o output_dir \
  -m jaeger_57341_1.5M_fragment \
  --onnx --int8

When to use

You need the smallest possible model that still runs on GPU (~6 MB → ~2.4 MB).
You want faster-than-SavedModel inference without a custom TF build.

Caveats

INT8 ONNX models use the CUDA execution provider, not TensorRT, because TensorRT’s ONNX parser has strict requirements for quantized subgraphs.
Calibration is performed with synthetic one-hot codon tensors. Accuracy is usually very close to FP32, but you should validate on your target data.
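One simple way to validate on your own data is to run the same input once with the FP32 model and once with `--onnx --int8`, then compare the predicted labels per contig. The sketch below assumes you have already loaded each run into a dict mapping contig ID to predicted label; how you parse Jaeger's output files is left open, and `label_agreement` and the toy data are made up for illustration.

```python
def label_agreement(reference, candidate):
    """Compare two {contig_id: label} mappings.

    Returns (fraction of shared contigs with identical labels,
             sorted list of contig IDs whose labels differ).
    """
    shared = reference.keys() & candidate.keys()
    if not shared:
        raise ValueError("the two runs have no contig IDs in common")
    differing = sorted(cid for cid in shared if reference[cid] != candidate[cid])
    return 1.0 - len(differing) / len(shared), differing


if __name__ == "__main__":
    # Toy, made-up predictions purely to show the call.
    fp32 = {"contig_1": "A", "contig_2": "B", "contig_3": "A", "contig_4": "B"}
    int8 = {"contig_1": "A", "contig_2": "B", "contig_3": "B", "contig_4": "B"}
    agreement, differing = label_agreement(fp32, int8)
    print(agreement, differing)  # 0.75 ['contig_3']
```

Contigs that flip labels are the ones worth inspecting by hand; if many of them flip, the FP32 ONNX model is the safer choice.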

6. TensorRT (TF-TRT)

Effort: highest — requires TensorFlow built with TensorRT support.

Standard pip-installed TensorFlow does not include TensorRT. This path is only practical if you use NVIDIA’s NGC containers or build TensorFlow from source.

6.1 Use an NGC container

docker run --gpus all -it nvcr.io/nvidia/tensorflow:24.10-tf2-py3

Inside the container, install Jaeger and run:

jaeger utils convert-graph \
  -m jaeger_57341_1.5M_fragment \
  -o ./optimized \
  --mode tensorrt

6.2 Easier alternative

For most users, ONNX Runtime with TensorRT is the better path: it does not require a custom TensorFlow build and gives most of the speedup.

When to use

Maximum GPU performance is required.
You already run workloads in NVIDIA NGC containers.

Environment configuration

For ONNX Runtime + TensorRT

If you see errors like:

libnvinfer.so.10: cannot open shared object file
libcudnn.so.9: cannot open shared object file

Install the matching pip packages:

# Required for ONNX Runtime 1.26
pip install tensorrt==10.16.1.11
pip install nvidia-cudnn-cu12  # provides libcudnn.so.9

Jaeger’s ONNXEngine automatically preloads these libraries via ctypes so you usually do not need to set LD_LIBRARY_PATH. If you still have issues, you can export the library paths manually:

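# Resolve site-packages once, then prepend the TensorRT and cuDNN library directories.
# The ${VAR:+...} form avoids a trailing ":" when LD_LIBRARY_PATH is unset.
SITE_PACKAGES="$(python -c 'import site; print(site.getsitepackages()[0])')"
export LD_LIBRARY_PATH="${SITE_PACKAGES}/tensorrt_libs:${SITE_PACKAGES}/nvidia/cudnn/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"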
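To see which libraries a ctypes-based preload could pick up, the following sketch searches the same two directories as the export command and tries to load what it finds. It is not Jaeger's implementation: `find_shared_libs` and `preload` are hypothetical helpers, and the directory names are assumed from the export command in this section.

```python
import ctypes
import site
from pathlib import Path

# Directory names assumed from the LD_LIBRARY_PATH export in this section.
LIB_SUBDIRS = ("tensorrt_libs", "nvidia/cudnn/lib")


def find_shared_libs(roots, subdirs=LIB_SUBDIRS):
    """List shared objects under <root>/<subdir> for every root and subdir."""
    found = []
    for root in roots:
        for subdir in subdirs:
            directory = Path(root, subdir)
            if directory.is_dir():
                found.extend(sorted(directory.glob("*.so*")))
    return found


def preload(paths):
    """Try to load each library globally; return the ones that loaded."""
    loaded = []
    for path in paths:
        try:
            ctypes.CDLL(str(path), mode=ctypes.RTLD_GLOBAL)
            loaded.append(path)
        except OSError:
            # Unloadable file or unresolved dependency: skip it in this sketch.
            pass
    return loaded


if __name__ == "__main__":
    candidates = find_shared_libs(site.getsitepackages())
    print(f"found {len(candidates)} libraries, loaded {len(preload(candidates))}")
```

If the script finds no libraries at all, the pip packages are probably installed into a different environment than the one running Jaeger.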

Verify providers

python -c "import onnxruntime as ort; print(ort.get_available_providers())"

Expected output on a working NVIDIA system:

['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']

Choosing an optimization

| Situation | Recommended option |
| --- | --- |
| Just want a quick speedup | `--precision fp16` |
| Large dataset, repeated shapes | `--xla --precision fp16` |
| Reliable GPU speedup without custom builds | Convert to ONNX, then `--onnx` |
| Need smallest model on GPU | Convert to ONNX INT8, then `--onnx --int8` |
| Need smallest model for edge/mobile | `jaeger utils quantize --mode dynamic`, then `--quantized dynamic` |
| Maximum performance in NGC containers | `jaeger utils convert-graph --mode tensorrt` |