# Performance optimizations

Jaeger provides several optional inference backends and precision modes. This page explains how to use them, ordered from **least effort** to **most effort**.

All commands below assume Jaeger is installed with GPU support (`pip install jaeger-bio[gpu]` or equivalent).

---

## Quick comparison

| Optimization | Effort | Speedup | Model size | Best for |
|--------------|--------|---------|------------|----------|
| [Mixed precision](#1-mixed-precision) | One flag | 1–1.3× | No change | Any GPU with FP16/BF16 support |
| [XLA JIT](#2-xla-jit-compilation) | One flag | 1.5–3× after warmup | No change | Large datasets with repeated shapes |
| [ONNX Runtime](#3-onnx-runtime) | One conversion | 1.5–2× | No change | Reliable cross-platform GPU inference |
| [TFLite quantization](#4-tflite-quantization) | One conversion | Similar | ~3.5× smaller | Edge / mobile / low-storage deployments |
| [ONNX INT8](#5-onnx-int8-quantization) | Conversion + quantization | 1–1.5× | ~2.5× smaller | Smallest GPU-deployable model |
| [TensorRT (TF-TRT)](#6-tensorrt-tf-trt) | Custom TF build | 2–5× | No change | Maximum GPU performance in specialized containers |

---

## 1. Mixed precision

**Effort:** lowest. Add one flag to `jaeger predict`.

Run inference with FP16 or BF16 instead of FP32. This reduces memory bandwidth and can speed up math-bound layers on modern NVIDIA GPUs (Compute Capability ≥ 7.0 for FP16, ≥ 8.0 for BF16).

```bash
# FP16 (widely supported)
jaeger predict -i contigs.fasta -o output_dir --precision fp16

# BF16 (Ampere/Ada and newer)
jaeger predict -i contigs.fasta -o output_dir --precision bf16
```

### When to use

- You want a quick, low-risk speedup with no preprocessing.
- Your GPU supports FP16/BF16 tensor cores.
- You are not memory-limited (model size stays the same).

### Caveats

- Very small batches may not see a speedup because the overhead of casting can dominate.
- Some older GPUs (pre-Volta) have reduced FP16 throughput.

---

## 2. XLA JIT compilation

**Effort:** low. Add one flag to `jaeger predict`.

XLA (Accelerated Linear Algebra) JIT-compiles the TensorFlow graph for each unique input shape. After the first compilation, repeated shapes run significantly faster.

```bash
jaeger predict -i contigs.fasta -o output_dir --xla
```

You can combine XLA with mixed precision:

```bash
jaeger predict -i contigs.fasta -o output_dir --xla --precision fp16
```

### When to use

- Large datasets where many windows have the same length.
- Benchmarking / repeated inference on the same file.

### Caveats

- The first batch for each unique shape is slow (~10–30 s compilation).
- For small or highly variable-length datasets, compilation overhead can exceed the speedup.

---

## 3. ONNX Runtime

**Effort:** medium. Convert the model once, then run with `--onnx`.

ONNX Runtime decouples Jaeger from TensorFlow's execution stack and supports multiple execution providers (TensorRT, CUDA, CPU) without requiring TensorFlow to be built with those backends.

### 3.1 Install dependencies

```bash
pip install onnxruntime-gpu tf2onnx sympy onnx
```

**Important:** ONNX Runtime 1.26 requires **TensorRT 10**. If you have TensorRT 11 installed, downgrade:

```bash
pip install tensorrt==10.16.1.11
```

### 3.2 Convert the model

```bash
jaeger utils convert-graph \
    -m jaeger_57341_1.5M_fragment \
    -o ./optimized \
    --mode onnx
```

This creates:

```
optimized/
└── jaeger_57341_1.5M_fragment_onnx/
    ├── jaeger_57341_1.5M_fragment.onnx
    ├── jaeger_57341_1.5M_fragment_classes.yaml
    └── jaeger_57341_1.5M_fragment_project.yaml
```

### 3.3 Run inference

```bash
jaeger predict -i contigs.fasta -o output_dir \
    -m jaeger_57341_1.5M_fragment \
    --onnx
```

### Provider selection

`ONNXEngine` automatically picks the best available provider, in this order:

1. `TensorrtExecutionProvider` (NVIDIA GPUs)
2. `CUDAExecutionProvider`
3. `CPUExecutionProvider`

The first inference with TensorRT is slow because it builds the TensorRT engine; subsequent runs with the same shape reuse the cached engine.

### When to use

- You want 1.5–2× speedup on NVIDIA GPUs without custom TensorFlow builds.
- You need a portable model that can also run on CPU.

---

## 4. TFLite quantization

**Effort:** medium. Quantize the model once, then run with `--quantized`.

TFLite is primarily useful for reducing model size for edge or mobile deployment. Speed on desktop GPUs is usually comparable to the original SavedModel.

### 4.1 Quantize the model

```bash
# Dynamic range quantization (recommended)
jaeger utils quantize \
    -m jaeger_57341_1.5M_fragment \
    -o ./quantized \
    --mode dynamic

# Float16 weights
jaeger utils quantize \
    -m jaeger_57341_1.5M_fragment \
    -o ./quantized \
    --mode float16

# Full INT8 (experimental)
jaeger utils quantize \
    -m jaeger_57341_1.5M_fragment \
    -o ./quantized \
    --mode full_int8
```

This creates:

```
quantized/
└── jaeger_57341_1.5M_fragment_dynamic/
    ├── jaeger_57341_1.5M_fragment_dynamic.tflite
    └── ... metadata ...
```

### 4.2 Run inference

```bash
jaeger predict -i contigs.fasta -o output_dir \
    -m jaeger_57341_1.5M_fragment \
    --quantized dynamic
```

### When to use

- Model size matters more than speed (e.g., ~6 MB → ~1.6 MB with dynamic).
- Edge devices, mobile apps, or low-bandwidth deployments.

### Caveats

- `full_int8` is experimental and can reduce accuracy on real data; `dynamic` is recommended.
- Desktop GPU speed is usually not faster than the original model.

---

## 5. ONNX INT8 quantization

**Effort:** higher. Convert to ONNX and then apply static INT8 quantization. This gives the smallest GPU-runnable model.

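
To make "static INT8 quantization" more concrete, the sketch below shows the basic arithmetic in plain Python: a scale is fixed ahead of time from calibration values, then floats are rounded onto an 8-bit grid. This is an illustrative simplification only (symmetric, per-tensor, no zero point); the function names are hypothetical and it is not the code used by Jaeger or ONNX Runtime.

```python
def int8_scale(calibration_values):
    """Pick a symmetric scale so the largest calibration magnitude maps to 127."""
    max_abs = max(abs(v) for v in calibration_values)
    return max_abs / 127.0 if max_abs > 0 else 1.0


def quantize(values, scale):
    """Round each float to the nearest INT8 step, clamping to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Map INT8 steps back to approximate float values."""
    return [q * scale for q in q_values]


# Made-up calibration data, used here only to fix the scale in advance
# ("static" means the scale is not recomputed at inference time).
calibration = [-0.8, -0.1, 0.0, 0.25, 0.5]
scale = int8_scale(calibration)

q = quantize(calibration, scale)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(calibration, restored))

print("scale:", scale)
print("quantized:", q)
print("max round-trip error:", max_error)
```

Values inside the calibrated range come back with an error of at most half a step (`scale / 2`), while values outside it are clamped. That is why the choice of calibration data matters and why validating accuracy on your own data is worthwhile.
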
### 5.1 Install dependencies

Same as [ONNX Runtime](#31-install-dependencies):

```bash
pip install onnxruntime-gpu tf2onnx sympy onnx
```

### 5.2 Convert and quantize

```bash
jaeger utils convert-graph \
    -m jaeger_57341_1.5M_fragment \
    -o ./optimized \
    --mode onnx \
    --int8
```

This creates:

```
optimized/
└── jaeger_57341_1.5M_fragment_onnx_int8/
    ├── jaeger_57341_1.5M_fragment_int8.onnx   # quantized
    ├── jaeger_57341_1.5M_fragment.onnx        # original FP32
    └── ... metadata ...
```

### 5.3 Run inference

```bash
jaeger predict -i contigs.fasta -o output_dir \
    -m jaeger_57341_1.5M_fragment \
    --onnx --int8
```

### When to use

- You need the smallest possible model that still runs on GPU (~6 MB → ~2.4 MB).
- You want faster-than-SavedModel inference without a custom TF build.

### Caveats

- INT8 ONNX models use the **CUDA** execution provider, not TensorRT, because TensorRT's ONNX parser has strict requirements for quantized subgraphs.
- Calibration is performed with synthetic one-hot codon tensors. Accuracy is usually very close to FP32, but you should validate on your target data.

---

## 6. TensorRT (TF-TRT)

**Effort:** highest. Requires TensorFlow built with TensorRT support.

Standard pip-installed TensorFlow does **not** include TensorRT. This path is only practical if you use NVIDIA's NGC containers or build TensorFlow from source.

### 6.1 Use an NGC container

```bash
docker run --gpus all -it nvcr.io/nvidia/tensorflow:24.10-tf2-py3
```

Inside the container, install Jaeger and run:

```bash
jaeger utils convert-graph \
    -m jaeger_57341_1.5M_fragment \
    -o ./optimized \
    --mode tensorrt
```

### 6.2 Easier alternative

For most users, [ONNX Runtime with TensorRT](#3-onnx-runtime) is the better path: it does not require a custom TensorFlow build and gives most of the speedup.

### When to use

- Maximum GPU performance is required.
- You already run workloads in NVIDIA NGC containers.

---

## Environment configuration

### For ONNX Runtime + TensorRT

If you see errors like:

```
libnvinfer.so.10: cannot open shared object file
libcudnn.so.9: cannot open shared object file
```

Install the matching pip packages:

```bash
# Required for ONNX Runtime 1.26
pip install tensorrt==10.16.1.11
pip install nvidia-cudnn-cu12   # provides libcudnn.so.9
```

Jaeger's `ONNXEngine` automatically preloads these libraries via `ctypes`, so you usually do **not** need to set `LD_LIBRARY_PATH`. If you still have issues, you can export the library paths manually:

```bash
SITE_PACKAGES="$(python -c 'import site; print(site.getsitepackages()[0])')"
export LD_LIBRARY_PATH="${SITE_PACKAGES}/tensorrt_libs:${SITE_PACKAGES}/nvidia/cudnn/lib:${LD_LIBRARY_PATH}"
```

### Verify providers

```bash
python -c "import onnxruntime as ort; print(ort.get_available_providers())"
```

Expected output on a working NVIDIA system:

```python
['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
```

---

## Choosing an optimization

| Situation | Recommended option |
|-----------|--------------------|
| Just want a quick speedup | `--precision fp16` |
| Large dataset, repeated shapes | `--xla --precision fp16` |
| Reliable GPU speedup without custom builds | Convert to ONNX, then `--onnx` |
| Need smallest model on GPU | Convert to ONNX INT8, then `--onnx --int8` |
| Need smallest model for edge/mobile | `jaeger utils quantize --mode dynamic` |
| Maximum performance in NGC containers | `jaeger utils convert-graph --mode tensorrt` |
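
The speedups quoted on this page are ranges, and the actual gain depends on your GPU and data, so it can be worth timing a few options on your own input. The sketch below is one possible harness, not part of Jaeger: the helper names (`build_command`, `time_command`, `benchmark`) and the `bench_<name>` output directories are assumptions made for illustration, and only flags that appear on this page are used.

```python
import subprocess
import sys
import time

# Flag sets taken from this page; the labels are arbitrary.
VARIANTS = {
    "baseline": [],
    "fp16": ["--precision", "fp16"],
    "xla": ["--xla"],
    "xla_fp16": ["--xla", "--precision", "fp16"],
}


def build_command(fasta, out_dir, extra_flags):
    """Return the argument list for one `jaeger predict` run."""
    return ["jaeger", "predict", "-i", fasta, "-o", out_dir, *extra_flags]


def time_command(cmd):
    """Run a command to completion and return wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


def benchmark(fasta, runner=time_command):
    """Time every variant once; returns {label: seconds}."""
    results = {}
    for name, flags in VARIANTS.items():
        # Hypothetical per-variant output directory to avoid overwriting results.
        cmd = build_command(fasta, f"bench_{name}", flags)
        results[name] = runner(cmd)
    return results


def report(results, stream=sys.stdout):
    """Print timings, fastest first."""
    for name, seconds in sorted(results.items(), key=lambda kv: kv[1]):
        stream.write(f"{name:10s} {seconds:8.1f} s\n")
```

Calling `report(benchmark("contigs.fasta"))` would run each variant in turn. A single run includes one-time costs such as XLA compilation, so short inputs can make those options look slower than they are on large datasets.
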