TensorRT-LLM + NVIDIA Triton
by NVIDIA · open-source · Last verified 2026-03-17
TensorRT-LLM uses NVIDIA's TensorRT compiler to build LLMs into optimized engines with fused CUDA kernels, while the Triton Inference Server backend handles dynamic (in-flight) batching, multi-instance serving, and gRPC/HTTP endpoints. Together they form NVIDIA's recommended production stack for maximizing tokens-per-second on datacenter GPUs.
https://github.com/NVIDIA/TensorRT-LLM
Grade: B (Above Average)
Adoption: B+ · Quality: A+ · Freshness: A+ · Citations: B · Engagement: F
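To illustrate the serving side described above, the sketch below sends a completion request to a running Triton instance over its HTTP generate endpoint. It is a minimal example under stated assumptions: a server on localhost:8000 and a model repository exposing an "ensemble" model with text_input/max_tokens/text_output fields; those names follow the common tensorrtllm_backend examples and will differ if your repository is configured otherwise.

```python
# Minimal sketch: call a Triton server hosting a TensorRT-LLM engine via the
# HTTP generate extension. The model name ("ensemble") and the request/response
# field names ("text_input", "max_tokens", "text_output") are assumptions that
# depend on your model repository's config.pbtxt files; adjust to match.
import requests

TRITON_URL = "http://localhost:8000/v2/models/ensemble/generate"

payload = {
    "text_input": "Summarize in-flight batching in one sentence.",
    "max_tokens": 64,
    "temperature": 0.2,
}

resp = requests.post(TRITON_URL, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json().get("text_output"))
```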
Specifications
- License: Apache-2.0
- Pricing: open-source
- Capabilities: kernel-fusion, in-flight-batching, fp8-quantization, multi-gpu-tensor-parallel, triton-backend (see the sketch after this list)
- Integrations: nvidia-triton, nvidia-a100, nvidia-h100, kubernetes
- Use Cases: max-throughput-serving, datacenter-llm-api, latency-sensitive-inference, enterprise-ai-deployment
- API Available: Yes
- Tags: inference, nvidia, triton, tensorrt, high-performance
- Added: 2026-03-17
- Completeness: 100%
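On the build/offline side, recent TensorRT-LLM releases ship a high-level Python LLM API that compiles an engine from a Hugging Face checkpoint and can shard it across GPUs with tensor parallelism, which is how the multi-gpu-tensor-parallel capability above is typically exercised. The sketch below is a rough illustration under assumptions: it presumes your installed version exposes this API and that two GPUs are visible; the checkpoint name and sampling parameter names are placeholders and may vary between releases.

```python
# Hedged sketch: build and run a TensorRT-LLM engine through the high-level
# Python API, sharded across two GPUs with tensor parallelism. Assumes a
# recent tensorrt_llm release that provides LLM / SamplingParams; the
# checkpoint name and parameter names are illustrative placeholders.
from tensorrt_llm import LLM, SamplingParams

def main():
    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # any supported HF checkpoint
        tensor_parallel_size=2,                     # split weights across 2 GPUs
    )
    params = SamplingParams(max_tokens=64, temperature=0.2)
    outputs = llm.generate(["What does in-flight batching do?"], params)
    for out in outputs:
        print(out.outputs[0].text)

if __name__ == "__main__":
    main()
```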
Index Score: 63.8
- Adoption: 70
- Quality: 94
- Freshness: 90
- Citations: 68
- Engagement: 0