TensorRT-LLM + NVIDIA Triton
by NVIDIA · free · Last verified 2026-03-17
TensorRT-LLM compiles large language models into optimized engines with fused CUDA kernels, while the Triton Inference Server handles serving: request scheduling, batching, and endpoint management. Together they form NVIDIA's production stack for maximizing token throughput and minimizing latency on datacenter GPUs.
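Triton exposes compiled TensorRT-LLM engines over HTTP via its generate extension (`/v2/models/<model>/generate`). A minimal sketch of building such a request, assuming a model deployed under the hypothetical name `tensorrt_llm_bls` and the `text_input`/`max_tokens` field names commonly used by the TensorRT-LLM backend's model templates:

```python
import json

# Hypothetical model name; TensorRT-LLM engines are exposed under whatever
# name they were given in the Triton model repository.
MODEL = "tensorrt_llm_bls"

def build_generate_request(prompt, max_tokens=64, temperature=0.7):
    """Build the URL and JSON body for Triton's generate endpoint."""
    url = f"http://localhost:8000/v2/models/{MODEL}/generate"
    payload = {
        "text_input": prompt,        # prompt field (backend-template naming)
        "max_tokens": max_tokens,    # generation length cap
        "temperature": temperature,  # sampling temperature
    }
    return url, json.dumps(payload)

url, body = build_generate_request("Explain paged attention in one sentence.")
# An HTTP client (e.g. urllib.request or requests) would POST `body` to
# `url`; the call itself is omitted since it requires a running Triton server.
```

The same payload works over gRPC via Triton's client libraries; HTTP is shown only because it needs no extra dependencies.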
https://github.com/NVIDIA/TensorRT-LLM
Overall Grade: B (Above Average)
- Adoption: B+
- Quality: A+
- Freshness: A+
- Citations: B
- Engagement: F
Specifications
- License
- Apache-2.0
- Pricing
- free
- Capabilities
- In-flight and dynamic batching, Paged-attention and KV caching, FP8 and INT4/INT8 quantization, Multi-GPU and multi-node tensor parallelism, Optimized CUDA kernel fusion, Streaming LLM responses, HTTP/gRPC endpoints via Triton, Concurrent model execution
- Integrations
- pytorch, hugging-face-transformers, kubernetes, prometheus, grafana, docker
- API Available
- Yes
- Tags
- inference-optimization, llm-serving, nvidia, triton, tensorrt, high-performance, cuda, gpu-acceleration, production-inference, low-latency, quantization
- Added
- 2026-03-17
- Completeness
- 95%
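The capabilities above include streaming LLM responses, which Triton delivers over HTTP as server-sent events: each generated chunk arrives as a `data:` line carrying a JSON object. A sketch of parsing such a stream, using illustrative payloads (not captured from a real server) and the `text_output` field name used by the TensorRT-LLM backend templates:

```python
import json

# Illustrative server-sent-events framing for a streamed generation;
# real chunk contents depend on the deployed model and its config.
sample_stream = (
    'data: {"text_output": "Paged"}\n\n'
    'data: {"text_output": " attention"}\n\n'
)

def parse_sse_chunks(raw):
    """Extract the text_output field from each `data:` event line."""
    chunks = []
    for line in raw.splitlines():
        if line.startswith("data: "):
            chunks.append(json.loads(line[len("data: "):])["text_output"])
    return chunks

print("".join(parse_sse_chunks(sample_stream)))  # -> Paged attention
```

In practice the raw stream would come from an HTTP response read incrementally rather than a string, but the per-event framing is the same.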
Index Score: 63.8
- Adoption: 70
- Quality: 94
- Freshness: 90
- Citations: 68
- Engagement: 0