TensorRT-LLM
by NVIDIA · open-source · Last verified 2026-04-24
TensorRT-LLM is NVIDIA's optimized inference library that compiles LLMs into highly efficient TensorRT engines for maximum GPU utilization. It supports INT4, INT8, and FP8 quantization, in-flight batching, and KV-cache optimization, delivering some of the highest raw throughput available on NVIDIA hardware for production deployments.
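To give a feel for the INT8 path mentioned above, here is a minimal sketch of symmetric per-tensor quantization in plain NumPy. This is an illustration of the general scheme, not TensorRT-LLM's actual kernels, and the function names are hypothetical:

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor INT8: map [-max|w|, +max|w|] onto [-127, 127].
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximation of the original weights.
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.27, 0.031, 1.0], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = np.abs(w - w_hat).max()  # rounding error bounded by scale / 2
```

Each weight is stored as one signed byte plus a shared float scale, which is where the memory and bandwidth savings over FP16 come from; TensorRT-LLM applies calibrated variants of this idea (per-channel scales, INT4/FP8 formats) when building an engine.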
https://github.com/NVIDIA/TensorRT-LLM
Overall grade: C (Below Average)
- Adoption: C+
- Quality: B+
- Freshness: A
- Citations: C
- Engagement: F
Specifications
- License: Open Source
- Pricing: open-source
- Capabilities
- Integrations
- Use Cases
- API Available: No
- SDK Languages: python, cpp
- Deployment: self-hosted, docker, nvidia-cloud
- Rate Limits: N/A (self-hosted, hardware-limited)
- Data Privacy: Self-hosted, user-managed
- Tags: inference, nvidia, tensorrt, quantization, throughput, production
- Added: 2026-04-24
- Completeness: 60%
Index Score: 44
- Adoption: 50
- Quality: 70
- Freshness: 80
- Citations: 40
- Engagement: 0