Key Takeaways
- Average ResNet-50 inference latency on an NVIDIA A100 GPU is 1.2 ms at batch size 1
- BERT-Large inference latency on a T4 GPU reaches 2.5 ms per query with TensorRT
- Llama 2 7B first-token latency is 450 ms on a single H100 GPU with vLLM
- Llama 3 70B achieves 150 tokens/sec throughput on 8x H100 GPUs
- GPT-4 inference throughput is 100 queries/sec on a custom cluster
- BERT-Base processes 500 sequences/sec on an A100 at batch size 128
- Average GPT-3.5 inference cost is $0.0005 per 1K tokens
- Llama 2 70B inference costs $0.0002 per 1K tokens on AWS
- Claude 2 API inference costs $3 per million input tokens
- A100 inference power draw is 400 W for a 70B model at 100 tokens/sec
- An H100 SXM consumes 700 W while delivering roughly 2x the Llama inference performance of an A100
- A TPU v4 pod slice uses 250 W per core for BERT inference
- An A100 SXM4 achieves 85% utilization on DLRM, reducing energy use by 15%
- An H100 PCIe hits 90% MFU with TensorRT-LLM for Llama 70B
- A TPU v5e reaches 75% utilization for PaLM inference at scale
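The power and cost bullets above can be cross-checked with back-of-envelope arithmetic: energy per token is simply power draw divided by throughput. A minimal sketch using the A100 figures quoted above (400 W at 100 tokens/sec); the $0.10/kWh electricity rate is an illustrative assumption, not a figure from the source:

```python
def energy_per_token_j(power_w: float, tokens_per_sec: float) -> float:
    """Joules consumed per generated token: power (W = J/s) over throughput."""
    return power_w / tokens_per_sec

def electricity_cost_per_million_tokens(power_w: float,
                                        tokens_per_sec: float,
                                        usd_per_kwh: float = 0.10) -> float:
    """Electricity cost alone for 1M tokens (usd_per_kwh is an assumed rate)."""
    joules = energy_per_token_j(power_w, tokens_per_sec) * 1_000_000
    kwh = joules / 3.6e6  # 1 kWh = 3.6 MJ
    return kwh * usd_per_kwh

# A100 serving a 70B model: 400 W at 100 tokens/sec (from the bullet above)
print(energy_per_token_j(400, 100))                              # 4.0 J/token
print(round(electricity_cost_per_million_tokens(400, 100), 4))   # 0.1111 USD
```

Note that at roughly $0.11 of electricity per million tokens, power is a small fraction of the quoted per-token prices; hardware amortization and margins dominate.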
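Batched throughput and latency figures are two views of the same quantity: assuming fully pipelined batches, per-batch latency is batch size divided by sequences per second. A sketch using the BERT-Base figure quoted above (500 seq/sec at batch 128); the derived latency is an implication, not a number from the source:

```python
def implied_batch_latency_ms(seqs_per_sec: float, batch_size: int) -> float:
    """Latency of one full batch, assuming batches are processed back-to-back."""
    return batch_size / seqs_per_sec * 1000

# BERT-Base on A100: 500 sequences/sec at batch size 128 (from the bullet above)
print(implied_batch_latency_ms(500, 128))  # 256.0 ms per batch
```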
These AI inference statistics span models, hardware platforms, latency, throughput, cost, and power.
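Utilization claims like the MFU figures above can be sanity-checked with the common approximation that decoding one token costs about 2 FLOPs per model parameter. A sketch; the model size, aggregate throughput, and per-GPU peak FLOPS in the example are hypothetical assumptions, not figures from the source:

```python
def mfu(tokens_per_sec: float, n_params: float,
        n_gpus: int, peak_flops: float) -> float:
    """Model FLOPs Utilization: achieved decode FLOP/s over hardware peak.

    Uses the ~2 FLOPs per parameter per generated token approximation;
    tokens_per_sec must be the *aggregate* throughput across all requests,
    not a single request's decode speed.
    """
    achieved = tokens_per_sec * 2 * n_params
    return achieved / (n_gpus * peak_flops)

# Hypothetical example: a 70B-parameter model producing 2,000 aggregate
# tokens/sec on 8 GPUs, each with an assumed 1e15 FLOP/s dense peak.
print(round(mfu(2000, 70e9, 8, 1e15), 3))  # 0.035, i.e. 3.5% MFU
```

Low-batch decoding is memory-bandwidth-bound, so MFU at small batch sizes is typically in the single digits; high MFU figures imply heavily batched serving.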
The report covers the following metric categories, each paired with an interpretation section:
- Cost Metrics
- Energy Efficiency
- Hardware Utilization
- Latency Metrics
- Throughput Metrics
Sources & References
- MLCommons (mlcommons.org)
- NVIDIA Developer (developer.nvidia.com)
- Hugging Face (huggingface.co)
- arXiv (arxiv.org)
- Google Cloud (cloud.google.com)
- OpenAI (openai.com)
- ONNX Runtime (onnxruntime.ai)
- Google AI Blog (ai.googleblog.com)
- Meta AI (ai.meta.com)
- NVIDIA (nvidia.com)
- Mistral AI (mistral.ai)
- Lambda (lambda.ai)
- GitHub (github.com)
- MLPerf (mlperf.org)
- Meta Llama (llama.meta.com)
- xAI (x.ai)
- NVIDIA Newsroom (nvidianews.nvidia.com)
- Cohere (cohere.com)
- AWS (aws.amazon.com)
- Anthropic (anthropic.com)
- Together AI (together.ai)
- Replicate (replicate.com)
- Microsoft Azure (azure.microsoft.com)
- Fireworks AI (fireworks.ai)
- Groq (groq.com)
- RunPod (runpod.io)
- Vast.ai (vast.ai)
- Coral (coral.ai)
- Apple (apple.com)
- Cerebras (cerebras.net)
- Graphcore (graphcore.ai)
- AMD (amd.com)
- Intel (intel.com)
- Qualcomm (qualcomm.com)
- SambaNova (sambanova.ai)
- Tenstorrent (tenstorrent.com)
- Etched (etched.ai)
- Liquid AI (liquid.ai)
- NVIDIA Developer Forums (forums.developer.nvidia.com)
- d-Matrix (d-matrix.ai)
- Recurse (recurse.com)
- MosaicML (mosaicml.com)
- FlexFlow (flexflow.ai)






