Key Takeaways
- Phi-3-mini has 3.8 billion parameters and outperforms models twice its size on HumanEval
- Gemma-2B contains about 2 billion parameters and is optimized for mobile deployment
- TinyLlama 1.1B has 1.1 billion parameters trained on 3 trillion tokens
- Phi-3-mini trained on 3.3 trillion tokens costing under $10M
- Gemma 2B trained on 3 trillion tokens in under 1 week on TPUs
- TinyLlama 1.1B trained on 3T tokens using only 16 A100 GPUs
- Phi-3-mini (4-bit quantized) runs at over 12 tokens/second on an iPhone 14 (A16 Bionic)
- Gemma-2B runs at 20+ tokens/sec on single GPU quantized
- TinyLlama 1.1B infers at 50 tokens/sec on A100 GPU
- Phi-3-mini scores 68.8% on MMLU 5-shot
- Gemma-2B achieves 42.3% on the MMLU benchmark
- TinyLlama scores 58.8% on MMLU zero-shot
- Phi-3-mini deployed on Azure AI at 10x cost savings vs Llama2-70B
- Gemma-2B integrated into Android apps for on-device AI
- TinyLlama adopted in 1M+ HuggingFace downloads monthly
Together, these figures show how widely small language models vary in parameter count, benchmark performance, training cost, and deployment footprint.
Benchmark Results
- Phi-3-mini scores 68.8% on MMLU 5-shot
- Gemma-2B achieves 42.3% on the MMLU benchmark
- TinyLlama scores 58.8% on MMLU zero-shot
- Phi-2 reaches 56.9% on MMLU and 78% on HumanEval
- Qwen1.5-0.5B scores 52.5% on MMLU multilingual
- StableLM-2-Zephyr-1.6B 62.3% on MMLU chat eval
- OpenELM-270M 45.2% on ARC-Challenge
- MobileLLaMA-1.4B 55% on GSM8K math benchmark
- SmolLM-135M achieves 20.21% on ARC-Challenge
- DistilBERT retains 97% of BERT's GLUE performance while being 40% smaller
- MiniLM-L6 scores 74.9 on GLUE average
- Phi-1 50.6% on HumanEval coding benchmark
- Gemma-7B 64.3% MMLU matching larger models
- RWKV-1B5 52% on PIQA commonsense
- H2O-Danube-1.8B 59.2% on MMLU
- Pythia-1B 35.2% on Hellaswag
- OPT-125M scores 25.4% on the LAMBADA completion task
- T5-small 70.8 on XSum ROUGE score
- FLAN-T5-small 62.5% on Natural Questions
- LaMini-Flan-T5-248M 45% on MMLU instruction
- mT5-small 78.5% on multilingual GLUE
- Qwen2-0.5B 58.1% on MMLU improved
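Raw scores naturally favor larger models, so a useful secondary lens is score per billion parameters. A minimal sketch of that comparison, using a few MMLU figures quoted in this article (evaluation settings differ per model, so treat it as a rough efficiency proxy, not a ranking):

```python
# (parameters in billions, MMLU score in %) as quoted in this article
models = {
    "Phi-3-mini": (3.8, 68.8),
    "Qwen2-0.5B": (0.5, 58.1),
    "H2O-Danube-1.8B": (1.8, 59.2),
}

def mmlu_per_billion(params_b: float, score: float) -> float:
    """MMLU points per billion parameters - a crude efficiency proxy."""
    return score / params_b

for name, (params_b, score) in models.items():
    print(f"{name}: {mmlu_per_billion(params_b, score):.1f} pts/B params")
```

On this view, sub-billion-parameter models like Qwen2-0.5B look strongest, which is exactly the trade-off the deployment sections below illustrate.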
Deployment and Adoption
- Phi-3-mini deployed on Azure AI at 10x cost savings vs Llama2-70B
- Gemma-2B integrated into Android apps for on-device AI
- TinyLlama adopted in 1M+ HuggingFace downloads monthly
- Phi-2 used in GitHub Copilot mobile features
- Qwen1.5 series downloaded 50M+ times on HF
- StableLM-2 in enterprise chatbots reducing latency 70%
- OpenELM powers Apple on-device research prototypes
- MobileLLaMA in Samsung Galaxy AI features
- SmolLM used in browser-based AI demos 100k users
- DistilBERT deployed in 1000+ production NLP apps
- MiniLM in Microsoft Bing search ranking
- Phi-1 inspired 500+ community fine-tunes
- Gemma licensed for commercial use in 10M devices
- RWKV in real-time voice assistants
- H2O-Danube integrated into H2O.ai platform for business
- Pythia suite benchmarked in 200+ research papers
- OPT-125M forked 10k times on HF for custom apps
- T5-small in Google Translate edge inference
- FLAN-T5 powering 50+ instruction-tuned apps
- LaMini-Flan-T5 in low-resource language tools
- mT5-small adopted for 50+ languages in apps
- Qwen2 small models in Alibaba cloud services 1M queries/day
Inference Speed
- Phi-3-mini (4-bit quantized) runs at over 12 tokens/second on an iPhone 14 (A16 Bionic)
- Gemma-2B runs at 20+ tokens/sec on single GPU quantized
- TinyLlama 1.1B infers at 50 tokens/sec on A100 GPU
- Phi-2 achieves 30 tokens/sec on CPU with ONNX
- Qwen1.5-0.5B reaches 100+ tokens/sec on mobile devices
- StableLM-2-1.6B quantized to 4-bit runs 4x faster
- OpenELM-270M infers at 2x speed of Llama-7B per param
- MobileLLaMA-1.4B achieves 40 tokens/sec on smartphone CPU
- SmolLM-135M runs at 150 tokens/sec on laptop CPU
- DistilBERT 60% faster inference than BERT-base
- MiniLM-L6 5x faster than BERT-large on CPU
- Phi-1 optimized for 25 tokens/sec on edge devices
- Gemma-7B Q4_K_M 10 tokens/sec on consumer GPU
- RWKV-1B5 linear scaling enables 100 tokens/sec streaming
- H2O-Danube-1.8B 3x faster than Mistral-7B on CPU
- Pythia-1B decodes at 40 tokens/sec with FlashAttention
- OPT-125M achieves 200 tokens/sec on GPU batch=1
- T5-small infers 2x faster than full T5-base
- FLAN-T5-small 1.5x speedup over T5-small untuned
- LaMini-Flan-T5-248M runs on 4GB RAM devices
- mT5-small 30% faster multilingual inference
- Qwen2-0.5B achieves 80 tokens/sec on ARM CPU
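Throughput figures like those above are typically measured by timing a generation call and dividing tokens produced by wall-clock seconds. A minimal sketch of that harness, where `generate` is a hypothetical stand-in for a real model call:

```python
import time

def generate(prompt: str, max_new_tokens: int) -> list:
    """Hypothetical stand-in for a model's generation call.
    A real version would run transformer inference; here we just
    simulate work so the timing harness is runnable."""
    time.sleep(0.001 * max_new_tokens)  # pretend ~1000 tokens/sec
    return ["tok"] * max_new_tokens

def tokens_per_second(prompt: str, max_new_tokens: int = 100) -> float:
    """Time one generation and report decoded tokens per wall-clock second."""
    start = time.perf_counter()
    tokens = generate(prompt, max_new_tokens)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed

print(f"{tokens_per_second('Hello'):.0f} tokens/sec")
```

A careful measurement would also separate prompt processing (prefill) from decoding, since per-token decode speed is what dominates interactive latency on-device.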
Model Parameters and Size
- Phi-3-mini has 3.8 billion parameters and outperforms models twice its size on HumanEval
- Gemma-2B contains about 2 billion parameters and is optimized for mobile deployment
- TinyLlama 1.1B has 1.1 billion parameters trained on 3 trillion tokens
- Microsoft Phi-2 features 2.7 billion parameters and matches or outperforms models up to 25x larger on some benchmarks
- Qwen1.5-0.5B has 0.5 billion parameters and scores 52.5 on MMLU
- StableLM-2-Zephyr-1.6B has 1.6 billion parameters fine-tuned for chat
- OpenELM-270M contains 270 million parameters, pre-trained on roughly 1.8 trillion public tokens
- MobileLLaMA-1.4B has 1.4 billion parameters designed for edge devices
- SmolLM-135M has 135 million parameters achieving 20.21 on ARC-Challenge
- BERT-base-uncased has 110 million parameters as a foundational small model
- DistilBERT has 66 million parameters, 40% smaller than BERT-base
- MiniLM-L6 has around 22 million parameters for efficient NLP
- Phi-1 has 1.3 billion parameters trained on textbook-quality data
- Gemma-7B has 7 billion parameters but can be quantized to 4-bit for a small footprint
- RWKV-1B5 has 1.5 billion parameters using RNN architecture
- H2O-Danube-1.8B has 1.8 billion parameters for multilingual tasks
- Pythia-1B has 1 billion parameters from EleutherAI suite
- OPT-125M has 125 million parameters as smallest OPT variant
- T5-small has 60 million parameters for text-to-text tasks
- FLAN-T5-small has 77 million parameters fine-tuned for instruction
- LaMini-Flan-T5-248M has 248 million parameters for low-resource
- mT5-small has 300 million parameters multilingual
- Phi-3-vision-128k-instruct has 4.2 billion parameters including vision
- Qwen2-0.5B has 0.5 billion parameters with improved coding
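The parameter counts above translate directly into the memory footprints that make these models deployable: weight memory is roughly parameters times bits per parameter. A back-of-the-envelope sketch (ignoring KV cache, activations, and quantization-scale overhead):

```python
def model_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate weight memory in decimal GB: params x bits / 8 bytes."""
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

# Phi-3-mini (3.8B) at 4-bit is roughly 1.9 GB of weights,
# which is why it fits on a phone; at fp16 it needs ~7.6 GB.
print(f"{model_memory_gb(3.8, 4):.1f} GB at 4-bit")
print(f"{model_memory_gb(3.8, 16):.1f} GB at fp16")
```

The same arithmetic explains entries like Gemma-7B above: 7B parameters at 4-bit is about 3.5 GB, small enough for a consumer GPU.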
Training Efficiency
- Phi-3-mini trained on 3.3 trillion tokens costing under $10M
- Gemma 2B trained on 3 trillion tokens in under 1 week on TPUs
- TinyLlama 1.1B trained on 3T tokens using only 16 A100 GPUs
- Phi-2 trained on 1.4T tokens of synthetic and web data in 14 days
- Qwen1.5-0.5B trained with filtered high-quality data reducing compute by 50%
- StableLM-2 1.6B trained on 1.6T tokens with alignment
- OpenELM models trained efficiently on roughly 1.8T tokens from public datasets
- MobileLLaMA trained on 1T tokens optimized for mobile FLOPs
- SmolLM trained on 600B filtered tokens from HuggingFace
- DistilBERT distilled from BERT using 3x less compute
- MiniLM trained with knowledge distillation halving latency
- Phi-1 trained solely on 7B textbook tokens
- Gemma models used group-query attention reducing training memory 20%
- RWKV trained linearly without quadratic attention compute
- H2O-Danube trained on 1T multilingual tokens affordably
- Pythia trained transparently on 300B tokens of The Pile dataset
- OPT-125M trained on 180B tokens openly
- T5-small (60M parameters) pre-trained efficiently on the C4 dataset
- FLAN-T5 used chain-of-thought distillation for efficiency
- LaMini-Flan-T5 trained on 2.58M diverse instructions
- mT5-small trained on mC4 for 101 languages
- Qwen2 trained with rejection sampling improving quality per FLOP
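One way to read the training figures above is tokens seen per parameter, relative to the roughly 20 tokens per parameter that the Chinchilla scaling analysis (Hoffmann et al., 2022) found compute-optimal. Small models are deliberately trained far past that point. A quick sketch using counts quoted in this article:

```python
# (parameters, training tokens) as quoted in this article
runs = {
    "Phi-3-mini": (3.8e9, 3.3e12),
    "TinyLlama-1.1B": (1.1e9, 3.0e12),
    "Phi-1": (1.3e9, 7e9),
}

CHINCHILLA_TOKENS_PER_PARAM = 20  # compute-optimal ratio, Hoffmann et al. 2022

for name, (params, tokens) in runs.items():
    ratio = tokens / params
    multiple = ratio / CHINCHILLA_TOKENS_PER_PARAM
    print(f"{name}: {ratio:,.0f} tokens/param ({multiple:.1f}x Chinchilla-optimal)")
```

TinyLlama's roughly 2,700 tokens per parameter (over 130x the Chinchilla ratio) is the "over-training" strategy that trades extra training compute for a smaller, cheaper-to-serve model, while Phi-1 goes the opposite direction and compensates with curated textbook-quality data.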