GPU Benchmarking
Comprehensive performance analysis of Generate Nano on Qualcomm QCM6490 and QCM8550 chipsets.
Overview
Generate Nano has been optimized for Qualcomm's QCM6490 and QCM8550 chipsets, leveraging their GPU capabilities to maximize performance for on-device AI inference. This benchmarking report provides detailed performance metrics across various GPU/CPU configurations to help you determine the optimal settings for your deployment.
Executive Summary
| Chipset | Baseline (CPU only) | Best (Optimal GPU) |
|---|---|---|
| QCM6490 | 7.75 tok/s | 14.77 tok/s |
| QCM8550 | 12.26 tok/s | 23.47 tok/s |
Our benchmarking tests demonstrate that the QCM8550 chipset delivers approximately 59% higher performance than the QCM6490 when comparing optimal configurations. Both chipsets show significant performance improvements when utilizing GPU acceleration compared to CPU-only processing.
Key Findings
- Optimal GPU Configuration: For both chipsets, the "Optimal GPU" configuration (27 GPU layers, 4 CPU threads) provides the best balance of performance and efficiency.
- Diminishing Returns: Configurations with more than 27 GPU layers show diminishing returns and can even decrease performance.
- Memory Usage: The QCM8550 demonstrates significantly lower memory usage (422MB vs 1560MB) while delivering higher performance.
- Efficiency: The QCM8550 achieves 5.85x better efficiency (tok/s/MB) compared to the QCM6490.
Chipset Comparison
| Metric | QCM6490 | QCM8550 | Improvement |
|---|---|---|---|
| Peak Performance | 14.77 tok/s | 23.47 tok/s | +59% |
| CPU Baseline | 7.75 tok/s | 12.26 tok/s | +58% |
| Memory Usage (Optimal) | 1560 MB | 422 MB | -73% |
| Efficiency (tok/s/MB) | 0.0095 | 0.0556 | +485% |
| Time to First Token | 1815 ms | 742 ms | -59% |
| Total Processing Time | 20310 ms | 12781 ms | -37% |
The QCM8550 chipset demonstrates superior performance across all metrics, with particularly significant improvements in memory efficiency and response time. The reduced memory footprint allows for more complex models or concurrent applications on the same device.
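The derived columns follow directly from the raw measurements in the table: for example, the efficiency improvement is 0.0556 / 0.0095 ≈ 5.85x, i.e. roughly +485%, and the TTFT reduction is 1 − 742 / 1815 ≈ 59%.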
Methodology
Our benchmarking methodology ensures consistent and reliable performance measurements across different configurations and chipsets.
Test Environment
All tests were conducted using the following standardized environment:
- Model: Interplay Think 0.6B Q4_0 (4-bit quantized, 600M parameters)
- Prompt: Technical AI/ML explanation (~300 tokens)
- Response Length: 300 tokens (consistent across all tests)
- Temperature: 0.7
- Ambient Temperature: 22-24°C (controlled environment)
- Battery State: Plugged in, 100% charged
- Background Processes: Minimized to ensure consistent results
Metrics Collected
- TTFT (Time to First Token): Time from query submission to first token generation
- Total Time: Complete processing time for the entire response
- TPS (Tokens Per Second): Average number of tokens generated per second
- Memory Usage: Peak and average memory consumption during inference
- Power Efficiency: Tokens per second per percentage of battery used
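These figures can be reproduced from raw per-run timings with simple arithmetic. The sketch below is illustrative only (the types are not part of the Generate Nano SDK) and shows how TTFT, TPS, and the efficiency metric are derived:

```kotlin
// Illustrative derivation of the reported metrics from raw run data.
// These types are not part of the Generate Nano SDK.
data class InferenceRun(
    val submittedAtMs: Long,   // query submission timestamp
    val firstTokenAtMs: Long,  // timestamp of the first generated token
    val finishedAtMs: Long,    // timestamp of the last generated token
    val tokensGenerated: Int,  // 300 in these benchmarks
    val peakMemoryMb: Double   // peak memory during inference
)

data class RunMetrics(
    val ttftMs: Long,
    val totalMs: Long,
    val tps: Double,
    val efficiency: Double     // tok/s per MB
)

fun computeMetrics(run: InferenceRun): RunMetrics {
    val ttftMs = run.firstTokenAtMs - run.submittedAtMs
    val totalMs = run.finishedAtMs - run.submittedAtMs
    val tps = run.tokensGenerated / (totalMs / 1000.0)
    return RunMetrics(ttftMs, totalMs, tps, tps / run.peakMemoryMb)
}
```

For example, 300 tokens generated in 12,781 ms gives 300 / 12.781 ≈ 23.5 tok/s, which matches the QCM8550 Optimal GPU row reported below.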
Configuration Matrix
We tested multiple configurations with varying distributions of workload between GPU and CPU:
- GPU Layers: Number of transformer layers assigned to the GPU (0-48)
- CPU Threads: Number of CPU threads utilized (4 in every configuration except Very High GPU, which uses 2)
- Processing Mode: CPU Only, Hybrid (Balanced), Hybrid (GPU-Heavy), High GPU, Max GPU
QCM6490 Benchmarks
The Qualcomm QCM6490 chipset demonstrates strong performance for on-device AI inference, particularly when utilizing the optimal GPU configuration.
Performance Metrics
| Configuration | TTFT (ms) | Total Time (ms) | TPS (tok/s) | Memory (MB) | Efficiency (tok/s/MB) | Rating |
|---|---|---|---|---|---|---|
| Optimal GPU | 1815 | 20310 | 14.77 | 1560 | 0.0095 | Very Good |
| Custom (27L/4T) | 1820 | 20326 | 14.76 | 1546 | 0.0095 | Very Good |
| Balanced GPU | 2156 | 21824 | 13.75 | 1547 | 0.0089 | Very Good |
| Medium GPU | 3290 | 27608 | 10.87 | 1537 | 0.0071 | Good |
| Light GPU | 4411 | 33496 | 8.96 | 1549 | 0.0058 | Fair |
| CPU Only | 5426 | 38706 | 7.75 | 1453 | 0.0053 | Fair |
| High GPU | 1785 | 55724 | 5.38 | 1820 | 0.0030 | Poor |
| Very High GPU | 1804 | 55996 | 5.36 | 1830 | 0.0029 | Poor |
QCM6490 Recommendations
- For Maximum Performance: Use Optimal GPU configuration (27 GPU layers, 4 CPU threads) - 14.77 tok/s (1.91x speedup over CPU-only)
- For Best Efficiency: Use Custom (27L/4T) configuration - 0.0095 tok/s/MB
- For Lowest Memory: Use CPU Only configuration - 1453 MB
- For Best Battery Life: Use Optimal GPU configuration - 14771 tok/s per % battery
QCM8550 Benchmarks
The Qualcomm QCM8550 chipset delivers exceptional performance for on-device AI inference, with significant improvements in both speed and efficiency compared to the QCM6490.
Performance Metrics
| Configuration | TTFT (ms) | Total Time (ms) | TPS (tok/s) | Memory (MB) | Efficiency (tok/s/MB) | Rating |
|---|---|---|---|---|---|---|
| Optimal GPU | 742 | 12781 | 23.47 | 422 | 0.0556 | Excellent |
| Balanced GPU | 911 | 13296 | 22.56 | 478 | 0.0472 | Excellent |
| High GPU | 684 | 16542 | 18.14 | 515 | 0.0352 | Excellent |
| Very High GPU | 721 | 16544 | 18.13 | 522 | 0.0347 | Excellent |
| Medium GPU | 1486 | 17813 | 16.84 | 679 | 0.0248 | Excellent |
| Light GPU | 2038 | 22837 | 13.14 | 873 | 0.0150 | Very Good |
| CPU Only | 2432 | 24475 | 12.26 | 1071 | 0.0114 | Good |
QCM8550 Recommendations
- For Maximum Performance: Use Optimal GPU configuration (27 GPU layers, 4 CPU threads) - 23.47 tok/s (1.91x speedup over CPU-only)
- For Best Efficiency: Use Optimal GPU configuration - 0.0556 tok/s/MB
- For Lowest Memory: Use Optimal GPU configuration - 422 MB
- For Best Battery Life: Use Optimal GPU configuration - 23472 tok/s per % battery
Comparison
A direct comparison between the QCM6490 and QCM8550 chipsets reveals significant performance differences and optimization opportunities.
Performance Comparison
Figures: optimal-configuration throughput (tok/s), memory usage (MB), and efficiency (tok/s/MB) for each chipset.
The QCM8550 chipset demonstrates superior performance across all key metrics, with particularly dramatic improvements in memory efficiency. This makes it an ideal choice for deployments where multiple AI models need to run concurrently or where memory constraints are a concern.
GPU Configurations
Understanding the impact of different GPU/CPU workload distributions is crucial for optimizing Generate Nano's performance on Qualcomm chipsets.
Configuration Matrix
| Configuration | GPU Layers | CPU Threads | Mode | Description |
|---|---|---|---|---|
| CPU Only | 0 | 4 | CPU Only | All processing on CPU cores, no GPU acceleration |
| Light GPU | 8 | 4 | Hybrid (Balanced) | Minimal GPU offloading, most work on CPU |
| Medium GPU | 16 | 4 | Hybrid (Balanced) | Balanced workload between CPU and GPU |
| Balanced GPU | 24 | 4 | Hybrid (GPU-Heavy) | More work on GPU than CPU |
| Optimal GPU | 27 | 4 | Hybrid (GPU-Heavy) | Optimal balance for both chipsets |
| Custom (27L/4T) | 27 | 4 | Hybrid (GPU-Heavy) | Same as Optimal GPU, with custom parameters |
| High GPU | 32 | 4 | High GPU | Heavy GPU utilization |
| Very High GPU | 48 | 2 | Max GPU | Maximum GPU offloading, minimal CPU |
Our testing reveals that the optimal configuration for both chipsets is 27 GPU layers with 4 CPU threads. This configuration achieves the best balance between performance and efficiency, leveraging the strengths of both the GPU and CPU.
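A sweep over this matrix can be expressed as plain data and iterated, which makes it straightforward to repeat the measurements on your own hardware. The sketch below is hedged: `runBenchmark` stands in for whatever inference entry point your integration exposes and is assumed to return the measured tok/s.

```kotlin
// Hypothetical sweep over the benchmark configuration matrix above.
// runBenchmark() stands in for your actual inference entry point.
data class BenchConfig(val name: String, val gpuLayers: Int, val cpuThreads: Int)

val benchmarkMatrix = listOf(
    BenchConfig("CPU Only", 0, 4),
    BenchConfig("Light GPU", 8, 4),
    BenchConfig("Medium GPU", 16, 4),
    BenchConfig("Balanced GPU", 24, 4),
    BenchConfig("Optimal GPU", 27, 4),
    BenchConfig("Custom (27L/4T)", 27, 4),
    BenchConfig("High GPU", 32, 4),
    BenchConfig("Very High GPU", 48, 2)
)

fun sweep(runBenchmark: (BenchConfig) -> Double) {
    for (config in benchmarkMatrix) {
        val tps = runBenchmark(config)
        println("${config.name} (${config.gpuLayers}L/${config.cpuThreads}T): %.2f tok/s".format(tps))
    }
}
```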
Optimal Settings
Based on our comprehensive benchmarking, we recommend the following optimal settings for Generate Nano deployments on Qualcomm chipsets.
Recommended Configuration
QCM6490 Optimal Settings
```json
{
  "model_config": {
    "gpu_layers": 27,
    "cpu_threads": 4,
    "processing_mode": "hybrid_gpu_heavy",
    "context_size": 2048,
    "batch_size": 512
  },
  "memory_config": {
    "max_memory_mb": 1600,
    "prefill_chunk_size": 512,
    "decode_chunk_size": 128
  },
  "performance_config": {
    "enable_kv_cache": true,
    "enable_attention_split": true,
    "enable_flash_attention": true,
    "enable_tensor_split": false
  }
}
```
Implementation Notes
- The QCM6490 benefits significantly from GPU acceleration but shows diminishing returns beyond 27 GPU layers.
- Memory usage remains relatively high across all configurations, so ensure your application has sufficient memory allocation.
- For battery-constrained devices, the Optimal GPU configuration provides the best balance of performance and power efficiency.
- If memory is a primary concern, the CPU Only configuration uses the least memory but at a significant performance cost.
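One way to act on the memory note above is a pre-flight check before loading the model. This sketch uses the standard Android ActivityManager API; the 1600 MB threshold mirrors the max_memory_mb value in the configuration above, and the fallback behavior is left to the application.

```kotlin
import android.app.ActivityManager
import android.content.Context

// Pre-flight check before loading the model on a QCM6490-class device.
// The 1600 MB default mirrors max_memory_mb in the configuration above.
fun hasEnoughMemoryForOptimalGpu(context: Context, requiredMb: Long = 1600): Boolean {
    val activityManager = context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    val memoryInfo = ActivityManager.MemoryInfo()
    activityManager.getMemoryInfo(memoryInfo)
    val availableMb = memoryInfo.availMem / (1024 * 1024)
    // Fall back to the CPU Only configuration if the device is memory-constrained.
    return availableMb >= requiredMb && !memoryInfo.lowMemory
}
```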
QCM8550 Optimal Settings
```json
{
  "model_config": {
    "gpu_layers": 27,
    "cpu_threads": 4,
    "processing_mode": "hybrid_gpu_heavy",
    "context_size": 2048,
    "batch_size": 512
  },
  "memory_config": {
    "max_memory_mb": 500,
    "prefill_chunk_size": 512,
    "decode_chunk_size": 128
  },
  "performance_config": {
    "enable_kv_cache": true,
    "enable_attention_split": true,
    "enable_flash_attention": true,
    "enable_tensor_split": true
  }
}
```
Implementation Notes
- The QCM8550 shows exceptional memory efficiency, using only 422MB in the optimal configuration.
- Unlike the QCM6490, the QCM8550 maintains excellent performance even in High GPU configurations.
- The Optimal GPU configuration provides the best results across all metrics: performance, efficiency, memory usage, and battery life.
- Enable tensor_split for the QCM8550 to take advantage of its advanced memory management capabilities.
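If both profiles are shipped with the application (for example as JSON assets), the appropriate one can be selected at runtime from the reported SoC. A minimal sketch follows, assuming hypothetical asset names qcm6490_optimal.json and qcm8550_optimal.json; Build.SOC_MODEL is only available on Android 12+ and the exact reported strings should be verified on your target hardware.

```kotlin
import android.os.Build

// Pick a settings profile based on the reported SoC.
// Asset names are hypothetical; verify the SOC_MODEL strings on your target devices.
fun selectConfigAsset(): String {
    val soc = if (Build.VERSION.SDK_INT >= 31) Build.SOC_MODEL else ""
    return when {
        soc.contains("8550") -> "qcm8550_optimal.json"
        soc.contains("6490") -> "qcm6490_optimal.json"
        // Conservative default: larger memory budget, tensor_split disabled.
        else -> "qcm6490_optimal.json"
    }
}
```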
Recommendations
Based on our benchmarking results, we provide the following recommendations for deploying Generate Nano on Qualcomm chipsets.
Deployment Recommendations
For QCM6490 Deployments
- Use the Optimal GPU Configuration: 27 GPU layers and 4 CPU threads provides the best balance of performance and efficiency.
- Monitor Memory Usage: Ensure your application has sufficient memory allocation (at least 1.6GB) for optimal performance.
- Avoid High GPU Configurations: Configurations with more than 27 GPU layers show decreased performance on the QCM6490.
- Consider Battery Impact: For battery-constrained devices, the Optimal GPU configuration provides the best performance per battery percentage.
For QCM8550 Deployments
- Leverage Memory Efficiency: The QCM8550's exceptional memory efficiency allows for running multiple models concurrently or using larger context windows.
- Use the Optimal GPU Configuration: 27 GPU layers and 4 CPU threads provides the best results across all metrics.
- Enable Advanced Features: The QCM8550 benefits from enabling tensor_split and flash_attention optimizations.
- Consider Multi-Model Deployments: The low memory footprint (422MB) makes the QCM8550 ideal for applications requiring multiple AI models.
General Recommendations
- Optimize for Your Use Case: Consider your specific requirements (performance, memory, battery life) when selecting a configuration.
- Test with Real-World Data: While our benchmarks provide a good baseline, testing with your specific prompts and workloads is recommended.
- Monitor Temperature: Extended AI inference can increase device temperature. Implement thermal monitoring and throttling if necessary.
- Update Regularly: Future SDK updates may provide additional optimizations for both chipsets.
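As noted above, sustained inference can raise device temperature. On Android, thermal pressure can be observed with the platform PowerManager API; the sketch below registers a listener and invokes a caller-supplied fallback. The specific response, such as dropping from the Optimal GPU configuration to a lighter one, is an illustrative choice left to the application.

```kotlin
import android.content.Context
import android.os.Build
import android.os.PowerManager

// Reduce inference load when the device reports thermal pressure (Android 10+).
// The fallback action (e.g. switching to a lighter configuration) is illustrative.
fun registerThermalThrottling(context: Context, onThrottle: () -> Unit) {
    if (Build.VERSION.SDK_INT < 29) return
    val powerManager = context.getSystemService(Context.POWER_SERVICE) as PowerManager
    powerManager.addThermalStatusListener { status ->
        if (status >= PowerManager.THERMAL_STATUS_SEVERE) {
            onThrottle() // e.g. switch from Optimal GPU to a lighter configuration
        }
    }
}
```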