GPU Benchmarking

Comprehensive performance analysis of Generate Nano on Qualcomm QCM6490 and QCM8550 chipsets.

Overview

Generate Nano has been optimized for Qualcomm's QCM6490 and QCM8550 chipsets, leveraging their GPU capabilities to maximize performance for on-device AI inference. This benchmarking report provides detailed performance metrics across various GPU/CPU configurations to help you determine the optimal settings for your deployment.

Executive Summary

  • QCM6490 Baseline: 7.75 tok/s
  • QCM6490 Best: 14.77 tok/s
  • QCM8550 Baseline: 12.26 tok/s
  • QCM8550 Best: 23.47 tok/s

Our benchmarking tests show that the QCM8550 delivers approximately 59% higher throughput than the QCM6490 when comparing optimal configurations (23.47 vs 14.77 tok/s). On both chipsets, the best GPU-accelerated configuration is about 1.9x faster than CPU-only processing.

Key Findings

  • Optimal GPU Configuration: For both chipsets, the "Optimal GPU" configuration (27 GPU layers, 4 CPU threads) provides the best balance of performance and efficiency.
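  • Performance Drop Beyond 27 Layers: Offloading more than 27 GPU layers reduces throughput on both chipsets. On the QCM6490 it falls below the CPU-only baseline (5.38 vs 7.75 tok/s); on the QCM8550 it falls from 23.47 to 18.14 tok/s.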
  • Memory Usage: The QCM8550 demonstrates significantly lower memory usage (422MB vs 1560MB) while delivering higher performance.
  • Efficiency: The QCM8550 achieves 5.85x better efficiency (tok/s/MB) compared to the QCM6490.

Chipset Comparison

Metric                  | QCM6490     | QCM8550     | Improvement
Peak Performance        | 14.77 tok/s | 23.47 tok/s | +59%
CPU Baseline            | 7.75 tok/s  | 12.26 tok/s | +58%
Memory Usage (Optimal)  | 1560 MB     | 422 MB      | -73%
Efficiency (tok/s/MB)   | 0.0095      | 0.0556      | +485%
Time to First Token     | 1815 ms     | 742 ms      | -59%
Total Processing Time   | 20310 ms    | 12781 ms    | -37%

The QCM8550 chipset demonstrates superior performance across all metrics, with particularly significant improvements in memory efficiency and response time. The reduced memory footprint allows for more complex models or concurrent applications on the same device.
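The Improvement column can be reproduced from the two measured columns. The following is a minimal sketch of that arithmetic; the function name `percent_change` and the `rows` dictionary are illustrative and not part of any Generate Nano API.

```python
def percent_change(baseline: float, new: float) -> float:
    """Signed percentage change going from baseline to new."""
    return (new - baseline) / baseline * 100.0


# (QCM6490 value, QCM8550 value) taken from the comparison table above.
rows = {
    "Peak Performance (tok/s)": (14.77, 23.47),
    "CPU Baseline (tok/s)": (7.75, 12.26),
    "Memory Usage, Optimal (MB)": (1560, 422),
    "Efficiency (tok/s/MB)": (0.0095, 0.0556),
    "Time to First Token (ms)": (1815, 742),
    "Total Processing Time (ms)": (20310, 12781),
}

if __name__ == "__main__":
    for metric, (qcm6490, qcm8550) in rows.items():
        print(f"{metric}: {percent_change(qcm6490, qcm8550):+.0f}%")
```

Note that a negative change is the desirable direction for memory usage and both timing metrics, while a positive change is desirable for throughput and efficiency.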

Methodology

Our benchmarking methodology ensures consistent and reliable performance measurements across different configurations and chipsets.

Test Environment

All tests were conducted using the following standardized environment:

  • Model: Interplay Think 0.6B Q4_0 (4-bit quantized, 600M parameters)
  • Prompt: Technical AI/ML explanation (~300 tokens)
  • Response Length: 300 tokens (consistent across all tests)
  • Temperature: 0.7
  • Ambient Temperature: 22-24°C (controlled environment)
  • Battery State: Plugged in, 100% charged
  • Background Processes: Minimized to ensure consistent results

Metrics Collected

  • TTFT (Time to First Token): Time from query submission to first token generation
  • Total Time: Complete processing time for the entire response
  • TPS (Tokens Per Second): Average number of tokens generated per second
  • Memory Usage: Peak and average memory consumption during inference
  • Power Efficiency: Tokens per second per percentage of battery used
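As an illustration of how the timing metrics relate, the sketch below measures TTFT, total time, and TPS over any iterable of tokens. It assumes TPS is computed as tokens divided by total time (including TTFT), which is consistent with the published figures (for example, 300 tokens / 20.310 s ≈ 14.77 tok/s). The names `measure_stream` and `fake_stream` are hypothetical; `fake_stream` merely stands in for a model's streaming output and is not a Generate Nano interface.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume a token stream and report TTFT, total time, and TPS."""
    start = clock()
    first_token_at: Optional[float] = None
    count = 0
    for _ in tokens:
        if first_token_at is None:
            first_token_at = clock()  # timestamp of the first generated token
        count += 1
    total_s = clock() - start
    ttft_ms = None if first_token_at is None else (first_token_at - start) * 1000.0
    return {
        "tokens": count,
        "ttft_ms": ttft_ms,
        "total_ms": total_s * 1000.0,
        # Assumption: TPS is averaged over the whole request, TTFT included.
        "tps": count / total_s if total_s > 0 else 0.0,
    }


def fake_stream(n: int) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model response."""
    for i in range(n):
        yield f"tok{i}"


if __name__ == "__main__":
    print(measure_stream(fake_stream(300)))
```

The `clock` parameter is injectable so the function can be checked against known timestamps without running a model.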

Configuration Matrix

We tested multiple configurations with varying distributions of workload between GPU and CPU:

  • GPU Layers: Number of transformer layers assigned to the GPU (0-48)
  • CPU Threads: Number of CPU threads utilized (4 in all configurations except Very High GPU, which is listed with 2)
  • Processing Mode: CPU Only, Hybrid (Balanced), Hybrid (GPU-Heavy), High GPU, Max GPU

QCM6490 Benchmarks

The Qualcomm QCM6490 chipset reaches 14.77 tok/s for on-device AI inference with the Optimal GPU configuration, about 1.9x its CPU-only throughput of 7.75 tok/s.

Performance Metrics

QCM6490: Tokens Per Second (TPS) and related metrics by configuration

Configuration    | TTFT (ms) | Total (ms) | TPS   | Memory (MB) | Efficiency (tok/s/MB) | Rating
Optimal GPU      | 1815      | 20310      | 14.77 | 1560        | 0.0095                | Very Good
Custom (27L/4T)  | 1820      | 20326      | 14.76 | 1546        | 0.0095                | Very Good
Balanced GPU     | 2156      | 21824      | 13.75 | 1547        | 0.0089                | Very Good
Medium GPU       | 3290      | 27608      | 10.87 | 1537        | 0.0071                | Good
Light GPU        | 4411      | 33496      | 8.96  | 1549        | 0.0058                | Fair
CPU Only         | 5426      | 38706      | 7.75  | 1453        | 0.0053                | Fair
High GPU         | 1785      | 55724      | 5.38  | 1820        | 0.0030                | Poor
Very High GPU    | 1804      | 55996      | 5.36  | 1830        | 0.0029                | Poor

QCM6490 Recommendations

  • For Maximum Performance: Use Optimal GPU configuration (27 GPU layers, 4 CPU threads) - 14.77 tok/s (1.91x speedup over CPU-only)
  • For Best Efficiency: Use Custom (27L/4T) configuration - 0.0095 tok/s/MB
  • For Lowest Memory: Use CPU Only configuration - 1453 MB
  • For Best Battery Life: Use Optimal GPU configuration - 14771 tok/s per % battery
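The first three recommendations follow mechanically from the QCM6490 results table. The sketch below shows one way to derive them, assuming efficiency is TPS divided by memory in MB (which matches the published column). The `Result` type and variable names are illustrative only. Optimal GPU and Custom (27L/4T) both round to 0.0095 tok/s/MB; the unrounded ratio is what puts Custom slightly ahead.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str
    ttft_ms: int
    total_ms: int
    tps: float
    memory_mb: int


# Rows copied from the QCM6490 results table.
QCM6490 = [
    Result("Optimal GPU", 1815, 20310, 14.77, 1560),
    Result("Custom (27L/4T)", 1820, 20326, 14.76, 1546),
    Result("Balanced GPU", 2156, 21824, 13.75, 1547),
    Result("Medium GPU", 3290, 27608, 10.87, 1537),
    Result("Light GPU", 4411, 33496, 8.96, 1549),
    Result("CPU Only", 5426, 38706, 7.75, 1453),
    Result("High GPU", 1785, 55724, 5.38, 1820),
    Result("Very High GPU", 1804, 55996, 5.36, 1830),
]


def efficiency(r: Result) -> float:
    """Throughput per megabyte of memory (tok/s/MB)."""
    return r.tps / r.memory_mb


fastest = max(QCM6490, key=lambda r: r.tps)
most_efficient = max(QCM6490, key=efficiency)
lowest_memory = min(QCM6490, key=lambda r: r.memory_mb)

if __name__ == "__main__":
    print("Maximum performance:", fastest.name)
    print("Best efficiency:", most_efficient.name, f"{efficiency(most_efficient):.4f}")
    print("Lowest memory:", lowest_memory.name, lowest_memory.memory_mb, "MB")
```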

QCM8550 Benchmarks

The Qualcomm QCM8550 chipset delivers exceptional performance for on-device AI inference, with significant improvements in both speed and efficiency compared to the QCM6490.

Performance Metrics

QCM8550: Tokens Per Second (TPS) and related metrics by configuration

Configuration  | TTFT (ms) | Total (ms) | TPS   | Memory (MB) | Efficiency (tok/s/MB) | Rating
Optimal GPU    | 742       | 12781      | 23.47 | 422         | 0.0556                | Excellent
Balanced GPU   | 911       | 13296      | 22.56 | 478         | 0.0472                | Excellent
High GPU       | 684       | 16542      | 18.14 | 515         | 0.0352                | Excellent
Very High GPU  | 721       | 16544      | 18.13 | 522         | 0.0347                | Excellent
Medium GPU     | 1486      | 17813      | 16.84 | 679         | 0.0248                | Excellent
Light GPU      | 2038      | 22837      | 13.14 | 873         | 0.0150                | Very Good
CPU Only       | 2432      | 24475      | 12.26 | 1071        | 0.0114                | Good

QCM8550 Recommendations

  • For Maximum Performance: Use Optimal GPU configuration (27 GPU layers, 4 CPU threads) - 23.47 tok/s (1.91x speedup over CPU-only)
  • For Best Efficiency: Use Optimal GPU configuration - 0.0556 tok/s/MB
  • For Lowest Memory: Use Optimal GPU configuration - 422 MB
  • For Best Battery Life: Use Optimal GPU configuration - 23472 tok/s per % battery
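To put the 1.91x figure in context, the speedup of every QCM8550 configuration over the CPU-only baseline can be computed from the TPS column. This is a small illustrative calculation; the dictionary and variable names are not part of any SDK.

```python
# TPS values copied from the QCM8550 results table.
QCM8550_TPS = {
    "Optimal GPU": 23.47,
    "Balanced GPU": 22.56,
    "High GPU": 18.14,
    "Very High GPU": 18.13,
    "Medium GPU": 16.84,
    "Light GPU": 13.14,
    "CPU Only": 12.26,
}

baseline = QCM8550_TPS["CPU Only"]

# Speedup factor of each configuration relative to CPU-only throughput.
speedups = {name: tps / baseline for name, tps in QCM8550_TPS.items()}

if __name__ == "__main__":
    for name, factor in sorted(speedups.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:<14} {factor:.2f}x")
```

Every GPU-assisted configuration on the QCM8550 comes out above 1.0x, which is not the case on the QCM6490, where the High GPU and Very High GPU configurations fall below the CPU-only baseline.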

Comparison

A direct comparison between the QCM6490 and QCM8550 chipsets reveals significant performance differences and optimization opportunities.

Performance Comparison

Optimal Configuration Performance (tok/s)

  • QCM8550: 23.47 tok/s
  • QCM6490: 14.77 tok/s

Memory Usage (MB)

  • QCM6490: 1560 MB
  • QCM8550: 422 MB

Efficiency (tok/s/MB)

  • QCM8550: 0.0556
  • QCM6490: 0.0095

The QCM8550 chipset demonstrates superior performance across all key metrics, with particularly dramatic improvements in memory efficiency. This makes it an ideal choice for deployments where multiple AI models need to run concurrently or where memory constraints are a concern.

GPU Configurations

Understanding the impact of different GPU/CPU workload distributions is crucial for optimizing Generate Nano's performance on Qualcomm chipsets.

Configuration Matrix

Configuration    | GPU Layers | CPU Threads | Mode               | Description
CPU Only         | 0          | 4           | CPU Only           | All processing on CPU cores, no GPU acceleration
Light GPU        | 8          | 4           | Hybrid (Balanced)  | Minimal GPU offloading, most work on CPU
Medium GPU       | 16         | 4           | Hybrid (Balanced)  | Balanced workload between CPU and GPU
Balanced GPU     | 24         | 4           | Hybrid (GPU-Heavy) | More work on GPU than CPU
Optimal GPU      | 27         | 4           | Hybrid (GPU-Heavy) | Optimal balance for both chipsets
Custom (27L/4T)  | 27         | 4           | Hybrid (GPU-Heavy) | Same as Optimal GPU, with custom parameters
High GPU         | 32         | 4           | High GPU           | Heavy GPU utilization
Very High GPU    | 48         | 2           | Max GPU            | Maximum GPU offloading, minimal CPU

Our testing reveals that the optimal configuration for both chipsets is 27 GPU layers with 4 CPU threads. This configuration achieves the best balance between performance and efficiency, leveraging the strengths of both the GPU and CPU.
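For applications that let users pick a configuration by name, the matrix above can be captured as a small lookup table. The sketch below is one hypothetical way to do that; `Preset`, `PRESETS`, and `get_preset` are illustrative names, not part of Generate Nano.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preset:
    gpu_layers: int
    cpu_threads: int
    mode: str


# Values copied from the configuration matrix above.
PRESETS = {
    "CPU Only": Preset(0, 4, "CPU Only"),
    "Light GPU": Preset(8, 4, "Hybrid (Balanced)"),
    "Medium GPU": Preset(16, 4, "Hybrid (Balanced)"),
    "Balanced GPU": Preset(24, 4, "Hybrid (GPU-Heavy)"),
    "Optimal GPU": Preset(27, 4, "Hybrid (GPU-Heavy)"),
    "Custom (27L/4T)": Preset(27, 4, "Hybrid (GPU-Heavy)"),
    "High GPU": Preset(32, 4, "High GPU"),
    "Very High GPU": Preset(48, 2, "Max GPU"),
}


def get_preset(name: str) -> Preset:
    """Return the preset for a configuration name, or raise ValueError."""
    try:
        return PRESETS[name]
    except KeyError:
        choices = ", ".join(sorted(PRESETS))
        raise ValueError(f"unknown configuration {name!r}; choose from: {choices}") from None


if __name__ == "__main__":
    print(get_preset("Optimal GPU"))
```

Freezing the dataclass keeps presets immutable, so a caller cannot accidentally alter a shared preset at runtime.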

Optimal Settings

Based on our comprehensive benchmarking, we recommend the following optimal settings for Generate Nano deployments on Qualcomm chipsets.

Recommended Configuration

QCM6490 Optimal Settings

{
  "model_config": {
    "gpu_layers": 27,
    "cpu_threads": 4,
    "processing_mode": "hybrid_gpu_heavy",
    "context_size": 2048,
    "batch_size": 512
  },
  "memory_config": {
    "max_memory_mb": 1600,
    "prefill_chunk_size": 512,
    "decode_chunk_size": 128
  },
  "performance_config": {
    "enable_kv_cache": true,
    "enable_attention_split": true,
    "enable_flash_attention": true,
    "enable_tensor_split": false
  }
}

Implementation Notes

  • The QCM6490 benefits significantly from GPU acceleration up to 27 GPU layers. Beyond that, throughput drops sharply: at 32 layers it falls to 5.38 tok/s, below the 7.75 tok/s CPU-only baseline.
  • Memory usage remains high across all configurations (1453 MB to 1830 MB), so ensure your application has sufficient memory allocation.
  • For battery-constrained devices, the Optimal GPU configuration provides the best balance of performance and power efficiency.
  • If memory is a primary concern, the CPU Only configuration uses the least memory but at a significant performance cost.

QCM8550 Optimal Settings

{
  "model_config": {
    "gpu_layers": 27,
    "cpu_threads": 4,
    "processing_mode": "hybrid_gpu_heavy",
    "context_size": 2048,
    "batch_size": 512
  },
  "memory_config": {
    "max_memory_mb": 500,
    "prefill_chunk_size": 512,
    "decode_chunk_size": 128
  },
  "performance_config": {
    "enable_kv_cache": true,
    "enable_attention_split": true,
    "enable_flash_attention": true,
    "enable_tensor_split": true
  }
}

Implementation Notes

  • The QCM8550 shows exceptional memory efficiency, using only 422MB in the optimal configuration.
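  • Unlike the QCM6490, the QCM8550 stays well above its CPU-only baseline in High GPU configurations (18.14 vs 12.26 tok/s), although throughput is still about 23% lower than with the Optimal GPU configuration.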
  • The Optimal GPU configuration provides the best results across all metrics: performance, efficiency, memory usage, and battery life.
  • Set enable_tensor_split to true for the QCM8550 to take advantage of its advanced memory management capabilities.
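The two recommended configurations differ only in max_memory_mb and enable_tensor_split. One way to avoid maintaining two nearly identical files is to generate both from a shared base with per-chipset overrides, as in the following sketch. The keys and values come from the JSON above; `BASE_CONFIG`, `OVERRIDES`, and `build_config` are hypothetical helper names.

```python
import copy
import json

# Settings shared by both chipsets, taken from the recommended configurations.
BASE_CONFIG = {
    "model_config": {
        "gpu_layers": 27,
        "cpu_threads": 4,
        "processing_mode": "hybrid_gpu_heavy",
        "context_size": 2048,
        "batch_size": 512,
    },
    "memory_config": {
        "prefill_chunk_size": 512,
        "decode_chunk_size": 128,
    },
    "performance_config": {
        "enable_kv_cache": True,
        "enable_attention_split": True,
        "enable_flash_attention": True,
    },
}

# Per-chipset differences.
OVERRIDES = {
    "QCM6490": {
        "memory_config": {"max_memory_mb": 1600},
        "performance_config": {"enable_tensor_split": False},
    },
    "QCM8550": {
        "memory_config": {"max_memory_mb": 500},
        "performance_config": {"enable_tensor_split": True},
    },
}


def build_config(chipset: str) -> dict:
    """Return a full configuration for the chipset without mutating the base."""
    if chipset not in OVERRIDES:
        raise ValueError(f"unsupported chipset: {chipset!r}")
    config = copy.deepcopy(BASE_CONFIG)
    for section, values in OVERRIDES[chipset].items():
        config[section].update(values)
    return config


if __name__ == "__main__":
    print(json.dumps(build_config("QCM8550"), indent=2))
```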

Recommendations

Based on our benchmarking results, we provide the following recommendations for deploying Generate Nano on Qualcomm chipsets.

Deployment Recommendations

For QCM6490 Deployments

  • Use the Optimal GPU Configuration: 27 GPU layers and 4 CPU threads provide the best balance of performance and efficiency.
  • Monitor Memory Usage: Ensure your application has sufficient memory allocation (at least 1.6GB) for optimal performance.
  • Avoid High GPU Configurations: On the QCM6490, configurations with more than 27 GPU layers run slower than CPU-only processing (about 5.4 tok/s vs 7.75 tok/s) and use more memory.
  • Consider Battery Impact: For battery-constrained devices, the Optimal GPU configuration provides the best performance per battery percentage.

For QCM8550 Deployments

  • Leverage Memory Efficiency: The QCM8550's exceptional memory efficiency allows for running multiple models concurrently or using larger context windows.
  • Use the Optimal GPU Configuration: 27 GPU layers and 4 CPU threads provide the best results across all metrics.
  • Enable Advanced Features: The QCM8550 benefits from enabling the enable_tensor_split and enable_flash_attention options.
  • Consider Multi-Model Deployments: The low memory footprint (422MB) makes the QCM8550 ideal for applications requiring multiple AI models.

General Recommendations

  • Optimize for Your Use Case: Consider your specific requirements (performance, memory, battery life) when selecting a configuration.
  • Test with Real-World Data: While our benchmarks provide a good baseline, testing with your specific prompts and workloads is recommended.
  • Monitor Temperature: Extended AI inference can increase device temperature. Implement thermal monitoring and throttling if necessary.
  • Update Regularly: Future SDK updates may provide additional optimizations for both chipsets.
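As a rough illustration of the thermal monitoring recommendation, the sketch below reads a temperature from the Linux thermal sysfs interface, which reports millidegrees Celsius, and decides whether to pause inference. Several things here are assumptions: that `thermal_zone0` is the relevant sensor, that the process is permitted to read it (on Android this is often restricted), and the 70 °C limit, which is a placeholder rather than a measured or vendor-recommended value.

```python
from pathlib import Path
from typing import Optional

# Assumption: zone 0 is the sensor of interest; real devices expose several zones.
THERMAL_ZONE = Path("/sys/class/thermal/thermal_zone0/temp")

# Hypothetical limit for illustration only; tune it for your device.
PAUSE_LIMIT_C = 70.0


def parse_millidegrees(raw: str) -> float:
    """Convert a sysfs reading in millidegrees Celsius to degrees Celsius."""
    return int(raw.strip()) / 1000.0


def read_temperature_c(path: Path = THERMAL_ZONE) -> Optional[float]:
    """Return the current temperature, or None if it cannot be read."""
    try:
        return parse_millidegrees(path.read_text())
    except (OSError, ValueError):
        return None


def should_pause(temp_c: Optional[float], limit_c: float = PAUSE_LIMIT_C) -> bool:
    """Pause inference only when a valid reading is at or above the limit."""
    return temp_c is not None and temp_c >= limit_c


if __name__ == "__main__":
    temp = read_temperature_c()
    print("temperature:", temp, "pause:", should_pause(temp))
```

An unreadable sensor is treated as "do not pause" here; a more conservative application might treat a missing reading as a reason to throttle instead.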