GPU Benchmarking
Comprehensive performance analysis of Generate Nano on Qualcomm QCM6490 and QCM8550 chipsets.
Overview
Generate Nano has been optimized for Qualcomm's QCM6490 and QCM8550 chipsets, leveraging their GPU capabilities to maximize performance for on-device AI inference. This benchmarking report provides detailed performance metrics across various GPU/CPU configurations to help you determine the optimal settings for your deployment.
Executive Summary
| Chipset | Baseline (CPU only) | Best (Optimal GPU) |
|---|---|---|
| QCM6490 | 7.75 tok/s | 14.77 tok/s |
| QCM8550 | 12.26 tok/s | 23.47 tok/s |
Our benchmarking tests demonstrate that the QCM8550 chipset delivers approximately 59% higher performance than the QCM6490 when comparing optimal configurations. Both chipsets show significant performance improvements when utilizing GPU acceleration compared to CPU-only processing.
Key Findings
- Optimal GPU Configuration: For both chipsets, the "Optimal GPU" configuration (27 GPU layers, 4 CPU threads) provides the best balance of performance and efficiency.
- Diminishing Returns: Configurations with more than 27 GPU layers show diminishing returns and can even decrease performance.
- Memory Usage: The QCM8550 demonstrates significantly lower memory usage (422MB vs 1560MB) while delivering higher performance.
- Efficiency: The QCM8550 achieves 5.85x better efficiency (tok/s/MB) compared to the QCM6490.
Chipset Comparison
| Metric | QCM6490 | QCM8550 | Improvement |
|---|---|---|---|
| Peak Performance | 14.77 tok/s | 23.47 tok/s | +59% |
| CPU Baseline | 7.75 tok/s | 12.26 tok/s | +58% |
| Memory Usage (Optimal) | 1560 MB | 422 MB | -73% |
| Efficiency (tok/s/MB) | 0.0095 | 0.0556 | +485% |
| Time to First Token | 1815 ms | 742 ms | -59% |
| Total Processing Time | 20310 ms | 12781 ms | -37% |
The QCM8550 chipset demonstrates superior performance across all metrics, with particularly significant improvements in memory efficiency and response time. The reduced memory footprint allows for more complex models or concurrent applications on the same device.
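The derived columns follow directly from the raw measurements in the table: for example, the efficiency improvement is 0.0556 / 0.0095 ≈ 5.85x, i.e. roughly +485%, and the TTFT reduction is 1 − 742 / 1815 ≈ 59%.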
Methodology
Our benchmarking methodology ensures consistent and reliable performance measurements across different configurations and chipsets.
Test Environment
All tests were conducted using the following standardized environment:
- Model: Interplay Think 0.6B Q4_0 (4-bit quantized, 600M parameters)
- Prompt: Technical AI/ML explanation (~300 tokens)
- Response Length: 300 tokens (consistent across all tests)
- Temperature: 0.7
- Ambient Temperature: 22-24°C (controlled environment)
- Battery State: Plugged in, 100% charged
- Background Processes: Minimized to ensure consistent results
Metrics Collected
- TTFT (Time to First Token): Time from query submission to first token generation
- Total Time: Complete processing time for the entire response
- TPS (Tokens Per Second): Average number of tokens generated per second
- Memory Usage: Peak and average memory consumption during inference
- Power Efficiency: Tokens per second per percentage of battery used
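These figures can be reproduced from raw per-run timings with simple arithmetic. The sketch below is illustrative only (the types are not part of the Generate Nano SDK) and shows how TTFT, TPS, and the efficiency metric are derived:

```kotlin
// Illustrative derivation of the reported metrics from raw run data.
// These types are not part of the Generate Nano SDK.
data class InferenceRun(
    val submittedAtMs: Long,   // query submission timestamp
    val firstTokenAtMs: Long,  // timestamp of the first generated token
    val finishedAtMs: Long,    // timestamp of the last generated token
    val tokensGenerated: Int,  // 300 in these benchmarks
    val peakMemoryMb: Double   // peak memory during inference
)

data class RunMetrics(
    val ttftMs: Long,
    val totalMs: Long,
    val tps: Double,
    val efficiency: Double     // tok/s per MB
)

fun computeMetrics(run: InferenceRun): RunMetrics {
    val ttftMs = run.firstTokenAtMs - run.submittedAtMs
    val totalMs = run.finishedAtMs - run.submittedAtMs
    val tps = run.tokensGenerated / (totalMs / 1000.0)
    return RunMetrics(ttftMs, totalMs, tps, tps / run.peakMemoryMb)
}
```

For example, 300 tokens generated in 12,781 ms gives 300 / 12.781 ≈ 23.5 tok/s, which matches the QCM8550 Optimal GPU row reported below.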
Configuration Matrix
We tested multiple configurations with varying distributions of workload between GPU and CPU:
- GPU Layers: Number of transformer layers assigned to the GPU (0-48)
- CPU Threads: Number of CPU threads utilized (4 in every configuration except Very High GPU, which uses 2)
- Processing Mode: CPU Only, Hybrid (Balanced), Hybrid (GPU-Heavy), High GPU, Max GPU
QCM6490 Benchmarks
The Qualcomm QCM6490 chipset demonstrates strong performance for on-device AI inference, particularly when utilizing the optimal GPU configuration.
Performance Metrics
| Configuration | TTFT (ms) | Total Time (ms) | TPS (tok/s) | Memory (MB) | Efficiency (tok/s/MB) | Rating |
|---|---|---|---|---|---|---|
| Optimal GPU | 1815 | 20310 | 14.77 | 1560 | 0.0095 | Very Good |
| Custom (27L/4T) | 1820 | 20326 | 14.76 | 1546 | 0.0095 | Very Good |
| Balanced GPU | 2156 | 21824 | 13.75 | 1547 | 0.0089 | Very Good |
| Medium GPU | 3290 | 27608 | 10.87 | 1537 | 0.0071 | Good |
| Light GPU | 4411 | 33496 | 8.96 | 1549 | 0.0058 | Fair |
| CPU Only | 5426 | 38706 | 7.75 | 1453 | 0.0053 | Fair |
| High GPU | 1785 | 55724 | 5.38 | 1820 | 0.0030 | Poor |
| Very High GPU | 1804 | 55996 | 5.36 | 1830 | 0.0029 | Poor |
QCM6490 Recommendations
- For Maximum Performance: Use Optimal GPU configuration (27 GPU layers, 4 CPU threads) - 14.77 tok/s (1.91x speedup over CPU-only)
- For Best Efficiency: Use Custom (27L/4T) configuration - 0.0095 tok/s/MB
- For Lowest Memory: Use CPU Only configuration - 1453 MB
- For Best Battery Life: Use Optimal GPU configuration - 14771 tok/s per % battery
QCM8550 Benchmarks
The Qualcomm QCM8550 chipset delivers exceptional performance for on-device AI inference, with significant improvements in both speed and efficiency compared to the QCM6490.
Performance Metrics
| Configuration | TTFT (ms) | Total Time (ms) | TPS (tok/s) | Memory (MB) | Efficiency (tok/s/MB) | Rating |
|---|---|---|---|---|---|---|
| Optimal GPU | 742 | 12781 | 23.47 | 422 | 0.0556 | Excellent |
| Balanced GPU | 911 | 13296 | 22.56 | 478 | 0.0472 | Excellent |
| High GPU | 684 | 16542 | 18.14 | 515 | 0.0352 | Excellent |
| Very High GPU | 721 | 16544 | 18.13 | 522 | 0.0347 | Excellent |
| Medium GPU | 1486 | 17813 | 16.84 | 679 | 0.0248 | Excellent |
| Light GPU | 2038 | 22837 | 13.14 | 873 | 0.0150 | Very Good |
| CPU Only | 2432 | 24475 | 12.26 | 1071 | 0.0114 | Good |
QCM8550 Recommendations
- For Maximum Performance: Use Optimal GPU configuration (27 GPU layers, 4 CPU threads) - 23.47 tok/s (1.91x speedup over CPU-only)
- For Best Efficiency: Use Optimal GPU configuration - 0.0556 tok/s/MB
- For Lowest Memory: Use Optimal GPU configuration - 422 MB
- For Best Battery Life: Use Optimal GPU configuration - 23472 tok/s per % battery
Comparison
A direct comparison between the QCM6490 and QCM8550 chipsets reveals significant performance differences and optimization opportunities.
Performance Comparison
Figures: optimal-configuration throughput (tok/s), memory usage (MB), and efficiency (tok/s/MB) for each chipset.
The QCM8550 chipset demonstrates superior performance across all key metrics, with particularly dramatic improvements in memory efficiency. This makes it an ideal choice for deployments where multiple AI models need to run concurrently or where memory constraints are a concern.
GPU Configurations
Understanding the impact of different GPU/CPU workload distributions is crucial for optimizing Generate Nano's performance on Qualcomm chipsets.
Configuration Matrix
| Configuration | GPU Layers | CPU Threads | Mode | Description |
|---|---|---|---|---|
| CPU Only | 0 | 4 | CPU Only | All processing on CPU cores, no GPU acceleration |
| Light GPU | 8 | 4 | Hybrid (Balanced) | Minimal GPU offloading, most work on CPU |
| Medium GPU | 16 | 4 | Hybrid (Balanced) | Balanced workload between CPU and GPU |
| Balanced GPU | 24 | 4 | Hybrid (GPU-Heavy) | More work on GPU than CPU |
| Optimal GPU | 27 | 4 | Hybrid (GPU-Heavy) | Optimal balance for both chipsets |
| Custom (27L/4T) | 27 | 4 | Hybrid (GPU-Heavy) | Same as Optimal GPU, with custom parameters |
| High GPU | 32 | 4 | High GPU | Heavy GPU utilization |
| Very High GPU | 48 | 2 | Max GPU | Maximum GPU offloading, minimal CPU |
Our testing reveals that the optimal configuration for both chipsets is 27 GPU layers with 4 CPU threads. This configuration achieves the best balance between performance and efficiency, leveraging the strengths of both the GPU and CPU.
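A sweep over this matrix can be expressed as plain data and iterated, which makes it straightforward to repeat the measurements on your own hardware. The sketch below is hedged: `runBenchmark` stands in for whatever inference entry point your integration exposes and is assumed to return the measured tok/s.

```kotlin
// Hypothetical sweep over the benchmark configuration matrix above.
// runBenchmark() stands in for your actual inference entry point.
data class BenchConfig(val name: String, val gpuLayers: Int, val cpuThreads: Int)

val benchmarkMatrix = listOf(
    BenchConfig("CPU Only", 0, 4),
    BenchConfig("Light GPU", 8, 4),
    BenchConfig("Medium GPU", 16, 4),
    BenchConfig("Balanced GPU", 24, 4),
    BenchConfig("Optimal GPU", 27, 4),
    BenchConfig("Custom (27L/4T)", 27, 4),
    BenchConfig("High GPU", 32, 4),
    BenchConfig("Very High GPU", 48, 2)
)

fun sweep(runBenchmark: (BenchConfig) -> Double) {
    for (config in benchmarkMatrix) {
        val tps = runBenchmark(config)
        println("${config.name} (${config.gpuLayers}L/${config.cpuThreads}T): %.2f tok/s".format(tps))
    }
}
```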
Optimal Settings
Based on our comprehensive benchmarking, we recommend the following optimal settings for Generate Nano deployments on Qualcomm chipsets.
Recommended Configuration
QCM6490 Optimal Settings
```json
{
  "model_config": {
    "gpu_layers": 27,
    "cpu_threads": 4,
    "processing_mode": "hybrid_gpu_heavy",
    "context_size": 2048,
    "batch_size": 512
  },
  "memory_config": {
    "max_memory_mb": 1600,
    "prefill_chunk_size": 512,
    "decode_chunk_size": 128
  },
  "performance_config": {
    "enable_kv_cache": true,
    "enable_attention_split": true,
    "enable_flash_attention": true,
    "enable_tensor_split": false
  }
}
```
Implementation Notes
- The QCM6490 benefits significantly from GPU acceleration but shows diminishing returns beyond 27 GPU layers.
- Memory usage remains relatively high across all configurations, so ensure your application has sufficient memory allocation.
- For battery-constrained devices, the Optimal GPU configuration provides the best balance of performance and power efficiency.
- If memory is a primary concern, the CPU Only configuration uses the least memory but at a significant performance cost.
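One way to act on the memory note above is a pre-flight check before loading the model. This sketch uses the standard Android ActivityManager API; the 1600 MB threshold mirrors the max_memory_mb value in the configuration above, and the fallback behavior is left to the application.

```kotlin
import android.app.ActivityManager
import android.content.Context

// Pre-flight check before loading the model on a QCM6490-class device.
// The 1600 MB default mirrors max_memory_mb in the configuration above.
fun hasEnoughMemoryForOptimalGpu(context: Context, requiredMb: Long = 1600): Boolean {
    val activityManager = context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    val memoryInfo = ActivityManager.MemoryInfo()
    activityManager.getMemoryInfo(memoryInfo)
    val availableMb = memoryInfo.availMem / (1024 * 1024)
    // Fall back to the CPU Only configuration if the device is memory-constrained.
    return availableMb >= requiredMb && !memoryInfo.lowMemory
}
```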
QCM8550 Optimal Settings
```json
{
  "model_config": {
    "gpu_layers": 27,
    "cpu_threads": 4,
    "processing_mode": "hybrid_gpu_heavy",
    "context_size": 2048,
    "batch_size": 512
  },
  "memory_config": {
    "max_memory_mb": 500,
    "prefill_chunk_size": 512,
    "decode_chunk_size": 128
  },
  "performance_config": {
    "enable_kv_cache": true,
    "enable_attention_split": true,
    "enable_flash_attention": true,
    "enable_tensor_split": true
  }
}
```
Implementation Notes
- The QCM8550 shows exceptional memory efficiency, using only 422MB in the optimal configuration.
- Unlike the QCM6490, the QCM8550 maintains excellent performance even in High GPU configurations.
- The Optimal GPU configuration provides the best results across all metrics: performance, efficiency, memory usage, and battery life.
- Enable tensor_split for the QCM8550 to take advantage of its advanced memory management capabilities.
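If both profiles are shipped with the application (for example as JSON assets), the appropriate one can be selected at runtime from the reported SoC. A minimal sketch follows, assuming hypothetical asset names qcm6490_optimal.json and qcm8550_optimal.json; Build.SOC_MODEL is only available on Android 12+ and the exact reported strings should be verified on your target hardware.

```kotlin
import android.os.Build

// Pick a settings profile based on the reported SoC.
// Asset names are hypothetical; verify the SOC_MODEL strings on your target devices.
fun selectConfigAsset(): String {
    val soc = if (Build.VERSION.SDK_INT >= 31) Build.SOC_MODEL else ""
    return when {
        soc.contains("8550") -> "qcm8550_optimal.json"
        soc.contains("6490") -> "qcm6490_optimal.json"
        // Conservative default: larger memory budget, tensor_split disabled.
        else -> "qcm6490_optimal.json"
    }
}
```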
Recommendations
Based on our benchmarking results, we provide the following recommendations for deploying Generate Nano on Qualcomm chipsets.
Deployment Recommendations
For QCM6490 Deployments
- Use the Optimal GPU Configuration: 27 GPU layers and 4 CPU threads provides the best balance of performance and efficiency.
- Monitor Memory Usage: Ensure your application has sufficient memory allocation (at least 1.6GB) for optimal performance.
- Avoid High GPU Configurations: Configurations with more than 27 GPU layers show decreased performance on the QCM6490.
- Consider Battery Impact: For battery-constrained devices, the Optimal GPU configuration provides the best performance per battery percentage.
For QCM8550 Deployments
- Leverage Memory Efficiency: The QCM8550's exceptional memory efficiency allows for running multiple models concurrently or using larger context windows.
- Use the Optimal GPU Configuration: 27 GPU layers and 4 CPU threads provides the best results across all metrics.
- Enable Advanced Features: The QCM8550 benefits from enabling tensor_split and flash_attention optimizations.
- Consider Multi-Model Deployments: The low memory footprint (422MB) makes the QCM8550 ideal for applications requiring multiple AI models.
General Recommendations
- Optimize for Your Use Case: Consider your specific requirements (performance, memory, battery life) when selecting a configuration.
- Test with Real-World Data: While our benchmarks provide a good baseline, testing with your specific prompts and workloads is recommended.
- Monitor Temperature: Extended AI inference can increase device temperature. Implement thermal monitoring and throttling if necessary.
- Update Regularly: Future SDK updates may provide additional optimizations for both chipsets.
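As noted above, sustained inference can raise device temperature. On Android, thermal pressure can be observed with the platform PowerManager API; the sketch below registers a listener and invokes a caller-supplied fallback. The specific response, such as dropping from the Optimal GPU configuration to a lighter one, is an illustrative choice left to the application.

```kotlin
import android.content.Context
import android.os.Build
import android.os.PowerManager

// Reduce inference load when the device reports thermal pressure (Android 10+).
// The fallback action (e.g. switching to a lighter configuration) is illustrative.
fun registerThermalThrottling(context: Context, onThrottle: () -> Unit) {
    if (Build.VERSION.SDK_INT < 29) return
    val powerManager = context.getSystemService(Context.POWER_SERVICE) as PowerManager
    powerManager.addThermalStatusListener { status ->
        if (status >= PowerManager.THERMAL_STATUS_SEVERE) {
            onThrottle() // e.g. switch from Optimal GPU to a lighter configuration
        }
    }
}
```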