Chimera GPNPU delivers on-device LLM inference with multi-core clusters and chiplet scaling. Licensed by customers building LLM-capable chips today.
8B
Single-Cluster
20+
Tokens/sec
2 - 8
Cores per Cluster
30B
Via Chiplets
See It In Action
Working FPGA demo running at 50 MHz: watch QWEN 8B inference on Chimera GPNPU in real time.
Get a personalized walkthrough of LLM capabilities on Chimera
Customer Validated
Chimera GPNPU has been licensed by customers specifically for LLM workloads. Our implementation has been validated on customer emulation platforms, proving production readiness.
Customer License
Customer licenses Chimera GPNPU IP for LLM-capable chip design
Development
Customer ports a new LLM using the SDK toolkit, which provides attention, prefill, and KV-cache building blocks
Validation
Full model validated on customer's emulation platform with correct outputs
The RTL has been validated on real customer emulation environments. De-risked for your design.
New LLMs compile from ONNX to optimized C++ in weeks. The CGC compiler handles attention layers, position encoding, and quantization.
When the next breakthrough LLM is released, port it in software, with no silicon respin required. Your chip stays competitive.
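As a rough illustration of the porting flow described above, the Python sketch below exports a small open checkpoint to ONNX, which is the kind of artifact a graph compiler such as CGC consumes. The model id, file name, and opset are placeholders, and the actual CGC compilation step is vendor tooling that is not shown here.

```python
# Illustrative only: export a Hugging Face checkpoint to ONNX. The model id,
# output path, and opset below are placeholders, not part of the Chimera SDK.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B"   # small model chosen purely for illustration
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
model.config.use_cache = False    # export the plain forward pass (no KV-cache I/O)
model.config.return_dict = False  # tuple outputs trace more cleanly to ONNX
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_id)
sample = tokenizer("Hello", return_tensors="pt")

# Dynamic axes keep batch size and sequence length flexible in the exported graph.
torch.onnx.export(
    model,
    (sample["input_ids"],),
    "qwen2_5_0_5b.onnx",
    input_names=["input_ids"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "logits": {0: "batch", 1: "seq"}},
    opset_version=17,
)
# The resulting .onnx graph is what a compiler like CGC would lower to
# optimized C++ for the target NPU; that step is tool-specific and omitted.
```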
Weeks
New Model Porting
8B
Parameters Validated
4
Cores in Cluster
Scalable Architecture
Chimera GPNPU scales seamlessly from single-core edge deployments to multi-die chiplet configurations, supporting LLMs up to 30B parameters.
Performance
Performance validated on emulation with INT4 quantization. Results shown for various LLM sizes on single- and multi-core configurations.
0.5B parameters · e.g., QWEN 2.5
0.6B parameters · e.g., QWEN 3
1.7B parameters · e.g., QWEN 3
4B parameters · e.g., QWEN 3
8B parameters · e.g., QWEN 3, LLaMA
* Smaller models run on a single-core configuration; the 8B model runs on a 4-core cluster at 1 GHz.
Validated Performance
8B parameter LLM on 4-core cluster with INT4 weights, validated on customer emulation platform.
~500ms
Time to first token (512-token context)
4GB
INT4 weight footprint
W8A8
8-bit weights, 8-bit activations
W4A8
4-bit weights, 8-bit activations
INT4
Full INT4 computation
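As a quick sanity check on the numbers above, the sketch below computes the weight footprint of an 8B-parameter model under each listed quantization mode using the standard bits-per-weight figures; it is back-of-envelope arithmetic, not output from the Chimera toolchain, and it ignores per-group scale overhead.

```python
# Back-of-envelope weight footprint for an 8B-parameter model under the
# quantization modes listed above (scale/zero-point overhead ignored).
PARAMS = 8e9  # 8B parameters, as in the validated configuration above

bits_per_weight = {
    "W8A8": 8,  # 8-bit weights, 8-bit activations
    "W4A8": 4,  # 4-bit weights, 8-bit activations
    "INT4": 4,  # full INT4 compute path
}

for mode, bits in bits_per_weight.items():
    gigabytes = PARAMS * bits / 8 / 1e9  # bits -> bytes -> GB (decimal)
    print(f"{mode}: ~{gigabytes:.0f} GB of weights")

# INT4/W4A8: 8e9 params * 4 bits = 4e9 bytes, i.e. ~4 GB, matching the
# footprint quoted above; W8A8 roughly doubles that to ~8 GB.
```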
Memory-Bound Optimization
LLM autoregressive inference is memory-bandwidth bound. Chimera's architecture maximizes bandwidth utilization with optimized weight streaming and KV cache management.
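A simple roofline argument makes the bandwidth-bound claim concrete: during autoregressive decode, every generated token must stream essentially all of the weights (plus the KV cache) from memory, so throughput is capped by bandwidth divided by bytes moved per token. The bandwidth figure in the sketch below is an assumed example, not a Chimera specification; the weight size comes from the 8B INT4 configuration above.

```python
# Roofline-style decode ceiling: tokens/sec <= bandwidth / bytes-per-token.
WEIGHT_BYTES = 4e9      # 8B parameters at INT4 (~4 GB, see above)
KV_BYTES_PER_TOKEN = 0  # KV-cache traffic ignored here; it grows with context
MEM_BANDWIDTH = 100e9   # assumed 100 GB/s memory bandwidth (illustrative only)

ceiling = MEM_BANDWIDTH / (WEIGHT_BYTES + KV_BYTES_PER_TOKEN)
print(f"Decode ceiling: ~{ceiling:.0f} tokens/sec")
# ~25 tokens/sec under these assumptions, the same order as the 20+ tok/sec
# quoted for the 8B model, which is why weight streaming and KV-cache
# management dominate the design rather than raw compute.
```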
Model Support
Chimera's programmable architecture supports any transformer-based LLM. When a new model is released, port it in software—no silicon changes required.
Architecture Support
Chimera supports standard transformer architectures used by modern LLMs. Port any model from ONNX to optimized C++.
Alibaba
QWEN 2.5 and QWEN 3 models validated on customer emulation. 8B model running at 20+ tok/sec.
Meta
Meta's LLaMA family of foundation LLMs, with a multi-core implementation available.
Custom
Bring your own model. CGC compiler handles ONNX conversion to optimized C++ automatically.
Core architectural features supported for modern LLMs
GQA (grouped-query attention) support for efficient key/value sharing across query heads (see the sketch after this list)
Rotary position embeddings via optimized custom operations
Efficient autoregressive decoding with optimized cache management
Support for 150K+ token vocabularies with efficient gather operations
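To make the first item concrete, the sketch below shows grouped-query attention in plain NumPy: several query heads share one key/value head, so the KV cache shrinks by the group factor. Head counts and sizes are illustrative, not Chimera configuration values, and the causal mask is omitted for brevity.

```python
# Minimal grouped-query attention (GQA) sketch: 32 query heads share 8 KV heads.
import numpy as np

n_q_heads, n_kv_heads, head_dim, seq = 32, 8, 128, 16
group = n_q_heads // n_kv_heads  # 4 query heads per KV head

q = np.random.randn(n_q_heads, seq, head_dim)
k = np.random.randn(n_kv_heads, seq, head_dim)  # KV cache 4x smaller than MHA here
v = np.random.randn(n_kv_heads, seq, head_dim)

# Broadcast each KV head to its group of query heads instead of storing
# separate K/V per query head.
k_shared = np.repeat(k, group, axis=0)  # (32, seq, head_dim)
v_shared = np.repeat(v, group, axis=0)

scores = q @ k_shared.transpose(0, 2, 1) / np.sqrt(head_dim)  # (32, seq, seq)
scores -= scores.max(axis=-1, keepdims=True)                  # stable softmax
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
out = weights @ v_shared                                      # (32, seq, head_dim)
```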
New models in weeks: CGC compiler handles transformer architectures automatically. Port from ONNX without RTL changes.
Whether you're designing edge AI devices, automotive systems, or consumer electronics, Chimera GPNPU delivers the on-device LLM inference your customers demand.
Get detailed documentation, discuss your use case, and learn how Chimera can accelerate your LLM-enabled product roadmap.
Request More Information
Sign in to your existing account or create a new one to explore Chimera SDK and run models in our cloud development environment.
Launch DevStudio