Chimera GPNPU delivers on-device LLM inference with multi-core clusters and chiplet scaling. Licensed by customers building LLM-capable chips today.
8B
Single-Cluster
20+
Tokens/sec
2 - 8
Cores per Cluster
30B
Via Chiplets
See It In Action
Working FPGA demo running at 50 MHz: watch QWEN 8B inference on Chimera GPNPU in real time.
Get a personalized walkthrough of LLM capabilities on Chimera
Customer Validated
Chimera GPNPU has been licensed by customers specifically for LLM workloads. Our implementation has been validated on customer emulation platforms, proving production readiness.
Customer License
Customer licenses Chimera GPNPU IP for LLM-capable chip design
Development
Customer ports a new LLM using the SDK toolkit, which provides attention, prefill, and KV-cache building blocks
Validation
Full model validated on customer's emulation platform with correct outputs
The RTL has been validated on real customer emulation environments. De-risked for your design.
New LLMs compile from ONNX to optimized C++ in weeks. The CGC compiler handles attention layers, position encoding, and quantization.
When the next breakthrough LLM is released, port it in software, with no silicon respin required. Your chip stays competitive.
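As a rough illustration of the porting flow described above, the Python sketch below exports a small open checkpoint to ONNX, which is the kind of artifact a graph compiler such as CGC consumes. The model id, file name, and opset are placeholders, and the actual CGC compilation step is vendor tooling that is not shown here.

```python
# Illustrative only: export a Hugging Face checkpoint to ONNX. The model id,
# output path, and opset below are placeholders, not part of the Chimera SDK.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B"   # small model chosen purely for illustration
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
model.config.use_cache = False    # export the plain forward pass (no KV-cache I/O)
model.config.return_dict = False  # tuple outputs trace more cleanly to ONNX
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_id)
sample = tokenizer("Hello", return_tensors="pt")

# Dynamic axes keep batch size and sequence length flexible in the exported graph.
torch.onnx.export(
    model,
    (sample["input_ids"],),
    "qwen2_5_0_5b.onnx",
    input_names=["input_ids"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "logits": {0: "batch", 1: "seq"}},
    opset_version=17,
)
# The resulting .onnx graph is what a compiler like CGC would lower to
# optimized C++ for the target NPU; that step is tool-specific and omitted.
```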
Weeks
New Model Porting
8B
Parameters Validated
4
Cores in Cluster
Scalable Architecture
Chimera GPNPU scales seamlessly from single-core edge deployments to multi-die chiplet configurations, supporting LLMs up to 30B parameters.
Performance
Performance validated on emulation with INT4 quantization. Results shown for various LLM sizes on single- and multi-core configurations.
0.5B parameters · e.g., QWEN 2.5
0.6B parameters · e.g., QWEN 3
1.7B parameters · e.g., QWEN 3
4B parameters · e.g., QWEN 3
8B parameters · e.g., QWEN 3, LLaMA
* Smaller models run on a single-core configuration; the 8B model runs on a 4-core cluster at 1 GHz.
Validated Performance
8B parameter LLM on 4-core cluster with INT4 weights, validated on customer emulation platform.
~500ms
Time to first token (512-token context)
4GB
INT4 weight footprint
W8A8
8-bit weights, 8-bit activations
W4A8
4-bit weights, 8-bit activations
INT4
Full INT4 computation
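As a quick sanity check on the numbers above, the sketch below computes the weight footprint of an 8B-parameter model under each listed quantization mode using the standard bits-per-weight figures; it is back-of-envelope arithmetic, not output from the Chimera toolchain, and it ignores per-group scale overhead.

```python
# Back-of-envelope weight footprint for an 8B-parameter model under the
# quantization modes listed above (scale/zero-point overhead ignored).
PARAMS = 8e9  # 8B parameters, as in the validated configuration above

bits_per_weight = {
    "W8A8": 8,  # 8-bit weights, 8-bit activations
    "W4A8": 4,  # 4-bit weights, 8-bit activations
    "INT4": 4,  # full INT4 compute path
}

for mode, bits in bits_per_weight.items():
    gigabytes = PARAMS * bits / 8 / 1e9  # bits -> bytes -> GB (decimal)
    print(f"{mode}: ~{gigabytes:.0f} GB of weights")

# INT4/W4A8: 8e9 params * 4 bits = 4e9 bytes, i.e. ~4 GB, matching the
# footprint quoted above; W8A8 roughly doubles that to ~8 GB.
```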
Memory-Bound Optimization
LLM autoregressive inference is memory-bandwidth bound. Chimera's architecture maximizes bandwidth utilization with optimized weight streaming and KV cache management.
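A simple roofline argument makes the bandwidth-bound claim concrete: during autoregressive decode, every generated token must stream essentially all of the weights (plus the KV cache) from memory, so throughput is capped by bandwidth divided by bytes moved per token. The bandwidth figure in the sketch below is an assumed example, not a Chimera specification; the weight size comes from the 8B INT4 configuration above.

```python
# Roofline-style decode ceiling: tokens/sec <= bandwidth / bytes-per-token.
WEIGHT_BYTES = 4e9      # 8B parameters at INT4 (~4 GB, see above)
KV_BYTES_PER_TOKEN = 0  # KV-cache traffic ignored here; it grows with context
MEM_BANDWIDTH = 100e9   # assumed 100 GB/s memory bandwidth (illustrative only)

ceiling = MEM_BANDWIDTH / (WEIGHT_BYTES + KV_BYTES_PER_TOKEN)
print(f"Decode ceiling: ~{ceiling:.0f} tokens/sec")
# ~25 tokens/sec under these assumptions, the same order as the 20+ tok/sec
# quoted for the 8B model, which is why weight streaming and KV-cache
# management dominate the design rather than raw compute.
```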
Model Support
Chimera's programmable architecture supports any transformer-based LLM. When a new model is released, port it in software—no silicon changes required.
Architecture Support
Chimera supports standard transformer architectures used by modern LLMs. Port any model from ONNX to optimized C++.
Alibaba
QWEN 2.5 and QWEN 3 models validated on customer emulation. 8B model running at 20+ tok/sec.
Meta
Meta's LLaMA family of foundation LLMs, with a multi-core implementation available.
Custom
Bring your own model. CGC compiler handles ONNX conversion to optimized C++ automatically.
Core architectural features supported for modern LLMs
GQA (grouped-query attention) support for efficient key/value sharing across query heads (see the sketch after this list)
Rotary position embeddings via optimized custom operations
Efficient autoregressive decoding with optimized cache management
Support for 150K+ token vocabularies with efficient gather operations
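To make the first item concrete, the sketch below shows grouped-query attention in plain NumPy: several query heads share one key/value head, so the KV cache shrinks by the group factor. Head counts and sizes are illustrative, not Chimera configuration values, and the causal mask is omitted for brevity.

```python
# Minimal grouped-query attention (GQA) sketch: 32 query heads share 8 KV heads.
import numpy as np

n_q_heads, n_kv_heads, head_dim, seq = 32, 8, 128, 16
group = n_q_heads // n_kv_heads  # 4 query heads per KV head

q = np.random.randn(n_q_heads, seq, head_dim)
k = np.random.randn(n_kv_heads, seq, head_dim)  # KV cache 4x smaller than MHA here
v = np.random.randn(n_kv_heads, seq, head_dim)

# Broadcast each KV head to its group of query heads instead of storing
# separate K/V per query head.
k_shared = np.repeat(k, group, axis=0)  # (32, seq, head_dim)
v_shared = np.repeat(v, group, axis=0)

scores = q @ k_shared.transpose(0, 2, 1) / np.sqrt(head_dim)  # (32, seq, seq)
scores -= scores.max(axis=-1, keepdims=True)                  # stable softmax
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
out = weights @ v_shared                                      # (32, seq, head_dim)
```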
New models in weeks: CGC compiler handles transformer architectures automatically. Port from ONNX without RTL changes.
Whether you're designing edge AI devices, automotive systems, or consumer electronics, Chimera GPNPU delivers the on-device LLM inference your customers demand.
Get detailed documentation, discuss your use case, and learn how Chimera can accelerate your LLM-enabled product roadmap.
Request More Information
Sign in to your existing account or create a new one to explore Chimera SDK and run models in our cloud development environment.
Launch DevStudio