Chimera is a single licensable core—AXI-compliant, scalable from 1 to 864 TOPS. The unified pipeline handles matrix, vector, and scalar operations in one execution stream. No partitioning. No split toolchains. When the graph compiler can't handle an operator, C++ gets you out. Built to outlast your product lifecycle.

The Challenge
The Problem
Separate NPU, DSP, CPU. Multiple vendors. Months wiring them together.

Separate NPU, DSP, and CPU components that must be integrated, debugged, and maintained independently.
Each processor requires its own compiler and debugger. AI workloads partitioned across cores.
Hardware optimized for last year's models. New operators require silicon updates.
The Quadric Approach
Single Core. 100% C++ Programmable. Single Binary.

Matrix, vector, and scalar operations in one execution pipeline. No partitioning.
New operators added via C++ kernels after deployment. Never blocked by silicon.
One codestream, one toolchain, one debug environment. ONNX and C++ merge seamlessly.
Why Chimera
Simplify your SoC design and speed up porting of new AI models
Quadric's solution enables hardware developers to instantiate a single core that handles an entire AI/ML workload plus the digital signal processing and signal-conditioning tasks typically intermixed with inference. Dealing with a single core drastically simplifies hardware integration and eases performance optimization, and system design tasks such as profiling memory usage and estimating system power consumption become far more straightforward.
Quadric's Chimera GPNPU architecture dramatically simplifies software development since matrix, vector, and control code can all be handled in a single code stream. Graph code from the common training toolsets (TensorFlow, PyTorch, and ONNX formats) is compiled by the Quadric SDK and merged with signal processing code written in C++ into a single code stream running on a single processor core. The entire subsystem can be debugged in a single debug console.
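To make the single-code-stream idea concrete, here is a minimal C++ sketch of ordinary signal conditioning and a compiled ONNX graph living in one binary on one core. The header, the GraphModule class, and its run() call are hypothetical stand-ins for illustration, not the actual Quadric SDK interface.

```cpp
// Illustrative sketch only: "chimera/graph.h", GraphModule, and its run()
// method are hypothetical placeholders, not the actual Quadric SDK interface.
#include <cstddef>
#include <cstdio>
#include <vector>
#include "chimera/graph.h"   // hypothetical wrapper emitted by the graph compiler

// Ordinary C++ signal conditioning, compiled into the same binary and executed
// on the same core as the ONNX-derived graph code -- no separate DSP needed.
static void preEmphasis(std::vector<float>& samples, float alpha = 0.97f) {
    for (std::size_t i = samples.size() - 1; i > 0; --i) {
        samples[i] -= alpha * samples[i - 1];
    }
}

int main() {
    std::vector<float> frame(1024, 0.0f);         // stand-in for a captured audio frame
    preEmphasis(frame);                           // scalar/vector DSP code
    GraphModule model("keyword_spotter.bin");     // graph compiled offline by the SDK
    std::vector<float> scores = model.run(frame); // matrix code, same core, same debugger
    std::printf("top score: %f\n", scores.empty() ? 0.0 : scores[0]);
    return 0;
}
```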
A Chimera GPNPU can run any AI/ML graph that can be captured in ONNX, plus anything written in C++. This is incredibly powerful: SoC developers can quickly write code to implement new neural network operators and libraries long after the SoC has been taped out, which eliminates fear of the unknown and dramatically extends a chip's useful life. As ML models continue to evolve, this unified architecture helps future-proof chip design cycles.
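As a rough illustration of that idea (not the Quadric SDK's actual registration mechanism), the sketch below writes a GELU activation as a plain C++ kernel and registers it through a hypothetical hook so the compiler can dispatch an ONNX node to it after tape-out.

```cpp
// Illustrative sketch: the kernel body is plain C++; "chimera/custom_op.h" and
// registerCustomOp() are hypothetical names, not the actual Quadric SDK API.
#include <cmath>
#include <cstddef>
#include "chimera/custom_op.h"   // hypothetical custom-operator registration header

// A GELU activation written as an ordinary C++ kernel. Because the core is fully
// programmable, an operator like this can be added in software long after the
// SoC has taped out -- no silicon change required.
void geluKernel(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        const float x = in[i];
        out[i] = 0.5f * x * (1.0f + std::tanh(0.7978845608f * (x + 0.044715f * x * x * x)));
    }
}

// Hypothetical hook telling the graph compiler to dispatch an ONNX "Gelu" node
// to this C++ kernel whenever it appears in a model.
static const bool kGeluRegistered = registerCustomOp("Gelu", geluKernel);
```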
How It Works
Accelerator-level performance with full processor flexibility
Designed from the ground up to address the constantly evolving AI inference deployment challenges facing system-on-chip (SoC) developers, the Chimera GPNPU family has a simple yet powerful architecture with demonstrated improvements in matrix-computation performance over traditional approaches.
Matrix, vector and scalar code in one execution pipeline. No partitioning required.
Continuously optimize performance throughout a device's lifecycle via software updates.
Runs classic backbones, Transformers, LLMs, and networks not yet invented.
Chimera GPNPU Block Diagram
A hybrid Von Neumann + 2D SIMD architecture that unifies matrix, vector, and scalar operations in a single execution pipeline

The Chimera GPNPU is entirely driven by code, empowering developers to continuously optimize the performance of their models and algorithms throughout the device's lifecycle. That makes it ideal for running classic backbone networks, today's newest Transformers and Large Language Models, and whatever new networks are invented tomorrow.
Modern System-on-Chip architectures deploy complex algorithms that mix traditional C++ based code with newly emerging and fast-changing machine learning inference code. This combination is found in numerous chip subsystems, most prominently in vision and imaging subsystems, radar and lidar processing, communications baseband subsystems, and a variety of other data-rich processing pipelines.
Unlike heterogeneous alternatives, which require splitting AI/ML graph execution and tuning performance across two or three different cores, the Chimera GPNPU operates as a single software-controlled core, allowing complex parallel workloads to be expressed simply.
Technical Specifications
A hybrid Von Neumann + 2D SIMD architecture optimized for AI/ML inference
Product Portfolio
Spanning from single-core QC Nano to 8-way QC-Multi clusters. Fully synthesizable for any process technology.
Performance scaling across process nodes and configurations
The Chimera QC processor family spans a wide range of performance requirements. Because the core is fully synthesizable, a Chimera IP core can be implemented in any process technology, from older nodes to the most advanced. There is a Chimera processor that meets your performance goals for high-volume end applications, including mobile devices, digital home products, automotive, and network edge compute systems.
Power Efficiency
ML inference is a data & memory movement optimization problem, not a compute efficiency problem.
ML/AI inference solutions are most often performance- and power-limited by memory-system bandwidth. With most state-of-the-art AI models having millions or billions of parameters, fitting an entire model into on-chip memory within an advanced System-on-a-Chip is generally not possible. Therefore, smart management of on-chip storage for both weights and activations is a prerequisite to achieving high efficiency.
Key Insight: Compiler optimizations that keep data resident in the Register File or LRM yield significant power savings. The Chimera processor family overcomes these memory-management limitations by being fully programmable, with data movement orchestrated by compiler-driven DMA.
Chimera Graph Compiler (CGC) manages data movement across the memory hierarchy
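As a rough illustration of what compiler-driven DMA management automates, the sketch below shows the classic double-buffered tiling pattern in plain C++. The function names, tile size, and buffer placement here are illustrative assumptions, not Chimera-specific calls or parameters.

```cpp
// Conceptual sketch of the double-buffered tiling pattern that compiler-driven
// DMA management automates. dmaLoadAsync(), dmaWait(), computeTile(), and the
// tile size are illustrative assumptions, not Quadric SDK calls or Chimera parameters.
#include <cstddef>

constexpr std::size_t kTileElems = 4096;   // assumed on-chip tile size (illustrative)

extern void dmaLoadAsync(float* dst, const float* src, std::size_t n, int channel);
extern void dmaWait(int channel);
extern void computeTile(const float* weights, float* activations, std::size_t n);

// While tile t is being computed out of local memory, tile t+1 streams in from
// external DDR, keeping working data resident on-chip and hiding memory latency.
void runLayer(const float* ddrWeights, float* activations, std::size_t numTiles) {
    static float localBuf[2][kTileElems];  // ping-pong buffers in on-chip memory
    if (numTiles == 0) return;
    dmaLoadAsync(localBuf[0], ddrWeights, kTileElems, /*channel=*/0);
    for (std::size_t t = 0; t < numTiles; ++t) {
        dmaWait(static_cast<int>(t % 2));  // wait for tile t to finish loading
        if (t + 1 < numTiles) {            // prefetch the next tile concurrently
            dmaLoadAsync(localBuf[(t + 1) % 2],
                         ddrWeights + (t + 1) * kTileElems,
                         kTileElems, static_cast<int>((t + 1) % 2));
        }
        computeTile(localBuf[t % 2], activations, kTileElems);
    }
}
```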

Many second-generation NPU accelerators are hardwired finite state machines (FSMs) that offload a handful of performance-intensive, building-block AI operators. These FSM solutions deliver high efficiency only if the deployed network stays within the limited scope of operators hard-coded into the silicon. An FSM solution does not allow memory management strategies to be fine-tuned as network workloads evolve.
Technology Comparison
Understanding the key differences between traditional NPUs and General Purpose NPUs
Get the datasheet. Talk to our architects. See the benchmarks.