The Path to Ubiquitous AI, Visualized

AI inference is the most significant computational workload humanity has created. Every chat completion, every image generation, every code suggestion requires trillions of arithmetic operations.

The hardware running these workloads—GPUs designed for rendering pixels—was never built for this. A company called Taalas is rethinking the stack from the silicon up: custom chips where every transistor serves a single model. This is a deep dive into how that works.

The Memory Wall

When a language model generates a token, it reads essentially all of its weights from memory. For an 8B-parameter model at 16-bit precision, that's ~16 gigabytes of data. Every single token.

The arithmetic itself is simple—multiply and accumulate. The bottleneck is feeding data to the compute units fast enough. DRAM is dense and cheap but takes ~100 nanoseconds to access. On-chip SRAM is fast (~1ns) but tiny and expensive.
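
A rough calculation makes the wall concrete. In the sketch below, the ~100 GB/s bandwidth figure is an assumption chosen for illustration; the point is only that token rate is capped by how fast weights can be streamed, no matter how fast the multipliers are.

    # Back-of-the-envelope: token rate is capped by weight streaming, not by math.
    # The bandwidth number below is an assumption for illustration.
    params = 8e9                    # 8B-parameter model
    bytes_per_weight = 2            # 16-bit precision
    bytes_per_token = params * bytes_per_weight     # ~16 GB read per generated token

    dram_bandwidth = 100e9          # assume ~100 GB/s of ordinary DRAM bandwidth

    seconds_per_token = bytes_per_token / dram_bandwidth
    print(f"Weights streamed per token: {bytes_per_token / 1e9:.0f} GB")
    print(f"Best-case rate: {1 / seconds_per_token:.1f} tokens/sec, purely memory-bound")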

[Diagram: compute fed by on-chip SRAM (~1 ns) versus off-chip DRAM (~100 ns): a 100× access-latency gap across the bandwidth bottleneck]
On-chip SRAM is 100× faster than off-chip DRAM, but offers orders of magnitude less capacity.

This is the memory wall. Compute speed has grown exponentially over decades, but memory bandwidth hasn't kept pace. For AI inference—which is almost entirely memory-bound—this is the fundamental constraint.
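
Another way to see why inference is memory-bound: a matrix-vector multiply does about two operations (one multiply, one add) for every 2-byte weight it reads, so it can never use more compute than the memory system can feed. The hardware figures in this sketch are assumptions for illustration, not any specific product's spec.

    # Arithmetic intensity of matrix-vector multiply, the core of token generation.
    flops_per_weight = 2            # one multiply + one add per weight
    bytes_per_weight = 2            # 16-bit weights
    intensity = flops_per_weight / bytes_per_weight   # 1 FLOP per byte read

    peak_compute = 100e12           # assume 100 TFLOP/s of multiply-accumulate hardware
    memory_bw = 1e12                # assume 1 TB/s of memory bandwidth

    usable_compute = memory_bw * intensity            # compute the memory can actually feed
    utilization = usable_compute / peak_compute
    print(f"Usable compute: {usable_compute / 1e12:.0f} of {peak_compute / 1e12:.0f} TFLOP/s "
          f"({utilization:.0%} utilization)")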

The GPU Approach

Modern GPUs address the memory wall with brute force. High Bandwidth Memory (HBM) stacks DRAM dies vertically, connected via silicon interposers, delivering terabytes per second of bandwidth. Thousands of cores operate in parallel.
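
The effect of that brute force can be sketched with assumed bandwidth numbers: more bandwidth raises the ceiling on tokens per second, but the ceiling is still set by data movement.

    # The same ~16 GB of weights per token, streamed over two assumed bandwidths.
    bytes_per_token = 8e9 * 2       # 8B parameters at 16-bit

    for name, bandwidth in [("standard DRAM, ~100 GB/s", 100e9),
                            ("HBM stacks, ~3 TB/s", 3e12)]:
        ceiling = bandwidth / bytes_per_token
        print(f"{name:24s} -> ~{ceiling:.0f} tokens/sec ceiling")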

This works, but it's expensive. HBM costs 5-10× more per gigabyte than standard DRAM. Silicon interposers are among the most expensive pieces of modern chip packaging. Power consumption reaches 300-700 watts per chip, often demanding liquid cooling.

The entire modern AI infrastructure stack—advanced packaging, HBM, massive I/O bandwidth, liquid cooling—exists to work around the memory wall. What if you could eliminate it instead?

Compute Where the Data Lives

Taalas inverts the problem. Instead of moving data faster, they eliminate data movement entirely.

Their approach: embed compute circuits directly inside the memory array. Each memory cell stores a model weight AND performs the multiply-accumulate operation in place. The data never leaves the memory.

To understand what that means at the hardware level, compare how a single multiply operation works in each system.

Traditional: Read → Move → Compute
[Diagram: a weight of 0.73 is read from a DRAM cell (~100 ns), moved over the data bus into L2 cache, multiplied in the ALU (0.73 × 1.2 = 0.88), and written back (~100 ns): roughly 200 ns and 4 hops per multiply]
CIM: Multiply in Place
[Diagram: a single memory cell holds the weight 0.73 in its capacitor; the input 1.2 arrives on the word-line, the analog multiply (0.73 × 1.2 = 0.88) appears as current on the bit-line and flows to the next cell: ~1 ns, zero data movement]
In a traditional system, data travels through 4 stages for one multiply. In compute-in-memory, the multiply happens where the weight is stored.
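
The gap can be tallied directly from those latencies. The per-hop numbers below follow the diagrams above, with the bus and cache hops treated as a small assumed constant.

    # Approximate per-multiply cost in each organization (illustrative).
    dram_read_ns  = 100    # read the stored weight out of the DRAM cell
    move_ns       = 5      # assumed: bus transfer plus cache/register staging
    write_back_ns = 100    # write the product back out
    traditional_ns = dram_read_ns + move_ns + write_back_ns   # ~200 ns across four hops

    cim_ns = 1             # the cell multiplies where the weight already sits

    print(f"read -> move -> compute -> write back: ~{traditional_ns} ns per multiply")
    print(f"compute-in-memory:                     ~{cim_ns} ns per multiply")
    print(f"roughly a {round(traditional_ns / cim_ns, -1):.0f}x gap, almost all of it data movement")
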
Traditional GPU
[Diagram: weights stream from HBM across a narrow bandwidth bus to the GPU compute cores; most cores sit idle waiting for weights, with roughly 80% of the time spent on data movement rather than math]
Taalas CIM
[Diagram: the memory array is the compute engine; each cell stores a weight and multiplies it by its column's input in ~1 ns, so the array spends its time computing rather than moving data]
Traditional chips shuttle data between separate memory and compute through a narrow bus. Taalas processes data where it's stored—each cell holds a weight AND computes.

The memory array structure maps naturally to linear algebra. Each row stores weights for one output neuron. Each column carries one input activation. The result appears as current on the output lines—an entire matrix-vector multiply in a single cycle.

[Diagram: a 4×4 crossbar; input activations 1.0, 0.5, 0.8, 0.3 on the word-lines multiply the stored weights in each row, and the summed bit-line currents yield the output vector 0.88, 1.50, 1.40, 1.00 in one memory cycle]
A memory crossbar array computes matrix-vector multiplication in one cycle. Each cell stores a weight and multiplies it by the column input. Row currents sum to produce the output.

This is compute-in-memory at DRAM-level density. The bit-line and word-line structure of a memory array maps naturally to matrix-vector multiplication—the core operation of neural network inference. Store the weight matrix in the array, apply input activations as voltages, and read out the result as currents. The entire multiply happens in one memory access cycle.
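
A minimal numpy sketch of that mapping, treating the stored weights as plain floats and the summed bit-line currents as a dot product, with the same 4×4 values as the crossbar diagram above:

    import numpy as np

    # Weight matrix stored in the crossbar: one row per output, one column per input.
    W = np.array([[0.3, 0.7, 0.1, 0.5],
                  [0.8, 0.2, 0.6, 0.4],
                  [0.5, 0.9, 0.3, 0.7],
                  [0.1, 0.4, 0.8, 0.2]])

    # Input activations applied as voltages on the word-lines (one per column).
    x = np.array([1.0, 0.5, 0.8, 0.3])

    # Each cell multiplies its stored weight by its column's input; the current on
    # each row's bit-line sums those products. One "access" = one matrix-vector multiply.
    y = W @ x
    print(y)    # approx. [0.88 1.5 1.4 1.0]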

No HBM stacks. No silicon interposer. No massive I/O bandwidth. No liquid cooling.

The difference shows up in the physical hardware stack.

GPU System: Multi-Die Stack
[Diagram: GPU package: a silicon interposer and advanced packaging connect the GPU die (cores, cache, scheduler) to HBM stacks of 4× DRAM dies on the package substrate; data travels millimeters at 300-700 W]
Taalas HC1: Unified Die
[Diagram: Taalas HC1 package: a single 815 mm² die with 53B transistors where the memory is the compute; no interposer, no HBM; data travels nanometers at ~200 W]
A GPU system requires HBM stacks, a silicon interposer, and advanced packaging — all to shuttle data. Taalas puts everything on a single die.

One Chip, One Model

Taalas goes further. Each chip is hardwired for a specific model's computation graph. To understand what that means, consider how a GPU actually runs a model.

A transformer like Llama 3.1 8B has a fixed structure: an embedding layer, then 32 structurally identical transformer blocks (each containing self-attention and a feed-forward network), then a final output projection. This computation graph—the exact sequence of matrix multiplications—is known ahead of time and never changes.
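
Written out as data, that graph is just a fixed list of operations known before the first token is generated. The sketch below uses illustrative labels, not a real framework's operator names.

    # The static computation graph of a Llama-3.1-8B-style transformer.
    # Known entirely ahead of time; it never changes from token to token.
    def build_graph(num_layers: int = 32) -> list[str]:
        graph = ["embed"]
        for i in range(1, num_layers + 1):
            graph += [f"layer{i}.norm",
                      f"layer{i}.self_attention",    # QKV projections + attention
                      f"layer{i}.feed_forward"]      # FFN matmuls
        graph.append("output_projection")
        return graph

    ops = build_graph()
    print(len(ops), "ops per token, always in the same order")
    print(ops[:4], "...", ops[-1])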

On a GPU, this fixed graph is executed dynamically. For every operation, the processor fetches an instruction from memory, decodes it, schedules it to a free core, loads the relevant weights, executes the multiply, and writes the result back. Then repeats. Thousands of times per token.
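
In software terms, the GPU acts like an interpreter over that fixed graph, paying dispatch overhead on every step. Below is a toy cost model; all numbers are arbitrary units, picked so the result roughly echoes the ~80% waiting-for-data share in the GPU diagram earlier.

    # Toy cost model of dynamic execution: every op pays dispatch overhead
    # before any math happens. All costs are arbitrary, illustrative units.
    FETCH, DECODE, SCHEDULE = 1, 1, 1
    LOAD_WEIGHTS, WRITE_BACK = 15, 5    # data movement dominates
    EXECUTE = 5                         # the multiply-accumulate work itself

    overhead_per_op = FETCH + DECODE + SCHEDULE + LOAD_WEIGHTS + WRITE_BACK
    ops_per_token = 98                  # coarse: embed + 32 x (norm, attention, FFN) + output

    total = ops_per_token * (overhead_per_op + EXECUTE)
    math = ops_per_token * EXECUTE
    print(f"Share of time spent on actual math: {math / total:.0%}")   # ~18% in this toy model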

GPU · instruction cycle
[Diagram: the six-step cycle of fetch, decode, schedule, load weights, execute, write back, repeated roughly 4,096 times per token]
Taalas · hardwired pipeline
[Diagram: a token flows through embed, then 32 hardwired layers of norm, QKV, attention, and FFN, then the output projection, and exits as the next token]
The GPU repeats a 6-step instruction cycle thousands of times per token. In the ASIC, data flows straight through fixed silicon.

A Taalas chip eliminates all of that overhead. The model's computation graph is physically wired into the silicon. Layer 1's output connects directly to layer 2's input. The attention weights for each head sit in dedicated memory cells that also perform the multiply. There are no instructions to fetch, no cores to schedule, no results to shuttle around. The chip IS the model.

Think of it this way: a GPU is a programmable calculator—you enter each step, wait for the answer, enter the next step. A Taalas chip is a purpose-built machine—data flows in one end and the answer comes out the other.
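
The same contrast in code: a toy "hardwired" flow where the model is one fixed composition of stages, with no per-operation fetch, decode, or scheduling. Dimensions and weights are made up for illustration; this is a sketch of the idea, not Taalas's design.

    import numpy as np

    # Toy fixed pipeline: data goes in one end, the answer comes out the other.
    rng = np.random.default_rng(0)
    d = 16                                        # illustrative width, not the real model's
    W_embed, W1, W2, W_out = (rng.standard_normal((d, d)) * 0.1 for _ in range(4))

    def stage(W, x):
        return np.maximum(W @ x, 0.0)             # one fixed block: matmul + nonlinearity

    def chip(token):                              # "the chip IS the model": one wired path
        return W_out @ stage(W2, stage(W1, W_embed @ token))

    print(chip(rng.standard_normal(d))[:4])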

General Purpose
[Diagram: a general-purpose die: compute cores plus cache, instruction scheduler, registers, I/O, memory controller, and DMA, much of it idle during inference]
~40% of silicon active during inference
Model Specific
[Diagram: a model-specific die laid out as embed, attention and FFN blocks for all 32 layers, and the output projection]
100% serves the model
On a GPU, ~60% of silicon is infrastructure that doesn't compute tokens. On a model-specific ASIC, every block runs part of the model.

This also enables aggressive quantization. Standard models use 16-bit precision (2 bytes per weight). Taalas co-designs the quantization with the hardware—3-bit and 6-bit formats on HC1—shrinking model size by 3-5× with minimal quality loss. When you control both the silicon and the model mapping, you can tune precision to exactly what each layer needs.
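
The arithmetic behind that shrink, with the split between 3-bit and 6-bit weights treated as an assumption for illustration (the formats are stated above; the exact per-layer mix is not):

    # Footprint of an 8B-parameter model at different precisions.
    params = 8e9
    fp16_gb = params * 16 / 8 / 1e9               # 16 GB baseline

    # Assumed mix for illustration: three quarters of weights at 3-bit, the rest at 6-bit.
    avg_bits = 0.75 * 3 + 0.25 * 6                # 3.75 bits per weight on average
    mixed_gb = params * avg_bits / 8 / 1e9

    print(f"16-bit weights: {fp16_gb:.1f} GB")
    print(f"mixed 3/6-bit:  {mixed_gb:.1f} GB ({fp16_gb / mixed_gb:.1f}x smaller)")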

The tradeoff is real: each chip runs exactly one model. A new model requires a new chip. Taalas's bet is that a handful of dominant models—the ones handling billions of daily requests—justify dedicated silicon.

The Numbers

HC1, their first-generation platform, is hardwired for Llama 3.1 8B.

[Chart: GPU vs HC1 compared on speed in tokens/sec (GPU ~1,700), build cost (relative), and power (relative)]

Built by 24 people. $30M spent of more than $200M raised. The efficiency comes from total specialization: one chip, one model, compute in memory.

HC2, the second generation, adopts standard 4-bit floating-point formats with higher density and speed, targeting frontier-scale models.

Why This Matters

If AI inference can be made 10-20× cheaper and more efficient, it changes what's possible. Real-time AI in every device. Inference at the edge. AI as utility infrastructure, not a luxury compute resource.

Total specialization—one chip, one model, compute in memory—is Taalas's bet on the path to ubiquitous AI.

Silicon Vocabulary

DRAM, CIM, HBM, ASIC—these acronyms get thrown around constantly in chip discussions, but they mean very specific things. Understanding the distinction helps the concepts above click into place.

DRAM (Dynamic Random-Access Memory) · memory

The main memory in computers. Stores each bit as a charge on a tiny capacitor. Dense and cheap, but slow (~100ns access). Must be constantly refreshed because capacitors leak charge.

In this article: Taalas embeds compute directly inside DRAM cells, turning slow memory into fast compute.