
PNU — Physical Neural Unit

A carbon-based monochip architecture where the model IS the silicon. No DRAM. No HBM. No data movement. Just computation.


The Problem

GPUs and TPUs spend >90% of their energy moving weights between off-chip memory and compute units: the "von Neumann bottleneck" at exascale. An H100 burns 700W, and only about 10% of that goes to actual math.
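The imbalance follows directly from per-operation energy costs. The numbers below are illustrative assumptions, roughly in line with published 45 nm estimates (Horowitz, ISSCC 2014), not figures from this repository:

```python
# Illustrative per-op energies (ASSUMED, order-of-magnitude only; at 45 nm an
# off-chip DRAM read costs ~hundreds of pJ vs a few pJ for a 32-bit MAC).
E_DRAM_READ_PJ = 640.0   # fetch one 32-bit weight from off-chip DRAM
E_MAC_PJ = 4.0           # one 32-bit multiply-accumulate

# Worst case: every weight is fetched from DRAM once per use.
total_pj = E_DRAM_READ_PJ + E_MAC_PJ
print(f"energy spent on math: {E_MAC_PJ / total_pj:.1%}")  # well under 10%
```

Under these assumed costs, fetching a weight dwarfs using it, which is the gap a weight-stationary design removes.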

The Solution

Weight-stationary architecture. Weights are baked into on-chip SRAM at synthesis time. They never move. Activations flow through a systolic array as the sole moving data. No off-chip memory access. No cache misses. No DMA.

The result: 911× the theoretical throughput of an H100 at equal power, or equivalent throughput at 1/900th the energy.
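As a toy illustration of the weight-stationary dataflow (a sketch, not code from this repo): each processing element holds one fixed weight, activations stream past as the only moving data, and partial sums accumulate into an ordinary matrix-vector product.

```python
def weight_stationary_matvec(W, x_stream):
    """Toy weight-stationary dataflow: PE(i, j) holds fixed weight W[i][j];
    activations stream in one per step and are the only moving data."""
    n_rows = len(W)
    acc = [0.0] * n_rows                 # running partial sum per output row
    for j, x in enumerate(x_stream):     # one activation enters per step
        for i in range(n_rows):
            acc[i] += W[i][j] * x        # PE multiplies its resident weight
    return acc

W = [[1.0, 2.0],
     [3.0, 4.0]]                         # weights fixed at "synthesis time"
print(weight_stationary_matvec(W, [10.0, 1.0]))  # [12.0, 34.0], same as W @ x
```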

Architecture

| Feature | Detail |
|---|---|
| Systolic array | Weight-stationary, activations-only dataflow |
| Memory | On-chip SRAM only (no DRAM/HBM/cache hierarchy) |
| Modular | Identical tiles scale from phone → hyperscale |
| Substrate-agnostic | Same RTL ports to Si, SiC, graphene, diamond |
| 3D-ready | Vertical TSV stacking for 405B+ dense models |
| Defect-tolerant | Tile-level spare rows (Cerebras-style) |

The Name

| Acronym | Meaning | Does |
|---|---|---|
| CPU | Central Processing Unit | Runs instructions |
| GPU | Graphics Processing Unit | Renders pixels |
| TPU | Tensor Processing Unit | Multiplies matrices |
| PNU | Physical Neural Unit | IS the model |

A PNU doesn't load weights. It's manufactured with them.

Quick Start

git clone https://ethans.studio/pnu.git
cd pnu
pip install -r requirements.txt
python verify.py              # Golden model (11/11 tests)
cd c_sim && make && ./lumi_sim ../config.json   # C simulator
cd ../sim && make             # RTL (requires iverilog)

Repository

pnu/
├── golden/                    Golden model (Python, 11/11 tests)
├── c_sim/                     Cycle-accurate C simulator
├── rtl/                       SystemVerilog (8 modules)
├── sim/                       Testbenches + verification vectors
├── docs/                      Planning, paper outline, research
├── scripts/                   Code generation from config.json
├── fpga_build/                ECP5 synthesis (Yosys + nextpnr)
│
├── thermal_model.py           712W, 82°C junction
├── yield_model.py             $2,357/die with defect tolerance
├── materials_model.py         Graphene vs Diamond vs SiC
├── scaling_model.py           Phone → Datacenter scaling
├── tco_model.py               $0.03/1M tokens vs H100's $4.30
├── bridge_model.py            Co-design closes GPU-to-chip gap
│
├── verify.py                  Full verification suite
├── formal_verification.py     5 properties
└── config.json                Single source of truth

Current Status (May 2026)

  • Golden model: 11/11 tests passing
  • C simulator: Verified equivalent (1,000/1,000 stress test)
  • RTL: 8 SystemVerilog modules, iverilog-compatible
  • FPGA: ECP5 synthesis ready (6% LUTs for 8×8 array)
  • Paper: arXiv draft in progress (~22K words, 10 sections)
  • Next: Tiny Tapeout physical proof

Key Numbers

| Metric | PNU | H100 | Improvement |
|---|---|---|---|
| Tok/s (theoretical) | 72,878 | 80 | 911× |
| Tok/s (co-designed) | 4,721 | 80 | 59× |
| Power | 712W | 700W | |
| Cost/1M tokens | $0.03 | $4.30 | 143× |
| Die cost | $2,357 | ~$300 | |
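The ratio column can be cross-checked directly from the absolute figures above (all values taken from this README):

```python
# Cross-check the "Improvement" ratios from the absolute numbers in the table.
pnu_tok_s, h100_tok_s = 72_878, 80
print(round(pnu_tok_s / h100_tok_s))   # 911  (theoretical throughput ratio)
print(round(4_721 / h100_tok_s))       # 59   (co-designed throughput ratio)
print(round(4.30 / 0.03))              # 143  (cost per 1M tokens ratio)
```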

Design Philosophy

  • First principles. Every number traced to published sources.
  • Own your tradeoffs. State weaknesses with numbers.
  • Hardware/algorithm co-design. Model and chip designed together.
  • "God bows down to math." No claims without verification.

Paper

Forthcoming on arXiv (cs.AR). Authors: Gandalf R. Whale & L. Mithrandir. Independent Research, 2026.

License

MIT