Machine Learning Systems Intern
Hybrid SSM‑Transformer models have a unique advantage for on‑chip memory efficiency:
SSM layers compress sequence history into a fixed‑size recurrent state
Attention layers store key‑value caches that grow with context length
This leads to an important design question:
For a given model configuration and maximum context length, can on‑chip SRAM be sized so that inference runs entirely on chip, eliminating the need for slower off‑chip HBM or DRAM?
What the intern will work on:
The intern will model and analyze memory behavior during inference of hybrid SSM‑Transformer models, with a focus on avoiding off‑chip memory accesses. Key responsibilities include:
Modeling data movement between SRAM and HBM/DRAM during inference
Sweeping parameters such as:
SRAM capacity
Context length
Model dimensions
Mapping the feasibility boundary where inference can be performed fully on chip
Breaking down per‑layer memory working sets
Identifying when and why memory spills occur
Exploring tiling and scheduling strategies to extend the no‑spill region
Validating analytical results through simulation
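
To give a flavor of this kind of analysis, here is a minimal first-order sketch of the feasibility question. It is only an illustration under stated assumptions, not the team's actual model: the configuration fields, the function names, and the simple size formulas (SSM state as layers × d_model × d_state, KV cache as 2 × layers × KV heads × head dimension × tokens) are all hypothetical, and weights and activations are lumped into a single optional other_bytes term.

```python
from dataclasses import dataclass


@dataclass
class HybridConfig:
    # All fields are hypothetical illustration parameters, not taken from the posting.
    n_ssm_layers: int
    n_attn_layers: int
    d_model: int
    d_state: int             # assumed SSM state size per channel
    n_kv_heads: int
    head_dim: int
    bytes_per_elem: int = 2  # e.g. a 16-bit datatype


def ssm_state_bytes(cfg: HybridConfig) -> int:
    # Fixed-size recurrent state: does not depend on context length.
    return cfg.n_ssm_layers * cfg.d_model * cfg.d_state * cfg.bytes_per_elem


def kv_cache_bytes(cfg: HybridConfig, context_len: int) -> int:
    # Keys and values (factor 2) for every cached token in every attention layer.
    per_token = 2 * cfg.n_attn_layers * cfg.n_kv_heads * cfg.head_dim * cfg.bytes_per_elem
    return per_token * context_len


def fits_on_chip(cfg: HybridConfig, context_len: int, sram_bytes: int,
                 other_bytes: int = 0) -> bool:
    # other_bytes is a placeholder for weights/activations, which this sketch does not model.
    total = ssm_state_bytes(cfg) + kv_cache_bytes(cfg, context_len) + other_bytes
    return total <= sram_bytes


def max_context_on_chip(cfg: HybridConfig, sram_bytes: int, other_bytes: int = 0):
    # Largest context length with no spill; None means the KV cache never grows
    # (no attention layers), so context length is not the limiting factor.
    per_token = kv_cache_bytes(cfg, 1)
    if per_token == 0:
        return None
    budget = sram_bytes - ssm_state_bytes(cfg) - other_bytes
    return max(budget, 0) // per_token


if __name__ == "__main__":
    cfg = HybridConfig(n_ssm_layers=24, n_attn_layers=4, d_model=2048,
                       d_state=16, n_kv_heads=8, head_dim=128)
    # Sweep SRAM capacity and report the no-spill context limit for each size.
    for sram_mib in (16, 64, 256):
        limit = max_context_on_chip(cfg, sram_mib * 1024 * 1024)
        print(f"{sram_mib} MiB SRAM: max no-spill context = {limit} tokens")
```

Sweeping SRAM capacity, context length, and model dimensions through a function like fits_on_chip traces out a rough feasibility boundary; a real study would refine this with per-layer working sets, tiling, and scheduling effects.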