Your MacBook Can Do Autonomous AI Research Now

What if your MacBook could run autonomous AI research while you sleep?

That is exactly what autoresearch does. Andrej Karpathy built a system where an AI agent modifies a training script, trains a small language model for 5 minutes, checks if the result improved, and loops. You wake up to a log of experiments and (hopefully) a better model. The concept is wild, the code is real, and the results are legit. The catch? It was built for H100 GPUs.

I wanted to run it on my MacBook. So I ported the whole thing to Apple Silicon using MLX, Apple’s machine learning framework. No CUDA, no cloud GPU, no $3/hour rental fees. Just your Mac. The code is at github.com/thenamangoyal/autoresearch (Goyal, 2026).

Does it actually work? Here is the short version. An untrained model starts with a training loss of 9.01, which is exactly what random guessing over an 8,192-token vocabulary looks like (the cross-entropy of a uniform distribution is ln(8192) ≈ 9.01 nats). After just 5 minutes and 55 optimizer steps on an M1 Pro with 16GB, the loss drops to 6.76 and the validation score lands at 2.371 BPB. The model is clearly learning language structure. Give it more time and the numbers keep dropping: the community MLX fork on an M4 Max reports 1.808 BPB after a single run, and 1.295 BPB after letting the autonomous loop iterate overnight. On the original H100, Karpathy’s CUDA code fits ~11,500 steps into the same 5-minute window and reaches roughly 1.0 BPB. The model and the code are the same; the gap is that the Mac has roughly 96x less compute.

Results at a Glance

Here is what a single 5-minute run looks like on my M1 Pro 16GB:

| Metric | Value |
| --- | --- |
| Starting train loss | 9.012 |
| Final train loss | 6.762 |
| val_bpb | 2.371 |
| Steps | 55 |
| Tokens processed | 3.6M |
| Peak memory | 11.0 GB |
| Parameters | 11.5M |

The obvious question: how does this compare to the H100 the system was designed for?

{
  "data": [
    {
      "x": ["Tok/sec", "Steps in 5 min"],
      "y": [2500000, 11500],
      "name": "H100 (CUDA)",
      "type": "bar",
      "marker": {"color": "#636EFA"}
    },
    {
      "x": ["Tok/sec", "Steps in 5 min"],
      "y": [26000, 55],
      "name": "M1 Pro 16GB (MLX)",
      "type": "bar",
      "marker": {"color": "#EF553B"}
    }
  ],
  "layout": {
    "title": {"text": "H100 vs M1 Pro: Raw Throughput"},
    "yaxis": {"title": "Value", "type": "log"},
    "barmode": "group",
    "legend": {"yanchor": "top", "y": 0.99, "xanchor": "right", "x": 0.99}
  }
}

Yes, the H100 is roughly 96x faster. That is not the point. The point is that you can run this at all on a laptop, iterate on architecture ideas, and prototype training experiments before ever touching a cloud GPU. For learning and experimentation, the Mac is more than enough.

The --time-budget flag lets you scale up when you have time to spare:

{
  "data": [
    {
      "x": [3.6, 23, 47, 94],
      "y": ["5 min", "15 min", "30 min", "60 min"],
      "type": "bar",
      "orientation": "h",
      "marker": {"color": ["#636EFA", "#EF553B", "#00CC96", "#AB63FA"]},
      "text": ["~55 steps", "~357 steps", "~714 steps", "~1429 steps"],
      "textposition": "auto"
    }
  ],
  "layout": {
    "title": {"text": "Tokens Processed by Time Budget (M1 Pro)"},
    "xaxis": {"title": "Millions of Tokens"},
    "yaxis": {"title": ""},
    "showlegend": false,
    "margin": {"l": 80}
  }
}

In my experience, 15 to 30 minutes hits the sweet spot on Apple Silicon. You get enough steps for the optimizer to meaningfully converge, without burning hours waiting.

The Autonomous Research Loop

The genius of autoresearch is how simple the loop is. An AI agent (Claude, GPT, whatever you prefer) reads the instructions in program.md, tweaks the model code in train.py, commits the change, trains for exactly 5 minutes, checks BPB, and either keeps the commit or reverts. Then it does it again. All night if you let it.

flowchart TD
    A[Read program.md] --> B[Modify train.py]
    B --> C[Git commit]
    C --> D[Train for 5 min]
    D --> E[Read val_bpb]
    E --> F{Improved?}
    F -- Yes --> G[Keep commit]
    F -- No --> H[Git revert]
    G --> I[Log to results.tsv]
    H --> I
    I --> B

The design is intentionally constrained. One file to edit. One metric to optimize. A fixed time budget so every experiment is directly comparable regardless of what the agent changes: model size, architecture, optimizer, batch size, all of it is fair game inside train.py. The agent’s creativity is the only variable.
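The keep-or-revert step at the heart of the loop is only a few lines. Here is a minimal sketch of one iteration; `read_val_bpb` is a hypothetical helper (passed in as a callable) standing in for whatever parses the metric out of train.py's output:

```python
import subprocess


def should_keep(best_bpb: float, new_bpb: float) -> bool:
    """Lower bits-per-byte is better: keep the commit only on strict improvement."""
    return new_bpb < best_bpb


def iterate_once(best_bpb: float, read_val_bpb, train_cmd=("uv", "run", "train.py")) -> float:
    """One experiment: train under the fixed time budget, read the metric,
    then keep or revert the agent's last commit."""
    subprocess.run(train_cmd, check=True)      # 5-minute budgeted run
    new_bpb = read_val_bpb()                   # hypothetical: parse val_bpb from logs
    if should_keep(best_bpb, new_bpb):
        return new_bpb                         # commit stays; new baseline
    subprocess.run(("git", "revert", "--no-edit", "HEAD"), check=True)
    return best_bpb                            # experiment rejected
```

The strict `<` comparison matters: a tie means the change bought nothing, so it gets reverted to keep the codebase minimal.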

As Karpathy put it in the original repo:

One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using sound wave interconnect in the ritual of “group meeting”. That era is long gone.

On the Mac, each experiment takes about 5 minutes, which gives you roughly 12 experiments per hour and close to 100 if you let it run overnight. Not quite datacenter speed, but enough to explore a surprising number of architectural ideas.

Model Architecture

The model is a GPT variant with several modern tricks packed into a surprisingly compact 11.5M parameter design. I found it interesting how many recent ideas Karpathy managed to squeeze into a single training file. Here is the high-level flow:

flowchart TD
    T[Token IDs] --> WTE[Token Embedding]
    T --> VE[Value Embedding]
    WTE --> RS[Residual Scaling]
    RS --> N1[RMSNorm]
    N1 --> ATT[Self-Attention + RoPE]
    VE --> ATT
    ATT --> N2[RMSNorm]
    N2 --> MLP[MLP · Squared ReLU]
    MLP -->|repeat x DEPTH| RS
    MLP --> NF[RMSNorm]
    NF --> LM[lm_head]
    LM --> SC[Softcap · tanh]

Let me walk through what makes this architecture interesting.

Value Embeddings (ResFormer). This one surprised me. Every other layer gets its own value embedding table. Instead of computing values purely from the hidden state, the model mixes in a direct embedding lookup gated by a small linear projection. It is the ResFormer trick: a residual connection through the value path that helps gradient flow in deep models. In a 4-layer model it might seem unnecessary, but it lets the agent experiment with deeper configurations without running into vanishing gradient issues.
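A NumPy sketch of the idea follows. The names and the exact gating form are my paraphrase, not the code in train.py; here the gate is a per-channel tanh scalar initialized to zero, so the mixer starts as a no-op:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 8192, 64
value_embed = rng.normal(size=(vocab, d)) * 0.02   # per-layer value embedding table
gate_w = np.zeros(d)                               # learned gate, zero at init


def mix_values(v: np.ndarray, idx: np.ndarray) -> np.ndarray:
    """v: (T, d) values computed from the hidden state; idx: (T,) token ids.
    Adds a gated direct embedding lookup: a residual path through the values."""
    ve = value_embed[idx]              # (T, d) lookup straight from token ids
    return v + np.tanh(gate_w) * ve    # tanh(0) = 0, so this is v at init


v = rng.normal(size=(5, d))
out = mix_values(v, np.array([1, 2, 3, 4, 5]))
```

Because the gate starts at zero, the model can ignore the shortcut entirely and only open it where the direct token signal helps.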

Sliding Window Attention (SSSL). The window pattern SSSL means three layers of short-range (half context length) attention followed by one layer of long-range (full context) attention. On the H100, this is handled by Flash Attention 3. On MLX, I implemented it with additive masks instead. It is slower, but functionally equivalent. The last layer always gets full attention regardless of the pattern, so the model can always attend to the full 2,048-token context when it matters.
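The additive-mask approach is easy to sketch with NumPy: each mask puts 0 where attention is allowed and -inf where it is not, and is simply added to the attention scores before the softmax. This is a minimal illustration, not the MLX implementation itself:

```python
import numpy as np


def additive_mask(T: int, window: int) -> np.ndarray:
    """Causal sliding-window mask: position i may attend to j where
    i - window < j <= i. Disallowed positions get -inf."""
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    allowed = (j <= i) & (j > i - window)
    return np.where(allowed, 0.0, -np.inf)


T = 2048
pattern = "SSSL"  # three short-range layers, then one full-context layer
masks = [additive_mask(T, T // 2 if c == "S" else T) for c in pattern]
```

After adding the mask, softmax sends the -inf entries to zero probability, which is why this is functionally equivalent to (though slower than) a fused kernel that skips the masked positions outright.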

RoPE with QK-Norm. Queries and keys get rotary position embeddings followed by RMS normalization. This combination stabilizes training and eliminates the need for learned position embeddings. It is one of those small details that just works.
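A compact NumPy sketch of the combination, following the order described above (rotate, then RMS-normalize); the frequency base and shapes are illustrative, not copied from train.py:

```python
import numpy as np


def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotary position embedding on (T, d) queries or keys; d must be even.
    Each adjacent pair of channels is rotated by a position-dependent angle."""
    T, d = x.shape
    inv_freq = 1.0 / (base ** (np.arange(0, d, 2) / d))   # (d/2,) per-pair frequencies
    angles = np.arange(T)[:, None] * inv_freq[None, :]    # (T, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out


def rms_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Scale each row to unit root-mean-square (no learned gain in this sketch)."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)


q = np.random.default_rng(1).normal(size=(8, 16))
q_rot = rms_norm(rope(q))
```

Note that RoPE is a pure rotation, so position 0 passes through unchanged, and the QK-norm afterward pins every query and key to the same scale regardless of what training does to the projections.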

Softcap at 15.0. The logits pass through 15 * tanh(logits / 15) before the loss computation. This bounds logit magnitudes and prevents the model from becoming overconfident. The technique comes from Gemma 2 and is especially helpful early in training when the model might otherwise push logits to extreme values on frequent tokens.
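The softcap itself is a one-liner. It is near-identity for small logits and saturates smoothly at the cap:

```python
import numpy as np


def softcap(logits: np.ndarray, cap: float = 15.0) -> np.ndarray:
    """Smoothly bound logits to (-cap, cap); near-identity for |logits| << cap."""
    return cap * np.tanh(logits / cap)


x = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
y = softcap(x)
```

A logit of 1 comes out essentially unchanged, while a runaway logit of 100 gets squashed to just under 15, so gradients stay finite no matter how confident the model tries to be.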

Squared ReLU. The MLP activation is relu(x)² instead of GELU or SiLU. Simpler, faster, and produces sparser activations. In a small model where every FLOP counts, this is a sensible choice.
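The activation is exactly what the name says, applied elementwise:

```python
import numpy as np


def sq_relu(x: np.ndarray) -> np.ndarray:
    """relu(x)^2: exactly zero for negatives (sparse), quadratic for positives."""
    return np.square(np.maximum(x, 0.0))
```

Unlike GELU or SiLU, every negative input produces an exact zero, which is where the sparsity comes from.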

Per-Layer Residual Scaling. Each layer has two learnable scalars that control how much the running hidden state and the original embedding contribute. The core of the forward pass looks like this:

# blend the running residual stream (x) with the original token
# embedding (x0), each weighted by a learnable per-layer scalar
x = self.resid_lambdas[i] * x + self.x0_lambdas[i] * x0
# every other layer has a value-embedding table; None elsewhere
ve = self.value_embeds[str(i)](idx) if str(i) in self.value_embeds else None
x = block(x, ve, masks[i])

This gives the model fine-grained control over each layer’s contribution to the residual stream. In practice, I noticed the model learns to progressively reduce x0_lambda in later layers, relying more on the transformed representations as depth increases.

Optimizer: AdamW with Six Parameter Groups

This is not your typical “one learning rate for everything” AdamW. The optimizer splits the model into six groups, each with its own learning rate, betas, and weight decay. Getting this right turned out to be one of the trickier parts of the port.

| Group | LR | Betas | Weight decay | Scaling |
| --- | --- | --- | --- | --- |
| Embeddings (wte) | 0.6 | (0.8, 0.95) | 0.0 | 1/sqrt(d/768) |
| Value embeddings | 0.6 | (0.8, 0.95) | 0.0 | 1/sqrt(d/768) |
| Unembedding (lm_head) | 0.004 | (0.8, 0.95) | 0.0 | 1/sqrt(d/768) |
| Transformer matrices | 0.04 | (0.8, 0.95) | 0.2 | None |
| resid_lambdas | 0.005 | (0.8, 0.95) | 0.0 | None |
| x0_lambdas | 0.5 | (0.96, 0.95) | 0.0 | None |

A few things stand out. Embedding learning rates are 150x higher than unembedding rates, because embeddings need to move fast early in training while the output projection benefits from stability. Only the transformer matrices get weight decay. The x0_lambdas use a beta1 of 0.96 (vs 0.8 for everything else), giving them more momentum so they adjust slowly and smoothly. And embedding/unembedding LRs scale with 1/sqrt(model_dim/768), following the dimension-scaled convention from recent scaling law work.
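The table above translates to a group-construction function along these lines. This is a sketch of the configuration only (the actual train.py code matches real parameter tensors by name; here the groups just carry labels):

```python
import math


def build_param_groups(model_dim: int) -> list[dict]:
    """Six AdamW groups mirroring the table above. Embedding and unembedding
    learning rates scale with 1/sqrt(model_dim / 768)."""
    dim_scale = 1.0 / math.sqrt(model_dim / 768)
    return [
        {"name": "wte",           "lr": 0.6 * dim_scale,   "betas": (0.8, 0.95),  "weight_decay": 0.0},
        {"name": "value_embeds",  "lr": 0.6 * dim_scale,   "betas": (0.8, 0.95),  "weight_decay": 0.0},
        {"name": "lm_head",       "lr": 0.004 * dim_scale, "betas": (0.8, 0.95),  "weight_decay": 0.0},
        {"name": "matrices",      "lr": 0.04,              "betas": (0.8, 0.95),  "weight_decay": 0.2},
        {"name": "resid_lambdas", "lr": 0.005,             "betas": (0.8, 0.95),  "weight_decay": 0.0},
        {"name": "x0_lambdas",    "lr": 0.5,               "betas": (0.96, 0.95), "weight_decay": 0.0},
    ]
```

At the default model_dim of 768 the scale factor is exactly 1.0; quadruple the width and the embedding LRs halve.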

Weight decay schedule. One improvement I made over the reference MLX fork is a linear weight decay schedule. The effective weight decay is WEIGHT_DECAY * (1 - progress), which decays to zero by the end of training. The intuition is simple: during the warmdown phase, when the learning rate is already near zero, weight decay just pushes weights toward zero without the optimizer being able to pull them back. The upstream CUDA version does this, but the reference fork missed it. In my tests it typically yields a 2 to 5% BPB improvement, which is a meaningful gain for free.

LR schedule. No warmup at all (WARMUP_RATIO = 0.0). The second half of training is spent in linear warmdown to zero. This aggressive schedule makes sense for a 5-minute budget where you cannot afford to spend steps warming up.
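Both schedules are functions of training progress in [0, 1]. A minimal sketch (the piecewise warmdown shape is my reading of "second half linear to zero"; the exact breakpoint lives in train.py's config):

```python
def lr_multiplier(progress: float) -> float:
    """No warmup: full LR for the first half, then linear warmdown to zero."""
    return 1.0 if progress < 0.5 else 2.0 * (1.0 - progress)


def effective_weight_decay(base_wd: float, progress: float) -> float:
    """Linear decay-to-zero weight decay schedule."""
    return base_wd * (1.0 - progress)
```

At 75% through training the LR multiplier is 0.5 and the effective weight decay is a quarter of its base value, so regularization fades out in step with the optimizer's ability to respond to it.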

Data Pipeline

The training data comes from ClimbMix 400B, a curated web-scale dataset downloaded as parquet shards into ~/.cache/autoresearch/. Running prepare.py once downloads the data and trains a BPE tokenizer with a vocabulary of 8,192 tokens. The whole process takes about 2 minutes.

The dataloader uses BOS-aligned best-fit packing: documents are packed into sequences of 2,048 tokens, each starting with a BOS token. Documents that do not fit are split across sequences, but every sequence begins at a document boundary. This achieves close to 100% token utilization with no padding waste, which matters when you only have 3.6M tokens to work with in a 5-minute budget.
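A toy sketch of the packing idea helps make it concrete. This is a simplified greedy first-fit version, not the real best-fit heuristic, and it pads only the final leftover sequence (the real packer avoids even that):

```python
BOS, T = 0, 16  # toy context length; the real run uses 2048


def pack(docs: list[list[int]], seq_len: int = T) -> list[list[int]]:
    """Pack documents into fixed-length sequences, each starting with BOS.
    A document that overflows the current sequence is split across the boundary."""
    seqs, cur = [], [BOS]
    for doc in docs:
        toks = list(doc)
        while toks:
            space = seq_len - len(cur)
            cur.extend(toks[:space])
            toks = toks[space:]
            if len(cur) == seq_len:
                seqs.append(cur)
                cur = [BOS]
    if len(cur) > 1:                                     # flush the partial tail
        seqs.append(cur + [BOS] * (seq_len - len(cur)))  # pad tail only (sketch)
    return seqs
```

Every emitted sequence is exactly seq_len tokens and opens with BOS, which is the invariant the model's training loop relies on.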

On Apple Silicon, the big win is unified memory. There is no CPU-to-GPU transfer. The data, model weights, optimizer states, and gradients all live in the same memory pool. On a CUDA setup with a consumer GPU, PCIe bandwidth can bottleneck the dataloader. On a Mac, that entire category of performance issues just does not exist. The tradeoff is that you are sharing that same memory pool with macOS, your browser, and everything else running on the machine, which is exactly why the 16GB M1 Pro hits memory pressure during training.

Why BPB instead of perplexity? The evaluation metric is bits per byte (BPB): the cross-entropy loss (in nats per token) is converted to bits and normalized by the raw byte count, i.e. (cross_entropy_loss / ln 2) * (num_tokens / num_bytes). This normalizes away the tokenizer vocabulary size, so you can swap the tokenizer or change the vocab size and still get comparable numbers. It is the right metric for autoresearch, where the agent might decide to change the tokenization strategy entirely. Lower is better.
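With the loss measured in nats (the usual cross-entropy convention), the conversion is a one-liner; the division by ln 2 turns nats into bits:

```python
import math


def bits_per_byte(ce_loss_nats: float, num_tokens: int, num_bytes: int) -> float:
    """Convert mean cross-entropy (nats/token) to bits, then normalize by
    the raw byte count of the evaluated text."""
    return (ce_loss_nats / math.log(2)) * (num_tokens / num_bytes)
```

As a sanity check, a model that achieves ln 2 nats per token on text averaging one byte per token scores exactly 1.0 BPB: one bit of surprise per byte.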

Training in Action

Here is real data from my 5-minute benchmark. Every data point is from the actual run, nothing simulated:

{
  "data": [
    {
      "x": [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54],
      "y": [9.012,9.012,8.688,8.426,8.216,8.060,7.960,7.882,7.812,7.757,7.705,7.670,7.631,7.604,7.576,7.553,7.527,7.511,7.492,7.464,7.444,7.414,7.396,7.379,7.355,7.334,7.309,7.291,7.274,7.244,7.226,7.207,7.192,7.168,7.145,7.119,7.094,7.071,7.048,7.022,7.000,6.984,6.965,6.948,6.933,6.913,6.894,6.877,6.856,6.834,6.817,6.801,6.787,6.770,6.762],
      "type": "scatter",
      "mode": "lines+markers",
      "name": "Training Loss (EMA)",
      "line": {"color": "#636EFA", "width": 2},
      "marker": {"size": 4}
    }
  ],
  "layout": {
    "title": {"text": "Training Loss Curve (M1 Pro 16GB, 5-min budget)"},
    "xaxis": {"title": "Step"},
    "yaxis": {"title": "Smoothed Training Loss"},
    "showlegend": false
  }
}

The loss drops steadily from 9.01 to 6.76 over 55 steps. No spikes, no NaN incidents, no instability. That smooth curve is exactly what you want to see from a well-configured optimizer on a small model.
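The "EMA" in the chart legend is a plain exponential moving average over the per-step losses. A sketch (the smoothing constant here is a guess, not necessarily the value in train.py):

```python
def ema(values: list[float], beta: float = 0.9) -> list[float]:
    """Exponential moving average, seeded with the first value so the
    smoothed curve starts where the raw curve does."""
    out, avg = [], values[0]
    for v in values:
        avg = beta * avg + (1 - beta) * v
        out.append(avg)
    return out
```

Smoothing matters for the keep-or-revert decision indirectly: the raw per-step loss on a 55-step run is noisy enough that a trend is hard to eyeball without it.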

But the throughput chart tells a more interesting story:

{
  "data": [
    {
      "x": [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54],
      "y": [7604,24127,22891,22148,18671,14971,22997,23022,23742,17105,16300,20709,25035,22540,18241,11142,20179,17631,16099,18564,17911,17994,14105,1742,2628,5754,5672,3634,7285,2245,5507,6574,16197,29251,29009,29612,29569,28550,25040,28735,28859,28095,27022,25195,20166,24015,24746,14838,18569,25240,26747,26207,29411,28793,28899],
      "type": "scatter",
      "mode": "lines+markers",
      "name": "Tok/sec",
      "line": {"color": "#EF553B", "width": 2},
      "marker": {"size": 4}
    }
  ],
  "layout": {
    "title": {"text": "Throughput Over Training (M1 Pro 16GB)"},
    "xaxis": {"title": "Step"},
    "yaxis": {"title": "Tokens per Second"},
    "showlegend": false,
    "annotations": [
      {
        "x": 23, "y": 1742,
        "text": "Memory pressure",
        "showarrow": true,
        "arrowhead": 2,
        "ax": 40, "ay": -40,
        "font": {"size": 11, "color": "#EF553B"}
      },
      {
        "x": 29, "y": 2245,
        "text": "Swap thrashing",
        "showarrow": true,
        "arrowhead": 2,
        "ax": 40, "ay": -40,
        "font": {"size": 11, "color": "#EF553B"}
      }
    ]
  }
}

See those dramatic dips around steps 23 to 30? That is macOS memory pressure in action. With the model weights, optimizer states (first- and second-moment buffers for every parameter), gradient accumulators, and attention masks all resident in unified memory, the 16GB M1 Pro is right at the edge. When macOS starts compressing and swapping pages, throughput drops from ~26K tok/sec to under 2K. Each of those slow steps takes 15 to 37 seconds instead of the usual 2.5 seconds. Training still converges correctly (the loss curve above is smooth right through those dips), but you lose precious steps from your 5-minute budget. On an M4 Max with 36 or 64GB, these dips disappear entirely and you get a consistent 29K+ tok/sec throughout.

Improvements Over the Reference Fork

When I started this port, trevin-creator/autoresearch-mlx already existed as an early community MLX adaptation. I used it as a reference but ended up rewriting most of the training loop to fix several issues I ran into:

| Improvement | Impact | Reference behavior |
| --- | --- | --- |
| NaN loss detection | Prevents silent training corruption | Only checks loss > 100; NaN passes silently |
| FINAL_EVAL_BATCH_SIZE=16 | Lets 16GB Macs complete eval | Uses 256, OOMs on 16GB machines |
| if __name__ == "__main__" guard | Enables agent import workflow | Runs training at import time |
| Weight decay schedule | ~2-5% BPB improvement | Constant weight decay throughout |
| estimate_flops() | Per-token FLOP estimation with window sizes | Not present |
| num_scaling_params() | Detailed parameter breakdown by category | Flat parameter count only |
| MFU calculation | Configurable via PEAK_FLOPS_TFLOPS env var | Hardcoded 0.0 placeholder |
| Config logging | Full GPTConfig printed via asdict() | No config output |
| Phase timing | Separate training/eval timing logs | Partial |
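The NaN detection in the first row matters because of a quirk of IEEE floating point: every comparison with NaN is False, so `loss > 100` never fires when the loss has already gone to NaN. The fix is an explicit check:

```python
import math


def loss_is_bad(loss: float) -> bool:
    """Catch both divergence and NaN. A bare `loss > 100` check is always
    False for NaN, letting a corrupted run continue silently."""
    return math.isnan(loss) or loss > 100.0
```

Without this, an agent experiment that NaNs out can still post a "final" metric and poison the comparison against the baseline.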

The eval batch size issue was the most frustrating to debug. The reference fork uses FINAL_EVAL_BATCH_SIZE=256, which works fine on an M4 Max with 27GB available. But on a 16GB machine, the evaluation step triggers an out-of-memory crash after the 5-minute training run completes successfully. You sit through the whole training, only for the eval to kill the process and lose everything. Reducing it to 16 eliminates this entirely.

The __main__ guard is subtle but important for the autonomous agent workflow. Without it, the agent’s import train (which it uses to inspect the module) silently kicks off a full training run. That is 5 minutes wasted before the agent even starts its actual experiment.

The reference fork reports a best result of 1.295 BPB on M4 Max and 1.353 BPB on Mac Mini after extended autonomous runs. My single 5-minute baseline on M1 Pro 16GB reaches 2.371 BPB. The gap is hardware, not code: fewer steps due to memory pressure and a slower processor. The raw per-step training throughput is identical between the two implementations since both use the same MLX ops. On equivalent hardware with the weight decay fix and other improvements, I would expect this port to match or beat the reference fork’s numbers.

Try It Yourself

Four commands and you are training:

# 1. Install uv (if you don't have it)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Install dependencies
uv sync

# 3. Download data and train tokenizer (one-time, ~2 min)
uv run prepare.py

# 4. Run a training experiment
uv run train.py

Want a longer run? Use the --time-budget flag:

uv run train.py --time-budget 600   # 10 minutes
uv run train.py --time-budget 1800  # 30 minutes
uv run train.py --time-budget 3600  # 1 hour

To run the full autonomous loop, point an AI agent (Claude Code, Codex, or similar) at program.md in the repo:

Hi, have a look at program.md and let's kick off a new experiment!
Let's do the setup first.

The agent will create a branch, establish a baseline BPB, and start iterating. Make sure to disable all permission prompts so it can run unattended. On an M1 Pro you will get about 12 experiments per hour. Let it run overnight and check the git log in the morning.

Requirements: Apple M-series Mac (M1/M2/M3/M4), Python 3.10+, uv.

The code is at github.com/thenamangoyal/autoresearch.

Conclusion

You do not need an H100 to experiment with autonomous AI research. A MacBook Pro can train an 11.5M parameter language model on real web data, with a proper six-group AdamW optimizer, modern architecture tricks like RoPE and value embeddings, and a clean autonomous iteration loop. Yes, it is 96x slower than a datacenter GPU. But it costs nothing, it sits on your desk, and it works.

There is plenty of room to push this further. The Muon optimizer from the upstream CUDA version is not yet ported (it needs SVD operations that require careful MLX adaptation). M4 Max and M4 Ultra machines with 64GB+ of unified memory should eliminate the throughput dips and unlock much longer effective training runs. And the autonomous loop itself could be made smarter: better search strategies, multi-objective optimization, cross-experiment learning from the full results history.

For now, the foundation works. Clone the repo, run prepare.py, and let your MacBook do some research overnight. You might be surprised by what it finds.

Thanks to Andrej Karpathy for the original autoresearch concept and to the trevin-creator community fork for the initial MLX exploration.




If you found this useful, please cite this as:

Goyal, Naman (Mar 2026). Your MacBook Can Do Autonomous AI Research Now. https://namangoyal.com.

or as a BibTeX entry:

@software{goyal2026your-macbook-can-do-autonomous-ai-research-now,
  title   = {Your MacBook Can Do Autonomous AI Research Now},
  author  = {Goyal, Naman},
  year    = {2026},
  month   = {Mar},
  doi     = {10.5281/zenodo.18956067},
  url     = {https://namangoyal.com/blog/2026/autoresearch-mlx/}
}


