Two AI Agents, One MacBook, Zero API Keys

Code: github.com/thenamangoyal/multi-agent-mlx

Watch two AI agents collaborate in real time. The Coder writes code, the sandbox runs it, and the Sheriff reviews the output. If something breaks, the error goes back to the Coder.

What happens when you put two AI agents in a room, give them a coding task, and let them argue until the code works?

That is the premise behind this project. Agent A (the Coder) writes Python scripts. Agent B (the Sheriff) runs them, reads the stack traces, and tells the Coder what it got wrong. They go back and forth until the code passes or a hard limit kicks in, following the iterative self-refinement pattern (Madaan et al., 2023) with execution-grounded feedback (Shinn et al., 2023). The catch is that both agents run on a single MacBook using Qwen2.5-Coder-7B-4bit (Hui et al., 2024). No OpenAI key, no cloud GPU, no data ever leaving the machine.

I built this to answer a question that kept nagging me: can two small local models actually collaborate to produce working code, the same way cloud-based agents do with GPT-4? The answer, it turns out, is yes, with some interesting caveats about what small models can and cannot self-correct.

The code is at github.com/thenamangoyal/multi-agent-mlx (Goyal, 2026).

Results at a Glance

I ran three scenarios of increasing difficulty, all on an M1 Pro with 16 GB of RAM:

Scenario | Task | Attempts | Time | Result
The Off-by-One Gauntlet | Generate a formatted calendar for March 2026 without the calendar module | 1 | 22s | First-try success
The CSV Detective | Generate 200-row CSV, read it back, compute revenue analytics with exact formatting | 3 | 159s | Self-corrected twice
Gradient Descent from Scratch | Train a neural network with manual backprop using only numpy | 2 | 50s | Self-corrected once

Total wall time: 3 minutes 50 seconds. Total cost: $0.00.

[Bar chart: wall-clock time per scenario. Calendar 22.0s (1 attempt), CSV Detective 158.8s (3 attempts), Neural Net 49.6s (2 attempts)]

Every scenario eventually produced correct, running code. The interesting part is not that they succeeded. It is how they failed first and then fixed themselves.

The Architecture

The key constraint is 16 GB of RAM. Two separate 7B models would blow past that. The solution: both agents share a single mlx_lm.server instance serving one model (Qwen2.5-Coder-7B-Instruct-4bit, ~4 GB). They are two system-prompt identities taking sequential turns, orchestrated by plain Python. Not a free-form multi-agent chat. A structured feedback loop.
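A minimal sketch of that design: one shared endpoint, two "agents" that differ only in their system prompt. The endpoint path and model ID below assume mlx_lm.server's OpenAI-compatible defaults, and the injectable `post` hook is purely illustrative (it is not a helper from the repo):

```python
import json
import urllib.request

SERVER = "http://localhost:8080/v1/chat/completions"  # mlx_lm.server default (assumed)

CODER_SYSTEM = "You are the Coder. Reply with one complete Python script in a fenced block."
SHERIFF_SYSTEM = "You are the Sheriff. Review the execution output and reply PASS or FAIL."

def chat(system_prompt, user_content, post=None):
    """One sequential turn against the single shared model."""
    payload = {
        "model": "mlx-community/Qwen2.5-Coder-7B-Instruct-4bit",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
        ],
    }
    if post is None:  # default transport: the local mlx_lm.server
        req = urllib.request.Request(
            SERVER, json.dumps(payload).encode(), {"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(req, timeout=120) as resp:
            body = json.load(resp)
    else:  # injectable transport, handy for testing without a server
        body = post(payload)
    return body["choices"][0]["message"]["content"]
```

Both identities hit the same ~4 GB model in memory, so the second "agent" costs nothing extra in RAM.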

Memory Budget

Component | RAM | Notes
macOS + system | ~3-4 GB | Baseline
Qwen2.5-Coder-7B-4bit | ~4 GB | Single model, loaded once
KV cache | ~1-2 GB | Prompt context
Python + sandbox | ~0.5 GB | Orchestrator + scripts
Total | ~9-10 GB | 6-7 GB headroom

Why the Orchestrator Executes Code Directly

This was a hard-won lesson. My initial design had the Sheriff agent calling an execute_code tool through the framework’s tool-calling protocol. It never worked reliably. Small quantized models (4-bit 7B) are inconsistent at structured tool calling. They would output the code in a markdown block instead of invoking the tool function, or hallucinate JSON that did not match the schema.

The fix was to take the mechanical execution out of the LLM’s hands entirely. The orchestrator extracts code from whatever the Coder produces (tool call or markdown), runs it in a subprocess, and feeds the raw stdout/stderr/exit code into the Sheriff’s prompt. The Sheriff’s job is reduced to what LLMs are actually good at: reading text and making judgments. Think of it like a code review. The reviewer does not need to run gcc themselves. A CI system runs the build, and the reviewer reads the output.
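A sketch of that extract-and-execute step (function names here are illustrative, not the repo's actual helpers; note the fallback for truncated, unclosed code blocks):

```python
import re
import subprocess
import sys
import tempfile

def extract_code(reply: str) -> str:
    """Pull Python source out of whatever the model produced: a closed
    fenced block, a truncated unclosed block, or raw text."""
    m = re.search(r"```(?:python)?\n(.*?)```", reply, re.DOTALL)
    if m:
        return m.group(1)
    m = re.search(r"```(?:python)?\n(.*)", reply, re.DOTALL)  # unclosed block
    if m:
        return m.group(1)
    return reply

def run_sandboxed(code: str, timeout: int = 60):
    """Run in a subprocess and capture everything the Sheriff will read."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    proc = subprocess.run(
        [sys.executable, path], capture_output=True, text=True, timeout=timeout
    )
    return proc.stdout, proc.stderr, proc.returncode

stdout, stderr, rc = run_sandboxed("print(2 + 2)")
```

The raw stdout/stderr/exit-code triple is then pasted verbatim into the Sheriff's prompt.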

The Self-Correcting Loop

As you saw in the simulation above, each attempt follows the same pattern: the Coder generates a script, the orchestrator extracts and executes it in a sandbox, and the Sheriff reviews the output. If the Sheriff says FAIL, the error report goes back to the Coder with specific fix suggestions. If it says PASS, the loop ends.

Three safety layers prevent infinite loops:

  1. Hard limits. Maximum 5 attempts, 60-second execution timeout, 120-second LLM timeout per call.
  2. Stagnation detection. The system hashes the last 5 lines of each error traceback using MD5. If the same hash appears 3 times in a row, the model is stuck and the loop terminates.
  3. Token budget. A hard cap of 100K tokens across all turns prevents runaway context accumulation.
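The stagnation detector is the interesting one. A minimal sketch of how it might look (class and method names are my own, not the repo's):

```python
import hashlib
from collections import deque

class StagnationDetector:
    """Hash the last 5 lines of each traceback; the same digest three
    times in a row means the model is looping on the same error."""

    def __init__(self, window: int = 3):
        self.recent = deque(maxlen=window)

    def is_stuck(self, traceback_text: str) -> bool:
        tail = "\n".join(traceback_text.strip().splitlines()[-5:])
        digest = hashlib.md5(tail.encode()).hexdigest()
        self.recent.append(digest)
        return len(self.recent) == self.recent.maxlen and len(set(self.recent)) == 1
```

Hashing only the traceback tail makes the check robust to cosmetic differences earlier in the output.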

Scenario 1: The Off-by-One Gauntlet

Task Given to Coder

Write a Python script that generates a formatted calendar for March 2026. Print column headers (Mon-Sun), a day grid with right-aligned 4-character columns, and count weekdays vs weekend days. March 1, 2026 is a Sunday. Do NOT use the calendar module. Final line must be exactly Weekdays: 22, Weekend days: 9.

What happened: The Coder nailed it on the first try. It used datetime to verify the day of the week, wrote a clean grid layout with proper alignment, and counted weekdays correctly. The Sheriff confirmed the output and returned VERDICT: PASS in 22 seconds flat.

Attempt 1 - PASS (22s)

      3 2026
 Mon Tue Wed Thu Fri Sat Sun
                           1
   2   3   4   5   6   7   8
   9  10  11  12  13  14  15
  16  17  18  19  20  21  22
  23  24  25  26  27  28  29
  30  31
Weekdays: 22, Weekend days: 9

This is the easiest scenario, but it is not trivial. The model needs to handle date math, grid alignment with padding, and the edge case of a month starting on Sunday. A single off-by-one error in the first-day offset would cascade through every row. The model got it right in one shot.
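The risky date math fits in a few lines of stdlib Python. This is a condensed check, not the agent's actual script:

```python
from datetime import date

# March 1, 2026 is a Sunday: weekday() == 6 under Python's Monday-0
# convention, so the grid's first row needs six blank leading cells.
first = date(2026, 3, 1)
offset = first.weekday()  # 6 blank cells before day 1

days_in_march = 31
weekdays = sum(
    1 for d in range(1, days_in_march + 1)
    if date(2026, 3, d).weekday() < 5  # Mon-Fri
)
print(f"Weekdays: {weekdays}, Weekend days: {days_in_march - weekdays}")
# prints: Weekdays: 22, Weekend days: 9
```

Get `offset` wrong by one and every row of the grid shifts, which is exactly the cascade the scenario is designed to provoke.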

Scenario 2: The CSV Detective

Task Given to Coder

Write a complete data pipeline in one script. Step 1: Generate a 200-row CSV (sales_data.csv) with columns date, product, region, units, price_per_unit using random.seed(42). Step 2: Read it back, compute revenue per product (sorted descending), top region, and top month. Step 3: Print a formatted report with dollar amounts using comma separators like $12,345.67. Standard library only.

This one tells the best self-correction story. The model needed three attempts, each failing in a different way.

Attempt 1 - FAIL (scope error)

The code was too long and hit the token limit before the closing ```. The orchestrator salvaged what it could via its unclosed-block extractor, but the script had a scoping bug: print_report() referenced a variable defined in a different function.

NameError: name 'region_revenue' is not defined
  File "script.py", line 63, in print_report
    print(f"Top Region: {top_region} (${region_revenue[top_region]:,.2f})")

Sheriff: "The variable region_revenue is not defined in print_report(). It is defined in analyze_data() but is not accessible. Suggested fix: pass it as a parameter."

Attempt 2 - FAIL (different bug)

The Coder fixed the scope issue by passing region_revenue as a parameter. But it introduced a new bug: it forgot to also pass month_revenue.

NameError: name 'month_revenue' is not defined. Did you mean: 'total_revenue'?
  File "script.py", line 59, in print_report
    print(f"Top Month: {top_month} (${month_revenue[top_month]:,.2f})")

Sheriff: "Same pattern as before. month_revenue is not passed to print_report(). Fix all remaining scope issues."
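The pattern behind both failures, reduced to a toy example. Function and variable names mirror the agent's script, but the bodies are hypothetical stand-ins (the "North" and "May" figures come from the final report; the others are made up):

```python
def analyze_data(rows):
    region_revenue = {"North": 101205.15, "South": 80000.00}
    month_revenue = {"May": 42193.29, "June": 30000.00}
    return region_revenue, month_revenue

# Attempts 1 and 2 referenced region_revenue / month_revenue inside
# print_report() without receiving them, raising NameError at call time.
# The attempt-3 fix: pass every aggregate in explicitly.
def print_report(region_revenue, month_revenue):
    top_region = max(region_revenue, key=region_revenue.get)
    top_month = max(month_revenue, key=month_revenue.get)
    print(f"Top Region: {top_region} (${region_revenue[top_region]:,.2f})")
    print(f"Top Month: {top_month} (${month_revenue[top_month]:,.2f})")

print_report(*analyze_data([]))
```

Notice the model fixed one variable per attempt rather than auditing the whole function signature, which is why the same bug class surfaced twice.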

Attempt 3 - PASS (159s total)

=== Sales Analysis Report ===
Revenue by Product:
  Doohickey: $125,089.64
  Gadget: $90,750.45
  Widget: $90,696.45
Top Region: North ($101,205.15)
Top Month: May ($42,193.29)
Total Records: 200

[Bar chart: revenue by product from the agent-generated CSV. Doohickey $125,090, Gadget $90,750, Widget $90,696]

The CSV scenario is interesting because it tests the full data pipeline: file I/O, random data generation with a seed, reading data back, aggregation, and precise string formatting. Each of those is a potential failure point, and the model had to coordinate all of them in a single script.
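The aggregation and formatting step looks roughly like this in stdlib-only Python. The three-row dataset is a hypothetical stand-in for the generated 200-row file:

```python
import csv
import io
from collections import defaultdict

raw = (
    "product,units,price_per_unit\n"
    "Widget,3,19.99\n"
    "Gadget,2,45.00\n"
    "Widget,1,19.99\n"
)

# Revenue per product: units * price, summed per product.
revenue = defaultdict(float)
for row in csv.DictReader(io.StringIO(raw)):
    revenue[row["product"]] += int(row["units"]) * float(row["price_per_unit"])

# Sorted descending, with the report's $12,345.67 comma formatting.
for product, total in sorted(revenue.items(), key=lambda kv: -kv[1]):
    print(f"  {product}: ${total:,.2f}")
```

The `:,.2f` format spec is the entire trick behind the exact-formatting requirement, and the model handled it correctly in every attempt; the failures were all scoping, never formatting.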

Scenario 3: Gradient Descent from Scratch

Task Given to Coder

Write a neural network using only numpy. No ML frameworks. Generate 200 binary classification points with np.random.seed(42). Build a single-layer network: Input(2) -> Sigmoid -> Output(1). Train for 1000 epochs with learning rate 1.0 using binary cross-entropy loss and manual backprop. Print progress every 200 epochs. Final accuracy must be >= 90%. Save weights to model_data.npz.

Attempt 1 - FAIL (code mixed with prose)

The Coder tried to use the write_file() tool but then appended natural-language bullet points after the code. The result was valid Python mixed with markdown commentary that Python could not parse.

  File "script.py", line 65
    - The script generates 200 data points for binary classification.
          ^^^^^^
SyntaxError: invalid syntax

Sheriff: "The line - The script generates 200 data points is not valid Python code. This line is intended as a comment but is being interpreted as code."

Attempt 2 - PASS (50s total)

Clean 54-line script. Forward pass, backward pass, weight update, all mathematically correct. 100% accuracy:

Epoch 200: loss=0.0819, acc=100.0%
Epoch 400: loss=0.0603, acc=100.0%
Epoch 600: loss=0.0504, acc=100.0%
Epoch 800: loss=0.0444, acc=100.0%
Epoch 1000: loss=0.0402, acc=100.0%
Final accuracy: 100.0%

The training curve shows textbook convergence, from near-random (0.69) to 0.04 loss:

[Line chart: training convergence. Loss falls from 0.693 at epoch 0 to 0.040 at epoch 1000 while accuracy climbs from 50% to 100%]

The decision boundary learned by the agent’s neural network. The line 9.26x + 8.59y - 0.03 = 0 cleanly separates the two classes. Hover over any point to see its coordinates and class.

[Scatter plot: the 200 data points in two classes, with the learned decision boundary drawn as a dashed line]

The model got the forward pass, backward pass, and weight update all correct on the second try.
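The whole computation compresses to something like the sketch below. The data generation and labeling rule here are my assumptions, not the agent's exact code; only the training mechanics (sigmoid output, binary cross-entropy, manual gradient) match the task spec:

```python
import numpy as np

np.random.seed(42)
n = 200
X = np.random.randn(n, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)  # assumed labeling rule

W = np.zeros((2, 1))
b = 0.0
lr = 1.0
for epoch in range(1, 1001):
    z = X @ W + b
    y_hat = 1.0 / (1.0 + np.exp(-z))  # sigmoid forward pass
    grad_z = (y_hat - y) / n          # dL/dz for BCE through sigmoid
    W -= lr * (X.T @ grad_z)          # manual backprop: chain rule by hand
    b -= lr * grad_z.sum()

acc = float(((y_hat > 0.5) == (y > 0.5)).mean())
print(f"Final accuracy: {acc:.1%}")
```

The convenient part, and likely why the single-layer task succeeded where the 2-layer one failed, is that BCE through a sigmoid collapses to the clean gradient `y_hat - y`, with no chain-rule bookkeeping across layers.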

What I Learned About Small Models

Building this taught me several things about what 7B-4bit models can and cannot do.

They are good at:

  • Writing complete, self-contained scripts from detailed specifications
  • Fixing bugs when given the exact traceback
  • Restructuring code between attempts (not just patching the broken line)
  • Following formatting constraints ($12,345.67 comma separators, grid alignment)

They struggle with:

  • Consistent tool calling (they prefer to output markdown)
  • Very long generations without truncation (~60 lines seems to be the comfort zone)
  • Multi-layer backpropagation with correct tensor shapes (I tried a 2-layer neural net and it failed all 10 attempts across two runs on the chain rule math)
  • Not mixing code with natural language explanations in the same output

Design implications:

  • Extract code from whatever format the model produces. Do not depend on tool calling.
  • Keep tasks scoped so the solution fits in ~50-80 lines.
  • Give explicit formulas for math-heavy tasks (the model can implement given formulas but struggles to derive them).
  • Use the orchestrator for mechanical tasks (execution, file I/O) and the LLM for judgment tasks (generation, analysis).

The Cost Equation

[Bar chart: estimated cost per full three-scenario run. GPT-4 API ~$0.30, Claude API ~$0.15, local MLX $0.00]

The individual run is cheap on any platform. But the cost model changes when agents get stuck. A 10-iteration retry loop that would cost $1-3 on cloud APIs costs nothing locally. You can prototype, iterate, and experiment without watching a billing dashboard. For learning and development, that changes the economics entirely.

The tradeoff is capability. A 7B-4bit model is not GPT-4. It cannot handle the same task complexity or recover from the same depth of errors. But for well-scoped tasks with clear specifications, it gets there. And it does it in under 4 minutes, fully offline, at zero cost.

Try It Yourself

# Clone and install
git clone https://github.com/thenamangoyal/multi-agent-mlx.git
cd multi-agent-mlx
uv sync

# Run a task (auto-starts the MLX server, ~4 GB RAM)
uv run factory run "Write a Python script that prints the first 20 prime numbers"

# Or run all three showcase scenarios
uv run python scenarios/run_all.py

# Stop the server when done to free RAM
pkill -f "mlx_lm.server"

Requirements: macOS on Apple Silicon, Python 3.12+, uv. The model downloads automatically on first run (~4 GB). The whole system fits comfortably in 10 GB of RAM.

[Radar chart: per-scenario breakdown of time, attempts, code lines, and task complexity. Calendar 22s, 1 attempt, 47 lines; CSV Detective 159s, 3 attempts, 53 lines; Neural Net 50s, 2 attempts, 54 lines]

Final Thoughts

The most surprising thing about this project is not that it works. It is that the failure modes are interesting. When the Coder produces broken code, the Sheriff does not just say “it’s broken.” It identifies the error type, points to the exact line, and suggests a specific fix. And the Coder, reading that report, does not just patch the one line. It often rewrites the entire approach. That back-and-forth, mediated by a simple Python loop instead of a complex multi-agent framework, produces working code reliably enough to be useful.

The obvious next step is scaling up. A 14B or 32B model on a 64 GB Mac would handle the 2-layer backprop task that stumped the 7B model. Multiple tool calls would probably start working reliably. The architecture stays the same; you just swap the model ID. That is the beauty of building on top of mlx_lm.server: the agents do not know or care what model is behind the endpoint.

For now, the 7B model on 16 GB is the sweet spot for prototyping. It is fast enough to iterate (22 seconds for simple tasks), cheap enough to experiment freely ($0.00), and private enough to use on any codebase. Not bad for two agents arguing on a laptop.




If you found this useful, please cite this as:

Goyal, Naman (Mar 2026). Two AI Agents, One MacBook, Zero API Keys. https://namangoyal.com.

or as a BibTeX entry:

@software{goyal2026two-ai-agents-one-macbook-zero-api-keys,
  title   = {Two AI Agents, One MacBook, Zero API Keys},
  author  = {Goyal, Naman},
  year    = {2026},
  month   = {Mar},
  doi     = {10.5281/zenodo.19002113},
  url     = {https://namangoyal.com/blog/2026/multi-agent-factory/}
}

References

  1. Madaan, Aman, Niket Tandon, Prakhar Gupta, et al. Self-Refine: Iterative Refinement with Self-Feedback. NeurIPS 2023 (arXiv:2303.17651).
  2. Shinn, Noah, Federico Cassano, Ashwin Gopinath, et al. Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023 (arXiv:2303.11366).
  3. Hui, Binyuan, Jian Yang, Zeyu Cui, et al. Qwen2.5-Coder Technical Report. arXiv preprint arXiv:2409.12186, 2024.
  4. Goyal, Naman. Two AI Agents, One MacBook, Zero API Keys. Mar 2026.


