I Spent 34 Steps Building a Code Generator on My MacBook — Here's What Actually Worked

Florinel Chis · March 2026


Most fine-tuning tutorials show you the happy path. This is the full path — including 6 training rounds that taught the model absolutely nothing, OOM crashes that killed my machine, and the realization that the real problem was never about the model.

The end result: A Laravel PHP code generator that produces 26/26 valid PHP files with 20/20 Pest tests passing. Trained on 49 examples. Runs on an Apple M2 Pro with 16GB RAM. Total cloud GPU cost: $0.

Here's how I actually got there.

The Hardware

The 16GB constraint shaped every architectural decision. You can't load two 7B models. You can't train with max_seq_length=4096. You close LM Studio before training or your machine crashes.
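The arithmetic behind that constraint is quick to check. A back-of-envelope estimate (my numbers, not from the post's training config) shows why two 7B models can't coexist in 16GB:

```python
# Rough weight-only memory estimate for a 7B-parameter model.
# Assumption: weights dominate; KV cache, activations, and LoRA
# optimizer state all come on top of this.
params = 7e9
fp16_gb = params * 2 / 1024**3    # 2 bytes per parameter
q4_gb = params * 0.5 / 1024**3    # ~0.5 bytes per parameter at 4-bit
print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {q4_gb:.1f} GB")
# → fp16: 13.0 GB, 4-bit: 3.3 GB
```

One fp16 model already eats most of the machine; quantized weights plus a short sequence length are what make training fit at all.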

Phase 1: Six Sprints of Nothing (The Silent Truncation Bug)

I started with 90 training examples and grew to 261 across 6 sprints. val_loss kept dropping. By Sprint 6, it hit 0.000. Perfect.

Except the generated code wasn't getting better. At all.

The Root Cause

The system prompt (guidelines for the model) had grown organically across sprints to 2,380 tokens. My max_seq_length was 1,500.

MLX truncates training examples silently at max_seq_length. Every single training example was cut off before the code completion even started. The model was being trained to predict its own system prompt — and it got really good at that (hence val_loss=0.000).

Six sprints. Hundreds of examples. Zero code learning.

The Fix

# BEFORE: 2380 tokens of verbose guidelines
SYSTEM = """You are an expert Laravel developer. When writing models,
always use the HasFactory trait. The HasFactory trait enables...
[2380 tokens of examples and explanations]"""

# AFTER: 843 tokens, compressed
SYSTEM = """Laravel 13.x code generator. Output ONLY PHP.
- model: use HasFactory, add relationships from spec
- controller: import Controller, destroy() returns noContent()
..."""

And the verification I should have done from the start:

# Check that no example exceeds max_seq_length, and that the
# completion still has tokens left after the system prompt
# (assumes each example carries a "prompt" field alongside "text")
for example in dataset:
    prompt_len = len(tokenizer.encode(example["prompt"]))
    total_len = len(tokenizer.encode(example["text"]))
    assert total_len <= max_seq_length, f"Truncated at {total_len} tokens"
    assert total_len - prompt_len > 0, "Zero completion tokens survive"

Lesson: val_loss=0.000 means nothing is being learned, not that everything is perfect. Always verify your training data reaches the completions.

Phase 2: Targeted Bug Fixing (The 10-15 Example Rule)

After fixing the truncation bug, real training started. val_loss: 0.080 (not 0.000!).

I discovered that every systematic bug can be fixed with 10-15 targeted examples:

| Bug | Examples needed | Result |
|---|---|---|
| 'optional' validation rule (not a Laravel rule) | 10 | Fixed — generates 'nullable' |
| wasRecentlyCreated in resources | 5 | Fixed — uses correct timestamps |
| Cross-resource missing imports | 13 | Fixed — 12 bugs → 0 |
| Missing HasFactory trait | 20 (fixed existing) | Fixed — 5 bugs → 0 |

The model already knows PHP. You're nudging a trained distribution, not teaching from scratch. 10-15 diverse examples of the correct pattern is enough.
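For illustration, one targeted fix example for the 'optional' bug might look like this. The JSONL shape and prompt wording are my guess at the format, not the author's exact training data:

```python
import json

# Hypothetical targeted training example: show 'nullable' where the
# model kept emitting the invalid 'optional' validation rule.
# 10-15 variations (different fields, models, rule combinations)
# of this pattern are what fix the bug.
example = {
    "text": (
        "Spec: StoreEventRequest, field 'ends_at' is an optional date.\n"
        "Completion:\n"
        "public function rules(): array\n"
        "{\n"
        "    return ['ends_at' => ['nullable', 'date']];\n"
        "}\n"
    )
}
print(json.dumps(example))  # one JSONL line of training data
```

The diversity matters more than the count: 15 near-identical examples teach less than 10 that vary the surrounding context.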

The Eval Script Trap

I built an automated bug checker. It flagged StoreBookRequest $request as "missing Illuminate\Http\Request import" because the regex 'Request $request' matched as a substring.

Test your eval script on correct code before trusting it.
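The failure is easy to reproduce with a plain substring search; a word-boundary guard (here a negative lookbehind, my reconstruction of the fix) removes the false positive:

```python
import re

code = "public function store(StoreBookRequest $request)"

# Naive check: the substring match fires inside the longer class
# name StoreBookRequest -- a false positive.
assert re.search(r"Request \$request", code) is not None

# Guarded check: require that 'Request' is not preceded by a letter,
# so only a bare `Request $request` type hint matches.
assert re.search(r"(?<![A-Za-z])Request \$request", code) is None

# A genuinely bare type hint still matches.
assert re.search(r"(?<![A-Za-z])Request \$request",
                 "public function store(Request $request)") is not None
```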

Where I Hit the Wall

After Sprint 9: 52/58 Pest tests passing. All 6 remaining failures were semantic hallucinations: relationships and fields invented out of pretraining, never mentioned in the prompt.

Adding more NL training examples didn't help. The model was filling prompt ambiguity with its pretraining priors. The problem wasn't the model — it was the input format.

Phase 3: The Spec Pivot (The Real Breakthrough)

Instead of natural language:

"Create a Post model with author relationship, fillable title and body, soft deletes"

I switched to structured JSON specs:

{
  "artifact": "model",
  "class": "Post",
  "table": "posts",
  "has_factory": true,
  "soft_deletes": true,
  "fillable": ["title", "body", "user_id"],
  "relationships": [
    {"type": "BelongsTo", "model": "User", "method": "author", "foreign_key": "user_id"}
  ]
}

First test: 28 examples, 100 iterations

Result: 26/26 files passed eval, with zero semantic hallucinations. (For comparison: the NL pipeline, at 308 examples, still produced 5 hallucinations.)

The model can't invent a user() relationship if relationships[] explicitly lists only author. The spec removes the model's ability to hallucinate about what to generate. It only decides how.

The Spec Compiler

I built a compiler that validates specs before generation:

$ python3 spec_compiler.py bad_spec.json

SpecCompileError: rules['venue_id'] contains conditional token
'required_on_post'. Use 'conditional_rules' dict instead.

Validation: <1ms. Generation: ~30s per file. Catch errors early.
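A minimal sketch of that validation layer, modeled on the error message above. The token names are assumptions, and the real spec_compiler.py certainly performs more checks than this:

```python
# Hypothetical subset of the compiler's checks: reject conditional
# tokens embedded in plain validation rules.
CONDITIONAL_TOKENS = {"required_on_post", "required_on_put"}  # assumed names

class SpecCompileError(ValueError):
    pass

def validate_rules(spec: dict) -> None:
    """Fail fast if a rules[] entry smuggles in a conditional token."""
    for field, rule in spec.get("rules", {}).items():
        for token in str(rule).split("|"):
            if token in CONDITIONAL_TOKENS:
                raise SpecCompileError(
                    f"rules[{field!r}] contains conditional token "
                    f"{token!r}. Use 'conditional_rules' dict instead."
                )

bad_spec = {"rules": {"venue_id": "required_on_post|integer"}}
try:
    validate_rules(bad_spec)
except SpecCompileError as err:
    print(err)
```

Because the check is pure dict-walking, it costs microseconds, while a bad spec that slips through costs a 30-second generation plus debugging time.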

Final Results: adapters_spec_v4

| Metric | NL Pipeline (308 ex) | Spec Pipeline (49 ex) |
|---|---|---|
| PHP valid | 26/26 | 26/26 |
| Pest pass | 52/58 | 20/20 |
| Manual fixes | 5 | 4 |
| Semantic hallucinations | 5 | 0 |
| Training time | ~30 min | ~15 min |

The Debugging Checklist

Distilled from 34 steps of hitting walls:

Before training:
1. Tokenize ALL examples. Check max(total_tokens) < max_seq_length.
2. Check min(completion_tokens) > 0. If zero, the system prompt is too long.
3. Close all GPU-using processes. Check memory with vm_stat.
4. Use --num-layers 8 (not --lora-layers 8) on 16GB machines.

After training:
5. If val_loss = 0.000: training is broken, not perfect.
6. Generate 3-5 test files and inspect them manually before the full benchmark.
7. Run php -l on all output (syntax check).

When bugs persist:
8. Classify: is it a training data gap or a model capability limit?
9. If data gap: write 10-15 targeted examples with diverse contexts.
10. If capability limit: change the input format (structured specs).
11. If hallucinations persist after targeted training: the problem is ontological — the model's pretraining domain model diverges from yours. Give it an explicit ontology (structured spec); don't fight it with more NL examples.

What 7B Models Do Well vs Poorly

Does well:
- Individual class generation with clear patterns
- PHP syntax (very rare errors after basic fine-tuning)
- Following explicit rules in the system prompt
- CRUD operations with a single model

Does poorly:
- Multi-file consistency (imports across files)
- Knowing what NOT to add (hallucinated relationships)
- Distinguishing Laravel API versions (mixes 9.x and 13.x patterns)
- Complex relationship traversal

The key insight: 7B models don't reason about code. They pattern-match against pretraining. Every persistent bug is a missing pattern. The fix is always: add examples. If that's not enough: change the input format to remove the decision from the model entirely.

Try It Yourself

Everything is open source:

pip install mlx-lm

# Full pipeline: NL → specs → compile → PHP files
python3 pipeline_spec.py "Create a REST API for managing blog posts with tags"

# Or use a spec directly
python3 pipeline_spec.py --spec my_specs.json --output ./generated

Runs entirely on Apple Silicon. M1/M2/M3/M4 with 16GB+ RAM.


This post is an abbreviated version of: "From Hallucination to Ontology: 34 Steps Building a Domain-Specific Code Generator on Consumer Hardware" (Chis, 2026). The full paper with detailed results, bug taxonomy, and infrastructure lessons is available as a preprint.