I Spent 34 Steps Building a Code Generator on My MacBook — Here's What Actually Worked

Florinel Chis · March 2026


Most fine-tuning tutorials show you the happy path. This is the full path — including 6 training rounds that taught the model absolutely nothing, OOM crashes that killed my machine, and the realization that the real problem was never about the model.

The end result: A Laravel PHP code generator that produces 26/26 valid PHP files with 20/20 Pest tests passing. Trained on 49 examples. Runs on an Apple M2 Pro with 16GB RAM. Total cloud GPU cost: $0.

Here's how I actually got there.

The Hardware

The 16GB constraint shaped every architectural decision. You can't load two 7B models. You can't train with max_seq_length=4096. You close LM Studio before training or your machine crashes.
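The arithmetic behind that constraint is quick to check. A back-of-envelope estimate (my numbers, not from the post's training config) shows why two 7B models can't coexist in 16GB:

```python
# Rough weight-only memory estimate for a 7B-parameter model.
# Assumption: weights dominate; KV cache, activations, and LoRA
# optimizer state all come on top of this.
params = 7e9
fp16_gb = params * 2 / 1024**3    # 2 bytes per parameter
q4_gb = params * 0.5 / 1024**3    # ~0.5 bytes per parameter at 4-bit
print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {q4_gb:.1f} GB")
# → fp16: 13.0 GB, 4-bit: 3.3 GB
```

One fp16 model already eats most of the machine; quantized weights plus a short sequence length are what make training fit at all.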

Phase 1: Six Sprints of Nothing (The Silent Truncation Bug)

I started with 90 training examples and grew to 261 across 6 sprints. val_loss kept dropping. By Sprint 6, it hit 0.000. Perfect.

Except the generated code wasn't getting better. At all.

The Root Cause

The system prompt (guidelines for the model) had grown organically across sprints to 2,380 tokens. My max_seq_length was 1,500.

MLX truncates training examples silently at max_seq_length. Every single training example was cut off before the code completion even started. The model was being trained to predict its own system prompt — and it got really good at that (hence val_loss=0.000).

Six sprints. Hundreds of examples. Zero code learning.

The Fix

# BEFORE: 2380 tokens of verbose guidelines
SYSTEM = """You are an expert Laravel developer. When writing models,
always use the HasFactory trait. The HasFactory trait enables...
[2380 tokens of examples and explanations]"""

# AFTER: 843 tokens, compressed
SYSTEM = """Laravel 13.x code generator. Output ONLY PHP.
- model: use HasFactory, add relationships from spec
- controller: import Controller, destroy() returns noContent()
..."""

And the verification I should have done from the start:

# Check that no example exceeds max_seq_length, and that the
# completion still has tokens left after the system prompt
# (assumes each example carries a "prompt" field alongside "text")
for example in dataset:
    prompt_len = len(tokenizer.encode(example["prompt"]))
    total_len = len(tokenizer.encode(example["text"]))
    assert total_len <= max_seq_length, f"Truncated at {total_len} tokens"
    assert total_len - prompt_len > 0, "Zero completion tokens survive"

Lesson: val_loss=0.000 means nothing is being learned, not that everything is perfect. Always verify your training data reaches the completions.

Phase 2: Targeted Bug Fixing (The 10-15 Example Rule)

After fixing the truncation bug, real training started. val_loss: 0.080 (not 0.000!).

I discovered that every systematic bug can be fixed with 10-15 targeted examples:

| Bug | Examples needed | Result |
|---|---|---|
| 'optional' validation rule (not a Laravel rule) | 10 | Fixed — generates 'nullable' |
| wasRecentlyCreated in resources | 5 | Fixed — uses correct timestamps |
| Cross-resource missing imports | 13 | Fixed — 12 bugs → 0 |
| Missing HasFactory trait | 20 (fixed existing) | Fixed — 5 bugs → 0 |

The model already knows PHP. You're nudging a trained distribution, not teaching from scratch. 10-15 diverse examples of the correct pattern is enough.
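For illustration, one targeted fix example for the 'optional' bug might look like this. The JSONL shape and prompt wording are my guess at the format, not the author's exact training data:

```python
import json

# Hypothetical targeted training example: show 'nullable' where the
# model kept emitting the invalid 'optional' validation rule.
# 10-15 variations (different fields, models, rule combinations)
# of this pattern are what fix the bug.
example = {
    "text": (
        "Spec: StoreEventRequest, field 'ends_at' is an optional date.\n"
        "Completion:\n"
        "public function rules(): array\n"
        "{\n"
        "    return ['ends_at' => ['nullable', 'date']];\n"
        "}\n"
    )
}
print(json.dumps(example))  # one JSONL line of training data
```

The diversity matters more than the count: 15 near-identical examples teach less than 10 that vary the surrounding context.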

The Eval Script Trap

I built an automated bug checker. It flagged StoreBookRequest $request as "missing Illuminate\Http\Request import" because the regex 'Request $request' matched as a substring.

Test your eval script on correct code before trusting it.
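The failure is easy to reproduce with a plain substring search; a word-boundary guard (here a negative lookbehind, my reconstruction of the fix) removes the false positive:

```python
import re

code = "public function store(StoreBookRequest $request)"

# Naive check: the substring match fires inside the longer class
# name StoreBookRequest -- a false positive.
assert re.search(r"Request \$request", code) is not None

# Guarded check: require that 'Request' is not preceded by a letter,
# so only a bare `Request $request` type hint matches.
assert re.search(r"(?<![A-Za-z])Request \$request", code) is None

# A genuinely bare type hint still matches.
assert re.search(r"(?<![A-Za-z])Request \$request",
                 "public function store(Request $request)") is not None
```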

Where I Hit the Wall

After Sprint 9: 52/58 Pest tests passing. All 6 remaining failures were semantic hallucinations: relationships and fields invented out of pretraining, never mentioned in the prompt.

Adding more NL training examples didn't help. The model was filling prompt ambiguity with its pretraining priors. The problem wasn't the model — it was the input format.

Phase 3: The Spec Pivot (The Real Breakthrough)

Instead of natural language:

"Create a Post model with author relationship, fillable title and body, soft deletes"

I switched to structured JSON specs:

{
  "artifact": "model",
  "class": "Post",
  "table": "posts",
  "has_factory": true,
  "soft_deletes": true,
  "fillable": ["title", "body", "user_id"],
  "relationships": [
    {"type": "BelongsTo", "model": "User", "method": "author", "foreign_key": "user_id"}
  ]
}

First test: 28 examples, 100 iterations

Result: 26/26 files passed eval, with zero semantic hallucinations. (For comparison: the NL pipeline, at 308 examples, still produced 5 hallucinations.)

The model can't invent a user() relationship if relationships[] explicitly lists only author. The spec removes the model's ability to hallucinate about what to generate. It only decides how.

The Spec Compiler

I built a compiler that validates specs before generation:

$ python3 spec_compiler.py bad_spec.json

SpecCompileError: rules['venue_id'] contains conditional token
'required_on_post'. Use 'conditional_rules' dict instead.

Validation: <1ms. Generation: ~30s per file. Catch errors early.
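A minimal sketch of that validation layer, modeled on the error message above. The token names are assumptions, and the real spec_compiler.py certainly performs more checks than this:

```python
# Hypothetical subset of the compiler's checks: reject conditional
# tokens embedded in plain validation rules.
CONDITIONAL_TOKENS = {"required_on_post", "required_on_put"}  # assumed names

class SpecCompileError(ValueError):
    pass

def validate_rules(spec: dict) -> None:
    """Fail fast if a rules[] entry smuggles in a conditional token."""
    for field, rule in spec.get("rules", {}).items():
        for token in str(rule).split("|"):
            if token in CONDITIONAL_TOKENS:
                raise SpecCompileError(
                    f"rules[{field!r}] contains conditional token "
                    f"{token!r}. Use 'conditional_rules' dict instead."
                )

bad_spec = {"rules": {"venue_id": "required_on_post|integer"}}
try:
    validate_rules(bad_spec)
except SpecCompileError as err:
    print(err)
```

Because the check is pure dict-walking, it costs microseconds, while a bad spec that slips through costs a 30-second generation plus debugging time.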

Final Results: adapters_spec_v4

| Metric | NL Pipeline (308 ex) | Spec Pipeline (49 ex) |
|---|---|---|
| PHP valid | 26/26 | 26/26 |
| Pest pass | 52/58 | 20/20 |
| Manual fixes | 5 | 4 |
| Semantic hallucinations | 5 | 0 |
| Training time | ~30 min | ~15 min |

The Debugging Checklist

Distilled from 34 steps of hitting walls:

Before training:
1. Tokenize ALL examples. Check max(total_tokens) < max_seq_length.
2. Check min(completion_tokens) > 0. If zero, the system prompt is too long.
3. Close all GPU-using processes. Check memory with vm_stat.
4. Use --num-layers 8 (not --lora-layers 8) on 16GB machines.

After training:
5. If val_loss = 0.000: training is broken, not perfect.
6. Generate 3-5 test files and inspect them manually before the full benchmark.
7. Run php -l on all output (syntax check).

When bugs persist:
8. Classify: is it a training data gap or a model capability limit?
9. If data gap: write 10-15 targeted examples with diverse contexts.
10. If capability limit: change the input format (structured specs).
11. If hallucinations persist after targeted training: the problem is ontological — the model's pretraining domain model diverges from yours. Give it an explicit ontology (structured spec); don't fight it with more NL examples.

What 7B Models Do Well vs Poorly

Does well:
- Individual class generation with clear patterns
- PHP syntax (very rare errors after basic fine-tuning)
- Following explicit rules in the system prompt
- CRUD operations with a single model

Does poorly:
- Multi-file consistency (imports across files)
- Knowing what NOT to add (hallucinated relationships)
- Distinguishing Laravel API versions (mixes 9.x and 13.x patterns)
- Complex relationship traversal

The key insight: 7B models don't reason about code. They pattern-match against pretraining. Every persistent bug is a missing pattern. The fix is always: add examples. If that's not enough: change the input format to remove the decision from the model entirely.

Try It Yourself

Everything is open source:

pip install mlx-lm

# Full pipeline: NL → specs → compile → PHP files
python3 pipeline_spec.py "Create a REST API for managing blog posts with tags"

# Or use a spec directly
python3 pipeline_spec.py --spec my_specs.json --output ./generated

Runs entirely on Apple Silicon. M1/M2/M3/M4 with 16GB+ RAM.


This post is an abbreviated version of: "From Hallucination to Ontology: 34 Steps Building a Domain-Specific Code Generator on Consumer Hardware" (Chis, 2026). The full paper with detailed results, bug taxonomy, and infrastructure lessons is available as a preprint.