Florinel Chis · March 2026
Most fine-tuning tutorials show you the happy path. This is the full path — including 6 training rounds that taught the model absolutely nothing, OOM crashes that killed my machine, and the realization that the real problem was never about the model.
The end result: A Laravel PHP code generator that produces 26/26 valid PHP files with 20/20 Pest tests passing. Trained on 49 examples. Runs on an Apple M2 Pro with 16GB RAM. Total cloud GPU cost: $0.
Here's how I actually got there.
The 16GB constraint shaped every architectural decision. You can't load two 7B models. You can't train with max_seq_length=4096. You close LM Studio before training or your machine crashes.
I started with 90 training examples and grew the dataset to 261 across 6 sprints. val_loss kept dropping. By Sprint 6, it hit 0.000. Perfect.
Except the generated code wasn't getting better. At all.
The system prompt (guidelines for the model) had grown organically across sprints to 2,380 tokens. My max_seq_length was 1,500.
MLX truncates training examples silently at max_seq_length. Every single training example was cut off before the code completion even started. The model was being trained to predict its own system prompt — and it got really good at that (hence val_loss=0.000).
Six sprints. Hundreds of examples. Zero code learning.
# BEFORE: 2380 tokens of verbose guidelines
SYSTEM = """You are an expert Laravel developer. When writing models,
always use the HasFactory trait. The HasFactory trait enables...
[2380 tokens of examples and explanations]"""
# AFTER: 843 tokens, compressed
SYSTEM = """Laravel 13.x code generator. Output ONLY PHP.
- model: use HasFactory, add relationships from spec
- controller: import Controller, destroy() returns noContent()
..."""
And the verification I should have done from the start:
# Check that completions aren't truncated
# (tokenizer comes from mlx_lm.load(); dataset is the list of training examples)
for example in dataset:
    tokens = tokenizer.encode(example["text"])
    assert len(tokens) < max_seq_length, f"Truncated at {len(tokens)} tokens"
Lesson: val_loss=0.000 means nothing is being learned, not that everything is perfect. Always verify your training data reaches the completions.
After fixing the truncation bug, real training started. val_loss: 0.080 (not 0.000!).
I discovered that every systematic bug can be fixed with 10-15 targeted examples:
| Bug | Examples needed | Result |
|---|---|---|
| 'optional' validation rule (not a Laravel rule) | 10 | Fixed — generates 'nullable' |
| wasRecentlyCreated in resources | 5 | Fixed — uses correct timestamps |
| Cross-resource missing imports | 13 | Fixed — 12 bugs → 0 |
| Missing HasFactory trait | 20 (fixed existing) | Fixed — 5 bugs → 0 |
The model already knows PHP. You're nudging a trained distribution, not teaching from scratch. 10-15 diverse examples of the correct pattern is enough.
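To make that concrete, here is roughly what one targeted example looks like as an mlx-lm JSONL record. The {"text": ...} field is the format mlx_lm.lora accepts, but the chat-template tags, prompt wording, and file path below are my illustration, not the actual training data:

```python
import json

# One targeted example nudging the model toward 'nullable' instead of the
# non-existent 'optional' validation rule. The <|...|> template tags and the
# data/train.jsonl path are illustrative; match them to your model's template.
example = {
    "text": (
        "<|system|>Laravel 13.x code generator. Output ONLY PHP.\n"
        "<|user|>FormRequest for Post: title required string, summary optional\n"
        "<|assistant|>public function rules(): array\n"
        "{\n"
        "    return [\n"
        "        'title' => ['required', 'string'],\n"
        "        'summary' => ['nullable', 'string'],\n"
        "    ];\n"
        "}"
    )
}

with open("data/train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```

Ten to fifteen variations of this, across different classes and field names, was enough to kill the bug.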
I built an automated bug checker. It flagged StoreBookRequest $request as "missing Illuminate\Http\Request import" because the regex 'Request $request' matched as a substring.
Test your eval script on correct code before trusting it.
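A minimal reconstruction of that false positive (the real checker's regex may differ): the substring check fires on any type hint ending in Request, while anchoring on a word boundary does not.

```python
import re

# Generated method signature: StoreBookRequest is a FormRequest subclass,
# so no Illuminate\Http\Request import is required.
code = "public function store(StoreBookRequest $request): JsonResponse"

# Substring check: also matches inside 'StoreBookRequest $request' -> false positive.
naive_hit = "Request $request" in code

# Word-boundary check: only matches a bare 'Request $request' type hint.
strict_hit = re.search(r"\bRequest \$request\b", code)

print(naive_hit, strict_hit is not None)  # True False
```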
After Sprint 9: 52/58 Pest tests passing. 6 failures remained. All were semantic hallucinations:
- a user() relationship that doesn't exist
- ->withHttpStatus() — a method that doesn't exist

Adding more NL training examples didn't help. The model was filling prompt ambiguity with its pretraining priors. The problem wasn't the model — it was the input format.
Instead of natural language:
"Create a Post model with author relationship, fillable title and body, soft deletes"
I switched to structured JSON specs:
{
  "artifact": "model",
  "class": "Post",
  "table": "posts",
  "has_factory": true,
  "soft_deletes": true,
  "fillable": ["title", "body", "user_id"],
  "relationships": [
    {"type": "BelongsTo", "model": "User", "method": "author", "foreign_key": "user_id"}
  ]
}
Result: 26/26 eval perfect. Zero semantic hallucinations. (Compare: 308 NL examples still had 5 hallucinations.)
The model can't invent a user() relationship if relationships[] explicitly lists only author. The spec removes the model's ability to hallucinate about what to generate. It only decides how.
I built a compiler that validates specs before generation:
$ python3 spec_compiler.py bad_spec.json
SpecCompileError: rules['venue_id'] contains conditional token
'required_on_post'. Use 'conditional_rules' dict instead.
Validation: <1ms. Generation: ~30s per file. Catch errors early.
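I'm not reproducing spec_compiler.py here, but the core of that check is only a few lines. A sketch of the conditional-token validation, with the token names assumed rather than taken from the repo:

```python
# Sketch of one spec_compiler.py-style validation (reconstruction, not the
# actual implementation). Conditional tokens belong in 'conditional_rules',
# never in the plain rules list.
CONDITIONAL_TOKENS = {"required_on_post", "required_on_put"}  # assumed names

class SpecCompileError(Exception):
    pass

def validate_rules(spec: dict) -> None:
    for field, rules in spec.get("rules", {}).items():
        for rule in rules:
            if rule in CONDITIONAL_TOKENS:
                raise SpecCompileError(
                    f"rules['{field}'] contains conditional token '{rule}'. "
                    "Use 'conditional_rules' dict instead."
                )

try:
    validate_rules({"rules": {"venue_id": ["required_on_post", "integer"]}})
except SpecCompileError as err:
    print(f"SpecCompileError: {err}")
```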
| Metric | NL Pipeline (308 ex) | Spec Pipeline (49 ex) |
|---|---|---|
| PHP valid | 26/26 | 26/26 |
| Pest pass | 52/58 | 20/20 |
| Manual fixes | 5 | 4 |
| Semantic hallucinations | 5 | 0 |
| Training time | ~30 min | ~15 min |
Distilled from 34 steps of hitting walls:
Before training:
1. Tokenize ALL examples. Check max(total_tokens) < max_seq_length.
2. Check min(completion_tokens) > 0. If it's zero, the system prompt is too long. (A pre-flight sketch of checks 1-2 follows this list.)
3. Close all GPU-using processes. Check memory with vm_stat.
4. Use --num-layers 8 (not --lora-layers 8) on 16GB machines.
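A sketch of checks 1 and 2 as a single pre-flight script. The tokenizer is the one returned by mlx_lm.load(); the "text" field and the <|assistant|> marker are assumptions about the data layout, so adjust them to your template:

```python
import json

MAX_SEQ_LENGTH = 1500
ASSISTANT_TAG = "<|assistant|>"  # assumed completion marker; match your template

def preflight(path: str, tokenizer) -> None:
    """Fail fast if any example would be truncated or never reaches its completion."""
    for i, line in enumerate(open(path)):
        text = json.loads(line)["text"]
        total = len(tokenizer.encode(text))
        # Check 1: the whole example must fit under max_seq_length.
        assert total < MAX_SEQ_LENGTH, f"example {i}: {total} tokens, will be truncated"
        # Check 2: the completion must contribute at least one token.
        prompt_tokens = len(tokenizer.encode(text.split(ASSISTANT_TAG)[0]))
        assert total - prompt_tokens > 0, f"example {i}: completion never reached"
```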
After training:
5. If val_loss = 0.000: training is broken, not perfect.
6. Generate 3-5 test files and inspect manually before full benchmark.
7. Run php -l on all output (syntax check; sketch below).
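Check 7 is just PHP's built-in linter run over every generated file; the ./generated path is whatever your pipeline writes to:

```python
import pathlib
import subprocess

# Run php -l (syntax check) over every generated PHP file.
failures = []
for path in pathlib.Path("./generated").rglob("*.php"):
    result = subprocess.run(["php", "-l", str(path)], capture_output=True, text=True)
    if result.returncode != 0:
        failures.append((path, (result.stderr or result.stdout).strip()))

print(f"{len(failures)} files failed php -l")
```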
When bugs persist:
8. Classify: is it a training data gap or a model capability limit?
9. If data gap: write 10-15 targeted examples with diverse contexts.
10. If capability limit: change the input format (structured specs).
11. If hallucinations persist after targeted training: the problem is ontological — the model's pretraining domain model diverges from yours. Give it an explicit ontology (structured spec), don't fight with more NL examples.
Does well:
- Individual class generation with clear patterns
- PHP syntax (very rare errors after basic fine-tuning)
- Following explicit rules in the system prompt
- CRUD operations with a single model
Does poorly:
- Multi-file consistency (imports across files)
- Knowing what NOT to add (hallucinated relationships)
- Distinguishing Laravel API versions (mixes 9.x and 13.x patterns)
- Complex relationship traversal
The key insight: 7B models don't reason about code. They pattern-match against pretraining. Every persistent bug is a missing pattern. The fix is always: add examples. If that's not enough: change the input format to remove the decision from the model entirely.
Everything is open source:
pip install mlx-lm
# Full pipeline: NL → specs → compile → PHP files
python3 pipeline_spec.py "Create a REST API for managing blog posts with tags"
# Or use a spec directly
python3 pipeline_spec.py --spec my_specs.json --output ./generated
Runs entirely on Apple Silicon. M1/M2/M3/M4 with 16GB+ RAM.
This post is an abbreviated version of: "From Hallucination to Ontology: 34 Steps Building a Domain-Specific Code Generator on Consumer Hardware" (Chis, 2026). The full paper with detailed results, bug taxonomy, and infrastructure lessons is available as a preprint.