
Local AI Models for Production: Measuring Fine-Tuning Impact on Gemma 4

author Jonathan Conway
timestamp 6 April 2026
classification public

Your quant analyst pasted three pages of a client’s portfolio summary into ChatGPT last Tuesday. Names, account numbers, position sizes. The data is sitting on OpenAI’s servers now. Your compliance team found out this morning.

This is how most financial institutions first encounter the argument for local models. Rarely through strategic planning; usually through an incident report.

Local AI deployment has moved from research project to production reality. A consumer GPU costs $1,600. A 4B parameter model fine-tunes in under 10 minutes. A single engineer can have a specialized extraction model running in an afternoon.

This guide covers three things: how to deploy local models with opinionated recommendations instead of an encyclopedia of options, how quantization makes this practical on consumer hardware, and a complete tutorial for evaluating fine-tuning impact on Gemma 4 across two domains with real before-and-after numbers.


The Deployment Stack: Opinionated Recommendations

There are a dozen tools for running local LLMs. Most guides list all of them. That’s not useful. Here’s what to use and when.

For prototyping and development: Ollama

Ollama wraps llama.cpp in a clean interface. Install it, pull a model, start talking to it. On Linux:

curl -fsSL https://ollama.com/install.sh | sh
ollama run llama3.1:8b

It exposes an OpenAI-compatible API at localhost:11434, so any application that talks to OpenAI can talk to your local model by changing the base URL. The tradeoff is performance: Ollama adds overhead that puts it at roughly 41 tokens per second versus 60 to 120 for raw llama.cpp. For development and prototyping, you won’t notice.
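
As a minimal sketch of what "changing the base URL" looks like in practice, the snippet below builds a chat-completion request against Ollama's OpenAI-compatible endpoint using only the Python standard library. It assumes the default port and the llama3.1:8b model pulled above; the helper names build_chat_request and send_chat are illustrative and not part of any SDK.

```python
import json
import urllib.request

OLLAMA_BASE_URL = "http://localhost:11434/v1"  # assumption: default Ollama port


def build_chat_request(base_url, model, user_text):
    """Build an OpenAI-style /chat/completions request (hypothetical helper)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "temperature": 0.1,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(request, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Building the request needs no running server; only send_chat() touches the network.
request = build_chat_request(
    OLLAMA_BASE_URL, "llama3.1:8b", "Summarize Q3 earnings risk in one line."
)
```

With Ollama running, send_chat(request) returns the reply text. Pointing the base URL at another OpenAI-compatible server and changing the model name should be the only edits needed.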

For Apple Silicon: llama.cpp with Metal or LM Studio with MLX

Apple Silicon’s unified memory architecture is a genuine advantage for local inference. The CPU, GPU, and Neural Engine share the same memory pool with zero-copy access. No PCIe transfer bottleneck. An M3 Max hits 40 to 50 tokens per second on a 7B model at 50 watts. An RTX 4090 pulls 350–450 watts for equivalent throughput.

If you want a GUI with inline VRAM estimates and one-click model downloads, use LM Studio. Switch to the MLX runtime (Cmd+Shift+R) for best performance on Apple Silicon. MLX often outperforms llama.cpp through native framework optimizations.

If you want maximum control, compile llama.cpp from source:

# Quickest route: prebuilt binaries via Homebrew
brew install llama.cpp
# Or build from source with Metal enabled:
git clone --depth 1 https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j$(sysctl -n hw.ncpu)

An M2 Ultra with 192GB unified memory runs a 70B model at Q5_K_M quantization with room to spare. That’s a model most people think requires datacenter hardware, running on a desktop.

For production on Linux: vLLM

vLLM implements PagedAttention, a memory management algorithm that reduces KV cache waste. The result: up to 24x higher throughput than standard HuggingFace Transformers. In published benchmarks, vLLM has reached around 790 tokens per second versus Ollama’s ~40 on comparable hardware, with sub-100ms P99 latency versus several hundred milliseconds.

uv pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 1 \
  --port 8000

If you’re building an API that serves multiple concurrent users on NVIDIA hardware, vLLM is what you measure against.

Hardware reality check

The numbers that actually matter:

Model Size | FP16 Memory | Q4_K_M Memory | Min GPU
7B         | 14 GB       | 4.3 GB        | RTX 3060 12GB
13B        | 26 GB       | 8 GB          | RTX 4090 24GB
70B        | 140 GB      | 38 GB         | 48GB+ or partial offload

A 70B model at Q4_K_M on a 24GB RTX 4090 works through partial GPU offloading: 45 to 50 of the 80 layers on the GPU, the rest on CPU RAM. You get 8 to 12 tokens per second. Slower than full GPU inference, but usable for batch processing and many interactive use cases.

Add 10 to 20 percent overhead for the KV cache and activations. Context length increases memory linearly. A 4K context window needs significantly less memory than 128K.
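
The sizing arithmetic above can be sketched in a few lines. This is a back-of-envelope estimate, not a profiler: it assumes weights dominate memory, that all layers are the same size, and that about 2 GB of VRAM is held back for the KV cache and runtime (the reserved_gb default is an assumption, not a measured figure).

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Memory for the weights alone: parameters x bits / 8, in GB (1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def with_overhead_gb(weights_gb, overhead=0.20):
    """Add a flat allowance for KV cache and activations (10-20% per the text)."""
    return weights_gb * (1 + overhead)


def layers_on_gpu(model_gb, n_layers, vram_gb, reserved_gb=2.0):
    """Rough count of equally sized layers that fit in VRAM after reserving headroom."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserved_gb, 0)
    return min(n_layers, int(usable_gb / per_layer_gb))


print(weight_memory_gb(7, 16))    # FP16 weights for a 7B model, in GB
print(weight_memory_gb(70, 16))   # FP16 weights for a 70B model, in GB
print(layers_on_gpu(38, 80, 24))  # 70B at Q4_K_M (38 GB, 80 layers) on a 24 GB card
```

The quantized sizes in the table come out a little larger than a naive 4.5-bit estimate (7 x 4.5 / 8 is about 3.9 GB versus the listed 4.3 GB), presumably because some tensors are kept at higher precision.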


Quantization: The Numbers That Matter

Quantization compresses model weights from 16-bit floating point to lower precision. This is the technical breakthrough that makes local deployment viable on consumer hardware. A 70B model that needs 140GB in full precision fits in 38GB at 4-bit quantization.

The intuition is simple. Neural network weights aren’t random; they cluster around certain values. Quantization algorithms exploit this structure to represent weights with fewer bits while preserving model behavior.
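
To make the idea concrete, here is a minimal sketch of symmetric block-wise quantization: one scale per block, integer codes in a small range. It is deliberately simpler than the K-quant scheme described in the next section and is not the llama.cpp implementation; the function names and the sample weights are made up for illustration.

```python
def quantize_block(weights, bits=4):
    """Map a block of floats to signed integer codes plus one shared scale."""
    qmax = 2 ** (bits - 1) - 1              # 7 for 4-bit signed codes
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate weights from the codes and the block scale."""
    return [c * scale for c in codes]


# Illustrative block of weights clustered near zero.
block = [0.12, -0.07, 0.03, 0.31, -0.28, 0.0, 0.15, -0.02]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(codes, round(scale, 4), round(max_error, 4))
```

The rounding error per weight is bounded by half the block scale, which is why smaller blocks (and per-block scales) preserve quality better than one scale for a whole tensor.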

GGUF K-quants: the standard for local inference

GGUF is the file format used by llama.cpp and everything built on it. K-quants use a hierarchical super-block structure that allocates different precision to different parts of the weight matrix based on sensitivity. The result is non-integer effective bits per weight with better quality-to-size ratios than uniform quantization.

Real benchmark data on Llama 3.1 8B tells the story:

Quantization | Effective Bits | Size    | GSM8K | MMLU  | PPL Increase
FP16         | 16             | 15.3 GB | 77.6% | 63.5% | baseline
Q8_0         | 8              | 8.1 GB  | 77.5% | 63.4% | +0.01
Q6_K         | 6.5            | 6.3 GB  | 78.2% | 63.2% | +0.03
Q5_K_M       | 5.5            | 5.5 GB  | 78.5% | 62.8% | +0.08
Q4_K_M       | 4.5            | 4.7 GB  | 77.4% | 62.4% | +0.24
Q3_K_M       | 3.5            | 3.8 GB  | 73.2% | 62.0% | +0.64

Q4_K_M is the sweet spot: roughly a 70% size reduction (15.3 GB to 4.7 GB) for a 0.24-point perplexity increase. Most humans cannot tell the difference in output quality. At Q3_K_M, degradation becomes noticeable (GSM8K falls from 77.6% to 73.2%). Above Q5_K_M, you’re paying memory for diminishing returns.

Recommendation: Use Q4_K_M as your default. Use Q5_K_M when quality is critical and you have memory to spare. Use Q3_K_M only for edge deployment where every megabyte counts. Don’t bother with Q2_K.
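
The recommendation can be written down as a small helper. This is a sketch under stated assumptions: the sizes are the Llama 3.1 8B figures from the table above, the budget covers weights only (no KV cache), and the function name recommend_quant is hypothetical.

```python
# Weight sizes in GB for Llama 3.1 8B, copied from the table above.
QUANT_SIZE_GB = {
    "FP16": 15.3,
    "Q5_K_M": 5.5,
    "Q4_K_M": 4.7,
    "Q3_K_M": 3.8,
}


def size_reduction_pct(quant):
    """Size reduction relative to FP16, in percent."""
    return (1 - QUANT_SIZE_GB[quant] / QUANT_SIZE_GB["FP16"]) * 100


def recommend_quant(budget_gb, quality_critical=False):
    """Q4_K_M by default, Q5_K_M if quality matters and fits, Q3_K_M as a last resort."""
    if quality_critical and budget_gb >= QUANT_SIZE_GB["Q5_K_M"]:
        return "Q5_K_M"
    if budget_gb >= QUANT_SIZE_GB["Q4_K_M"]:
        return "Q4_K_M"
    if budget_gb >= QUANT_SIZE_GB["Q3_K_M"]:
        return "Q3_K_M"
    return None  # nothing recommended here fits; Q2_K is deliberately excluded


print(round(size_reduction_pct("Q4_K_M"), 1))
print(recommend_quant(8.0), recommend_quant(8.0, quality_critical=True))
```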

Other quantization methods worth understanding

AWQ (Activation-aware Weight Quantization) identifies that protecting just 1% of salient weight channels based on activation magnitudes dramatically reduces quantization error. 4-bit AWQ adds only 0.1 perplexity points versus FP16. It requires calibration data but produces hardware-friendly uniform bit-width output.

GPTQ uses approximate second-order information (the Hessian) to quantize weights one at a time, updating remaining weights to compensate for each quantization decision. On OPT-175B, 4-bit GPTQ adds only 0.03 perplexity. A 175B model quantizes in roughly four GPU hours on an A100 (per the GPTQ paper).

TurboQuant from recent research targets KV cache compression rather than weight quantization. It achieves 3.5 bits per channel with zero quality loss using a data-oblivious two-stage approach: MSE-optimized quantization via random rotation (PolarQuant), followed by a 1-bit Johnson-Lindenstrauss transform on the residual. On an H100, attention processing runs 8x faster. This matters for long-context inference where the KV cache dominates memory.

For inference with GGUF models, K-quants are the right choice. For training, QLoRA with 4-bit quantization is the standard on both NVIDIA (BitsAndBytes) and Apple Silicon (MLX). For production serving, AWQ or GPTQ through vLLM.


Tutorial: When Fine-Tuning Works, When It Doesn’t, and How to Know

Every fine-tuning tutorial shows improvement. That’s because tutorials pick tasks where the base model is weak and training data is abundant. Real projects don’t have that luxury. Your model may already be good at the task. Fine-tuning can hurt as often as it helps, and you need to know the difference.

We learned this the hard way. We originally set out to fine-tune Gemma 4 for structured financial event extraction, pulling entities, relationships, and temporal data from news text as parseable JSON. The base model already extracted events at 93% accuracy out of the box. Fine-tuning didn’t improve it; it degraded it. Every set of parameters we tried (200 steps, 60 steps, 40 steps) made the model worse. The model already knew how to extract events. We were trying to teach it something it already knew.

So we pivoted. Instead of pretending fine-tuning is a universal improvement, we ran a real experiment: establish baselines on tasks where the model is genuinely weak, fine-tune on domain-specific data, and measure the delta. This tutorial documents that experiment with real numbers. Every accuracy claim below comes from actual test runs against real benchmarks. No speculation.

Three benchmarks, three stories

We picked two domains — medical and finance — and three benchmarks with very different baseline accuracies on Gemma 4 E4B (4-bit quantized, MLX format):

Benchmark | Domain              | Baseline Accuracy | What It Tests
MedMCQA   | Medical QA          | 35.47%            | Factual medical knowledge across 21 specialties
FPB       | Financial Sentiment | 78.76%            | Classifying financial text as positive/negative/neutral
ConvFinQA | Financial Reasoning | 5.91%             | Extracting numerical answers from financial tables and doing math

These baselines tell you everything. MedMCQA at 35% means the model lacks medical domain knowledge, so there’s room to improve. FPB at 79% means the model already understands financial sentiment, which makes it risky to fine-tune. ConvFinQA at 6% means the model cannot do financial numerical reasoning at all — the most room to improve, but also the hardest task.

The question: does fine-tuning on domain-specific data improve these numbers, and if so, by how much?


Gemma 4: the model we tested

Google released Gemma 4 on April 2, 2026 under Apache 2.0 licensing. Full commercial freedom, no usage restrictions.

Variant       | Active Params | Total Params | Context | Best For
E2B           | 2.3B          | 5.1B         | 128K    | Mobile, edge
E4B           | 4.5B          | 8B           | 128K    | On-device, fine-tuning
26B-A4B (MoE) | 3.8B          | 25.2B        | 256K    | Workstation inference
31B Dense     | 30.7B         | 30.7B        | 256K    | Maximum quality

We used the E4B variant: 4.5 billion active parameters, fits in 6-10 GB with QLoRA, fine-tunes in under 10 minutes on an Apple Silicon Mac with 16GB unified memory. We used the 4-bit quantized MLX format from mlx-community.

One thing to know: Gemma 4 E4B uses a vision-language model architecture even for text-only tasks. You must load it through the VLM path (FastVisionModel instead of FastLanguageModel) and use mlx_vlm for inference, not mlx_lm. Training data is still text-only. You don’t need images.

Setting up the evaluation environment

On Apple Silicon:

uv pip install 'mlx-tune>=0.4.18' mlx-vlm datasets scikit-learn tqdm numpy

That’s the full install. mlx-tune pulls in MLX and mlx-lm. The >=0.4.18 lower bound ensures Gemma 4 support.

Running baseline evaluations

Before any training, evaluate the base model on your benchmarks. This takes time (1-2 hours total for all three benchmarks) but tells you whether fine-tuning will help or hurt.

MedMCQA (medical QA, 500-sample stratified subset):

import re

import datasets
import mlx_lm

MODEL_ID = "mlx-community/gemma-4-e4b-it-4bit"
# MedMCQA stores the correct option ("cop") as an index from 0 to 3.
COP_MAP = {0: "A", 1: "B", 2: "C", 3: "D"}
SYSTEM_PROMPT = (
    "You are a medical expert. Answer the following multiple-choice question. "
    "Respond with ONLY the letter of the correct answer: A, B, C, or D. "
    "Do not explain your reasoning."
)

# Elsewhere this guide says Gemma 4 must be loaded through mlx_vlm rather than mlx_lm.
# If mlx_lm.load() fails on this checkpoint, swap in mlx_vlm.load() and mlx_vlm.generate().
model, tokenizer = mlx_lm.load(MODEL_ID)
# As written, the loop scores every row of the validation split; subsample first for a 500-example run.
ds = datasets.load_dataset("openlifescienceai/medmcqa", split="validation")

correct = 0
for example in ds:
    # Present the four options as lettered choices.
    user_msg = f"{example['question']}\n\nA. {example['opa']}\nB. {example['opb']}\nC. {example['opc']}\nD. {example['opd']}"
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_msg},
    ]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    response = mlx_lm.generate(model, tokenizer, prompt=prompt, max_tokens=200, verbose=False)
    # Take the first standalone A-D in the uppercased response as the prediction.
    predicted = re.search(r'\b([ABCD])\b', response.upper())
    gold = COP_MAP.get(example["cop"])
    # Responses with no parseable letter count as wrong.
    if predicted and predicted.group(1) == gold:
        correct += 1

print(f"MedMCQA Accuracy: {correct / len(ds) * 100:.2f}%")
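
The loop above walks the whole validation split, while the reported run used a stratified subset of about 500 questions. The sketch below shows one way such a subset could be drawn: proportional allocation per group with largest-remainder rounding. It is an assumption about the method, not the script behind the reported numbers, and the grouping column name subject_name is assumed from the dataset card.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, n, seed=42):
    """Draw n rows so each group's share matches its share of the full set."""
    rows = list(rows)
    if n >= len(rows):
        return rows
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    # Proportional quota per group, rounded down.
    quotas = {g: len(items) * n / len(rows) for g, items in groups.items()}
    alloc = {g: int(q) for g, q in quotas.items()}

    # Hand the leftover slots to the groups with the largest fractional parts.
    leftover = n - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda g: quotas[g] - alloc[g], reverse=True)
    for g in by_remainder[:leftover]:
        alloc[g] += 1

    sample = []
    for g, items in groups.items():
        sample.extend(rng.sample(items, alloc[g]))
    return sample


# Toy data standing in for the real split (hypothetical subjects).
toy_rows = (
    [{"subject_name": "Anatomy"}] * 60
    + [{"subject_name": "Surgery"}] * 30
    + [{"subject_name": "Dental"}] * 10
)
subset = stratified_sample(toy_rows, "subject_name", 10)
```

With the real dataset, something like stratified_sample(ds, "subject_name", 500) would yield a list of rows to iterate over in place of ds.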

FPB (financial sentiment):

from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score

# Reuses `model`, `tokenizer`, and `mlx_lm` from the MedMCQA block above.
ds = load_dataset("AdaptLLM/finance-tasks", "FPB", split="test")
y_true, y_pred = [], []

for ex in ds:
    # The input is a multi-paragraph block; the sentence to classify sits in the last paragraph.
    blocks = ex["input"].strip().split("\n\n")
    last_block = blocks[-1]
    prefix = "Please tell me the sentiment of the following sentence: "
    # Strip the instruction prefix and the trailing options list when present.
    sentence = last_block[len(prefix):].split("\nOptions")[0] if last_block.startswith(prefix) else last_block
    gold = ex["options"][ex["gold_index"]]

    messages = [
        {"role": "system", "content": "You are a financial sentiment analyst. Respond with ONLY one word: positive, negative, or neutral."},
        {"role": "user", "content": sentence},
    ]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    response = mlx_lm.generate(model, tokenizer, prompt=prompt, max_tokens=8, verbose=False)

    # Use the first label word found in the response, capitalized to match the gold labels.
    pred = None
    for label in ("positive", "negative", "neutral"):
        if label in response.lower():
            pred = label.capitalize()
            break
    # Fall back to Neutral when no label word appears.
    if pred is None:
        pred = "Neutral"

    y_true.append(gold)
    y_pred.append(pred)

print(f"FPB Accuracy: {accuracy_score(y_true, y_pred) * 100:.2f}%")
print(f"FPB Macro F1: {f1_score(y_true, y_pred, average='macro') * 100:.2f}%")

ConvFinQA (financial numerical reasoning):

import re

from datasets import load_dataset

# Reuses `model`, `tokenizer`, and `mlx_lm` from the MedMCQA block above.
ds = load_dataset("AdaptLLM/finance-tasks", "ConvFinQA", split="test")
correct = 0

for ex in ds:
    # The gold answer is a string; strip separators and percent signs, then take the first number.
    gold_num = None
    cleaned = str(ex.get("label", "")).replace(",", "").replace("%", "")
    m = re.search(r"-?\d+\.?\d*", cleaned)
    if m:
        gold_num = float(m.group())

    messages = [
        {"role": "system", "content": "Respond with ONLY the final numerical answer. No units or explanation."},
        {"role": "user", "content": ex["input"]},
    ]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    response = mlx_lm.generate(model, tokenizer, prompt=prompt, max_tokens=32, verbose=False)

    # Take the first number in the response as the prediction.
    pred_num = None
    m2 = re.search(r"-?\d+\.?\d*", response.replace(",", ""))
    if m2:
        pred_num = float(m2.group())

    # Count a match within 1% relative tolerance (absolute tolerance when the gold value is 0).
    if pred_num is not None and gold_num is not None:
        if gold_num == 0:
            match = abs(pred_num) < 0.01
        else:
            match = abs(pred_num - gold_num) / abs(gold_num) < 0.01
        if match:
            correct += 1

print(f"ConvFinQA Accuracy: {correct / len(ds) * 100:.2f}%")

Our baseline results:

Benchmark     | Baseline Accuracy | Samples | Runtime
MedMCQA       | 35.47%            | 499     | ~45 min
FPB Sentiment | 78.76%            | 970     | ~20 min
ConvFinQA     | 5.91%             | 1,490   | ~23 min

Fine-tuning: two domains, same parameters

We used the same QLoRA configuration for both domains to keep the comparison fair:

  • LoRA rank: 16, alpha: 16
  • Max steps: 60
  • Learning rate: 1e-4, linear decay
  • Gradient accumulation: 4
  • Warmup ratio: 0.1
  • Max sequence length: 512
  • Trainable parameters: 36.7M (0.49% of total)
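
A quick sketch of what these settings imply, under two assumptions: each step is counted after gradient accumulation with a per-device batch of 1, and the scheduler follows the usual linear warmup then linear decay shape. Whether mlx-tune computes warmup exactly this way is not confirmed here; the function names are illustrative.

```python
MAX_STEPS = 60
GRAD_ACCUM = 4
PER_DEVICE_BATCH = 1   # assumption: batch size 1 per device
BASE_LR = 1e-4
WARMUP_RATIO = 0.1


def examples_seen():
    """Training examples consumed over the whole run."""
    return MAX_STEPS * GRAD_ACCUM * PER_DEVICE_BATCH


def lr_at(step):
    """Learning rate at an optimizer step: linear warmup, then linear decay to zero."""
    warmup_steps = round(MAX_STEPS * WARMUP_RATIO)
    if step < warmup_steps:
        return BASE_LR * step / warmup_steps
    remaining = (MAX_STEPS - step) / (MAX_STEPS - warmup_steps)
    return BASE_LR * max(0.0, remaining)


print(examples_seen())
print([lr_at(s) for s in (0, 3, 6, 33, 60)])
```

Under those assumptions, 60 steps cover about 240 examples, so a run this short touches only a small slice of whatever dataset is selected. That is worth keeping in mind when comparing dataset sizes.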

Medical: ChatDoctor dataset

We fine-tuned on 5,000 examples from ChatDoctor-HealthCareMagic-100k — real medical Q&A pairs between patients and doctors. The data has three columns: instruction, input, and output. We format it into the VLM chat format:

from mlx_tune import FastVisionModel, UnslothVisionDataCollator, VLMSFTTrainer
from mlx_tune.vlm import VLMSFTConfig
from datasets import load_dataset

BASE_MODEL = "mlx-community/gemma-4-e4b-it-4bit"

model, processor = FastVisionModel.from_pretrained(model_name=BASE_MODEL, load_in_4bit=True)

model = FastVisionModel.get_peft_model(
    model,
    r=16, lora_alpha=16, lora_dropout=0,
    finetune_vision_layers=False,
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
    bias="none", use_rslora=False, random_state=3407,
)

ds = load_dataset("lavita/ChatDoctor-HealthCareMagic-100k", split="train")
ds = ds.shuffle(seed=42).select(range(5000))

train_data = []
for ex in ds:
    instruction = ex.get("instruction", "").strip()
    inp = ex.get("input", "").strip()
    output = ex.get("output", "").strip()
    if not instruction or not output:
        continue
    user_text = instruction + (f"\n\n{inp}" if inp else "")
    train_data.append({
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": user_text}]},
            {"role": "assistant", "content": [{"type": "text", "text": output}]},
        ]
    })

FastVisionModel.for_training(model)

trainer = VLMSFTTrainer(
    model=model, tokenizer=processor,
    data_collator=UnslothVisionDataCollator(model, processor),
    train_dataset=train_data,
    args=VLMSFTConfig(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_ratio=0.1,
        max_steps=60,
        learning_rate=1e-4,
        logging_steps=5,
        optim="adam",
        weight_decay=0.001,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="models/medical_lora",
        report_to="none",
        remove_unused_columns=False,
        dataset_text_field="",
        dataset_kwargs={"skip_prepare_dataset": True},
        max_length=512,
    ),
)

trainer.train()
model.save_pretrained("models/medical_lora")
processor.save_pretrained("models/medical_lora")

Training completed in 7.6 minutes on an M-series Mac. Loss dropped from 8.06 to 3.93 over 60 steps.

Finance: finance-alpaca dataset

Same QLoRA config, different dataset. We used 2,000 examples from gbharti/finance-alpaca — general financial instruction-following data covering topics from portfolio theory to risk management.

ds = load_dataset("gbharti/finance-alpaca", split="train")
ds = ds.shuffle(seed=42).select(range(2000))

train_data = []
for ex in ds:
    instruction = ex.get("instruction", "").strip()
    inp = ex.get("input", "").strip()
    output = ex.get("output", "").strip()
    if not instruction or not output:
        continue
    user_text = instruction + (f"\n\n{inp}" if inp else "")
    train_data.append({
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": user_text}]},
            {"role": "assistant", "content": [{"type": "text", "text": output}]},
        ]
    })

Same trainer config, same 60 steps. Training time: similar.

Post-training evaluation: the results

After training, we switched to inference mode and re-ran the same benchmarks against the fine-tuned models. The medical model was evaluated on MedMCQA. The finance model was evaluated on both FPB and ConvFinQA.

Benchmark              | Baseline | Post-FT | Delta (pts) | Direction
MedMCQA (medical FT)   | 35.47%   | 42.28%  | +6.81       | Improved
FPB (finance FT)       | 78.76%   | 50.41%  | -28.35      | Degraded
ConvFinQA (finance FT) | 5.91%    | 20.94%  | +15.03      | Improved

Three benchmarks. Three different outcomes.

MedMCQA improved by 6.8 points (a 19% relative gain, from 35.5% to 42.3%). The baseline sat only about ten points above random guessing on a 4-choice test (25%). Fine-tuning on the 5,000-example medical Q&A set produced a modest but real improvement: the model learned some medical facts it didn’t know, and unparseable responses dropped from 1.6% to 0%.

FPB collapsed by 28 points (a 36% relative drop, from 78.8% to 50.4%). The baseline model already understood financial sentiment well. Fine-tuning on finance-alpaca, which contains diverse financial instructions rather than sentiment-specific data, overwrote that capability. This is catastrophic forgetting: the training data never exercised sentiment classification, so nothing preserved the model’s existing skill while the adapter weights shifted.

ConvFinQA more than tripled, reaching about 3.5x the baseline (from 5.9% to 20.9%). The baseline model essentially could not do numerical reasoning on financial data. After fine-tuning, it learned to extract and compute numbers from financial tables. But 20.9% is still poor in absolute terms. Numerical reasoning is genuinely hard for 4.5B parameter models, and 60 steps of QLoRA doesn’t fix that.
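
The deltas quoted above can be reproduced from the two results tables. The sketch below also adds a rough binomial standard error for each baseline, which is an addition for context rather than part of the original analysis; it treats every question as an independent trial, so read it only as a gauge of how much a score could move by chance.

```python
import math

# (baseline %, post-fine-tuning %, sample count) taken from the tables above.
RESULTS = {
    "MedMCQA": (35.47, 42.28, 499),
    "FPB": (78.76, 50.41, 970),
    "ConvFinQA": (5.91, 20.94, 1490),
}


def summarize(baseline, post, n):
    """Return (absolute delta in points, relative change in %, baseline std. error in points)."""
    delta = post - baseline
    relative = delta / baseline * 100
    p = baseline / 100
    std_error = math.sqrt(p * (1 - p) / n) * 100
    return round(delta, 2), round(relative, 1), round(std_error, 2)


for name, row in RESULTS.items():
    print(name, summarize(*row))
```

A delta of several standard errors, as with FPB, is hard to attribute to noise. A delta of only two or three, as with MedMCQA, is suggestive but would benefit from a larger evaluation set.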

What the results mean

The pattern is clear: fine-tuning improves what the model is bad at, and degrades what it’s already good at. This isn’t a new finding in ML literature; it’s the core tension of transfer learning. But seeing it quantified on the same model with the same training parameters across different tasks makes it visceral.

The practical decision framework:

Baseline Accuracy | Fine-Tuning Strategy                             | Expected Outcome
Below 40%         | Train aggressively (lr=1e-4, 60+ steps)          | Likely improvement
40-70%            | Train conservatively (lr=5e-5, 30-40 steps)      | Possible improvement, monitor closely
Above 70%         | Don’t fine-tune, or use very conservative params | Likely degradation

Always measure baselines first. If you skip this step, you won’t know whether fine-tuning helped or hurt. You’ll just deploy a worse model and wonder why your downstream metrics dropped.
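
The framework is simple enough to encode as a guard in a training script. A minimal sketch, assuming the thresholds from the table and treating exactly 70% as the conservative band; the function name and the returned keys are illustrative, and the "very conservative" values are the ones suggested later in this guide (max_steps=20, lr=5e-5).

```python
def recommend_strategy(baseline_accuracy_pct):
    """Map a measured baseline accuracy (in percent) to a fine-tuning plan."""
    if baseline_accuracy_pct < 40:
        return {"action": "train_aggressively", "learning_rate": 1e-4, "max_steps": 60}
    if baseline_accuracy_pct <= 70:
        return {"action": "train_conservatively", "learning_rate": 5e-5, "max_steps": 30}
    # Strong baseline: skipping is the default; these values apply only if you must train.
    return {"action": "skip_or_minimal", "learning_rate": 5e-5, "max_steps": 20}


for name, baseline in [("MedMCQA", 35.47), ("FPB", 78.76), ("ConvFinQA", 5.91)]:
    print(name, recommend_strategy(baseline)["action"])
```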

Export and deployment

Save the LoRA adapters (small, ~700MB for Gemma 4 E4B):

model.save_pretrained("models/medical_lora")
processor.save_pretrained("models/medical_lora")

To load adapters for inference, you must reload the base model, apply the same LoRA config, and inject the adapter weights:

import mlx.core as mx
from mlx_tune import FastVisionModel
import mlx_vlm

model, processor = FastVisionModel.from_pretrained(
    "mlx-community/gemma-4-e4b-it-4bit", load_in_4bit=True
)
model = FastVisionModel.get_peft_model(model, r=16, lora_alpha=16, ...)
weights = mx.load("models/medical_lora/adapters.safetensors")
lora_keys = [k for k in weights.keys() if k.endswith((".A", ".B"))]
lora_weights = {k: weights[k] for k in lora_keys}
model.model.load_weights(list(lora_weights.items()), strict=False)
FastVisionModel.for_inference(model)

Note: save_pretrained_merged() does not reliably produce a standalone model from quantized base models with Gemma 4. The LoRA A/B matrices remain in the saved weights, and mlx_vlm.load() rejects them. The workaround is to load base + apply LoRA + inject adapters in memory, as shown above.

For GGUF export, fuse the LoRA adapters into a non-quantized base model and then export via mlx_lm.fuse:

python -m mlx_lm.fuse \
  --model mlx-community/gemma-4-e4b-it \
  --adapter-path models/medical_lora \
  --save-path models/medical_merged \
  --export-gguf \
  --gguf-path gemma-medical.Q4_K_M.gguf

GGUF export requires a non-quantized base model. Starting from the 4-bit MLX base won’t work — re-fuse from the fp16 variant (mlx-community/gemma-4-e4b-it, not the -4bit suffix).

For Ollama deployment:

FROM ./gemma-medical.Q4_K_M.gguf
PARAMETER temperature 0.1
SYSTEM "You are a medical expert assistant."

Save that as Modelfile, then build the model:

ollama create gemma-medical -f Modelfile

Why This Matters for Agent Memory

Domain-specific models that actually understand medical or financial text are the ingestion layer for knowledge graphs. Raw text goes in. Typed entities, relationships, and temporal metadata come out. This is the data that feeds temporal graph stores, powers spreading activation retrieval, and enables the kind of multi-hop reasoning that vector search alone cannot provide.

But this only works if the model is actually good at the domain task. Our results show that fine-tuning can help (MedMCQA improved by 19% in relative terms), but it can also hurt: FPB fell by 36% in relative terms under the same training pipeline. Deploying a degraded model to production because you assumed fine-tuning would help is worse than not fine-tuning at all.

A local model paired with a local memory engine means the entire pipeline runs on your infrastructure. But measure first. Deploy second.


Technical Hurdles You’ll Hit

Catastrophic forgetting when baseline is strong. This is the most important finding in this tutorial. FPB sentiment accuracy dropped from 79% to 50% after 60 steps of QLoRA with lr=1e-4. The model’s existing capability was overwritten. The fix: measure baselines first. If accuracy is above 70%, either don’t fine-tune or use very conservative parameters (max_steps=20, lr=5e-5, r=8). If you must fine-tune for other reasons (schema alignment, output format), train on responses only and monitor held-out accuracy.

Wrong model loading path for Gemma 4. Gemma 4 is a VLM architecture. You must use FastVisionModel and VLMSFTTrainer on the training side, and mlx_vlm (not mlx_lm) for inference. If you try the language model path, loading fails or produces garbage. This applies to both mlx-tune and Unsloth.

Wrong data format for VLM training. The VLM trainer expects data as a list of dicts with "messages" key and typed content blocks ([{"type": "text", "text": "..."}]). Using the old "conversations" format or plain strings will cause the UnslothVisionDataCollator to fail.
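
A small pre-flight check can catch this before the collator does. The sketch below validates the shape described above for text-only training; the function name, the accepted role set, and the error wording are assumptions, not part of mlx-tune.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}  # assumption: roles used in this guide


def validate_vlm_record(record):
    """Return a list of problems with one training record; an empty list means it looks fine."""
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ['missing or empty "messages" list']

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict) or msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: missing or unknown role")
            continue
        content = msg.get("content")
        if not isinstance(content, list):
            problems.append(f"message {i}: content must be a list of typed blocks")
            continue
        for j, block in enumerate(content):
            ok = (
                isinstance(block, dict)
                and block.get("type") == "text"
                and isinstance(block.get("text"), str)
            )
            if not ok:
                problems.append(f"message {i}, block {j}: expected a text block")
    return problems


good = {"messages": [
    {"role": "user", "content": [{"type": "text", "text": "What causes anemia?"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "Common causes include..."}]},
]}
plain_string = {"messages": [{"role": "user", "content": "What causes anemia?"}]}
old_format = {"conversations": [{"from": "human", "value": "What causes anemia?"}]}
```

Running the check over train_data before constructing the trainer gives a readable error instead of a collator failure partway through setup.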

Merged model won’t load for inference. save_pretrained_merged() on a quantized Gemma 4 model does not produce a clean standalone model. The saved weights contain LoRA A/B matrices that mlx_vlm.load() rejects. The workaround: load the base model, apply LoRA config, inject adapter weights with load_weights(strict=False), then run inference. This is the approach used in our evaluation scripts.

model.generate() doesn’t exist on merged VLM models. After loading a merged model, you must use mlx_vlm.generate(model, processor, prompt=..., max_tokens=..., verbose=False) instead of model.generate(). The generate method lives in the mlx_vlm module, not on the model object.

AdaptLLM/finance-tasks config names are case-sensitive. The correct config names are "FPB" and "ConvFinQA" — uppercase. Using lowercase "fpb" or "convfinqa" will fail with a config not found error.

FPB input format requires parsing. The input field contains a multi-paragraph block where the actual sentence is embedded. Extract the last sentence by splitting on "Please tell me the sentiment of the following sentence: " and then on "\nOptions". The gold label comes from ex["options"][ex["gold_index"]].

ConvFinQA uses input and label columns. Not output. The label column contains the numerical answer as a string. Parse it with a regex to extract the number. Compare predictions with 1% relative tolerance.

MedMCQA needs max_tokens=200. The model uses “thinking” tokens before answering. With max_tokens=8, the answer often gets cut off before the model outputs A/B/C/D. Use max_tokens=200 and extract the answer letter with re.search(r'\b([ABCD])\b', response.upper()).
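
One caveat with uppercasing the whole response: the article "a" in any reasoning text becomes a standalone "A" and can be picked up as the answer. The sketch below shows a stricter variant that keeps the original casing and takes the last standalone capital letter. It is an alternative to consider, not the extraction used for the numbers reported in this tutorial, and the function name is hypothetical.

```python
import re


def extract_choice(response):
    """Return the last standalone capital A-D, or a bare single-letter reply, else None."""
    matches = re.findall(r"\b([ABCD])\b", response)
    if matches:
        return matches[-1]
    stripped = response.strip().upper().rstrip(".")
    return stripped if stripped in {"A", "B", "C", "D"} else None


sample = "I need a moment. The answer is C"
first_after_upper = re.search(r"\b([ABCD])\b", sample.upper()).group(1)
print(first_after_upper, extract_choice(sample))
```

Whichever rule you choose, apply the same one to the baseline and the fine-tuned model so the delta reflects the model and not the parser.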

NaN gradients with vision training on Gemma 4. As of mlx-tune v0.4.18, training Gemma 4 with images produces NaN gradients due to a backward pass issue in mlx-vlm. Text-only training works fine. If you need vision fine-tuning, use Qwen3.5 instead.

CUDA version mismatches. PyTorch expects a specific CUDA toolkit version. Verify with nvcc --version and nvidia-smi. Install the matching PyTorch wheel:

uv pip install torch --index-url https://download.pytorch.org/whl/cu121

Out of memory during training. VLM training requires per_device_train_batch_size=1 (forced). Use gradient_accumulation_steps to simulate larger batches. Reduce max_length to 256. If still failing on Apple Silicon, close browser tabs and other apps — unified memory is shared with everything on the system. On 8GB Macs, use Gemma 4 E2B instead of E4B.

Thermal throttling. GPUs above 85°C throttle clock speeds silently. If inference is slower than expected, check temperatures with nvidia-smi. On Apple Silicon, thermal throttling is managed automatically but visible as performance drops on sustained workloads.


Where to Go From Here

Measure before you train. This is the single most important takeaway. Run baseline evaluations on your actual task before fine-tuning. If the model is already above 70% accuracy, fine-tuning is more likely to hurt than help. If it’s below 40%, fine-tuning is likely to help.

Match training data to your benchmark. Our finance fine-tuning improved ConvFinQA (numerical reasoning) but destroyed FPB (sentiment). The finance-alpaca dataset contains diverse financial instructions, not sentiment-specific examples. If you need to improve sentiment classification, train on sentiment data. If you need numerical reasoning, train on numerical data. Generic domain data helps the tasks within that domain that the model is weak at, and can hurt the tasks it’s already good at.

Try fewer steps and lower learning rates. We used 60 steps with lr=1e-4. For tasks where the baseline is moderate (40-70%), try 20-30 steps with lr=5e-5. For tasks where the baseline is strong (>70%), consider whether fine-tuning is necessary at all. Our original financial extraction task had a 93% baseline and every amount of fine-tuning we tried made it worse.

Scale the training data. We used 5,000 medical examples and 2,000 finance examples. Both showed meaningful deltas. With 10,000+ examples and more training steps, improvements would likely be larger — but so would the risk of forgetting on tasks the model already knows.

Chain with a memory system. Domain-specific models that understand medical or financial text produce structured data that feeds knowledge graphs: entities become nodes, relationships become edges, temporal metadata enables point-in-time queries. This is the architecture that turns raw documents into queryable institutional knowledge.

Port between platforms. The mlx-tune API mirrors Unsloth. Write and test on a Mac, then change imports and model names to run on NVIDIA with Unsloth for production-scale training. The data format and training loop stay the same.


[Figure: Fine-Tuning Results: Baseline vs Post-FT Across Three Benchmarks]

[Figure: Quantization Quality vs Size Trade-off]

[Figure: Local Model Deployment Decision Flow]

[Figure: Fine-Tuning Decision Matrix]


Every accuracy number in this tutorial came from actual test runs against real benchmarks. The evaluation scripts live at github.com/deep-thinking-lab/kizuna-mem — the repository is currently private during limited client testing and will be made public the week of April 13, 2026. The complete training pipeline runs on Apple Silicon with mlx-tune or on NVIDIA GPUs with Unsloth.