Fine-tuning Workflow
```
┌────────────────────────────────────────────────────────┐
│ Data Preparation                                       │
│ (Collection, cleaning, formatting, train/val split)    │
└────────────────────────────────────────────────────────┘
                           │
                           ▼
┌────────────────────────────────────────────────────────┐
│ Baseline Evaluation                                    │
│ (Measure base model on your tasks)                     │
└────────────────────────────────────────────────────────┘
                           │
                           ▼
┌────────────────────────────────────────────────────────┐
│ Fine-tuning                                            │
│ (Training, hyperparameter tuning, checkpointing)       │
└────────────────────────────────────────────────────────┘
                           │
                           ▼
┌────────────────────────────────────────────────────────┐
│ Evaluation & Testing                                   │
│ (Accuracy, quality, safety, regression)                │
└────────────────────────────────────────────────────────┘
                           │
                           ▼
┌────────────────────────────────────────────────────────┐
│ Deployment                                             │
│ (Serving, monitoring, iteration)                       │
└────────────────────────────────────────────────────────┘
```
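The stages above can be sketched as a single driver function. Every stage function here is a hypothetical placeholder standing in for real data prep, evaluation, and training code:

```python
# Minimal sketch of the workflow diagram; the stage functions are placeholders.

def prepare_data(raw_examples):
    """Clean raw examples and split into train/validation sets."""
    cleaned = [ex for ex in raw_examples if ex.get("response")]
    split = int(len(cleaned) * 0.9)
    return {"train": cleaned[:split], "validation": cleaned[split:]}

def evaluate(model, examples):
    """Stand-in evaluation: fraction of examples the model handles."""
    return sum(1 for ex in examples if model(ex)) / max(len(examples), 1)

def run_pipeline(raw_examples, base_model, finetune):
    data = prepare_data(raw_examples)                    # 1. data preparation
    baseline = evaluate(base_model, data["validation"])  # 2. baseline evaluation
    tuned = finetune(base_model, data["train"])          # 3. fine-tuning
    score = evaluate(tuned, data["validation"])          # 4. evaluation & testing
    return {"baseline": baseline, "tuned": score}        # 5. deploy only if improved
```

The point of structuring it this way is the baseline step: without a measurement of the base model on your own tasks, you cannot tell whether fine-tuning helped.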
Training Data Preparation
```python
from datasets import Dataset


class FineTuningDataPreparer:
    def __init__(self, tokenizer, system_prompt: str):
        self.tokenizer = tokenizer
        self.system_prompt = system_prompt

    def prepare_instruction_dataset(self, raw_data: list[dict]) -> Dataset:
        formatted = []
        for example in raw_data:
            formatted.append({
                "instruction": example["task"],
                "input": example.get("context", ""),
                "output": example["response"],
                "system": self.system_prompt,
            })

        # Format for training
        dataset = Dataset.from_list(formatted)

        # Apply the model's chat template to each example
        def format_chat(example):
            messages = [
                {"role": "system", "content": example["system"]},
                {"role": "user", "content": f"{example['instruction']}\n{example['input']}"},
                {"role": "assistant", "content": example["output"]},
            ]
            # tokenize=False returns the templated string rather than token ids
            return {"text": self.tokenizer.apply_chat_template(messages, tokenize=False)}

        return dataset.map(format_chat)
```
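The train/val split mentioned in the workflow can be applied to the formatted examples before training (`datasets` also offers `Dataset.train_test_split`). A plain-Python sketch:

```python
# Minimal sketch of a shuffled train/validation split over formatted examples.
import random

def train_val_split(examples, val_fraction=0.1, seed=42):
    """Shuffle and split a list of examples into (train, validation) lists."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]
```

Holding out the validation set before any training is what makes the later evaluation step meaningful.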
PEFT/LoRA Training
```python
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments
from trl import SFTTrainer


def train_with_lora(base_model: str, dataset: Dataset):
    # Load base model in 4-bit precision for QLoRA-style training
    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        load_in_4bit=True,  # QLoRA
        device_map="auto",
    )

    # Configure LoRA: low-rank adapters on the attention projections
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)

    # Training arguments
    training_args = TrainingArguments(
        output_dir="./finetuned",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        logging_steps=10,
        save_strategy="epoch",
    )

    # Train
    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        args=training_args,
        max_seq_length=2048,
    )
    trainer.train()
    return model
```
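A quick sanity check on the hyperparameters above: with a per-device batch of 4 and 4 gradient-accumulation steps, each optimizer step sees 16 examples. The dataset size of 1,000 below is a made-up example, not a recommendation:

```python
# Compute effective batch size and total optimizer steps from the settings above.
def training_steps(num_examples, per_device_batch=4, grad_accum=4, epochs=3, n_gpus=1):
    effective_batch = per_device_batch * grad_accum * n_gpus
    steps_per_epoch = -(-num_examples // effective_batch)  # ceiling division
    return effective_batch, steps_per_epoch * epochs

# With 1,000 examples and the defaults above:
# effective batch size 16, 63 steps/epoch, 189 total steps
```

Numbers like these help decide whether `logging_steps=10` gives enough loss-curve resolution and whether an epoch-level `save_strategy` is frequent enough.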
Fine-tuning Approaches
| Approach | Cost | Quality | Infrastructure |
|---|---|---|---|
| OpenAI Fine-tune | Medium | High | Managed |
| Full Fine-tune | High | Highest | GPU cluster |
| LoRA | Low | Very Good | Single GPU |
| QLoRA | Very Low | Good | Consumer GPU |
| Distillation | Medium | Good | Depends |
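One way to build intuition for the LoRA/QLoRA rows: the number of trainable parameters depends on the adapter rank `r`, not on the full model size. The dimensions below are illustrative, not measured from any particular model:

```python
# Rough estimate of LoRA adapter size. Each adapted weight matrix gains two
# low-rank factors: A (r x d_model) and B (d_model x r), i.e. 2 * d_model * r
# trainable parameters per targeted module.
def lora_trainable_params(d_model=4096, r=16, n_target_modules=64):
    per_module = 2 * d_model * r
    return per_module * n_target_modules

# With these illustrative numbers: ~8.4M trainable parameters,
# a tiny fraction of a multi-billion-parameter base model.
```

This is why the table lists LoRA as "Single GPU": only the adapters need gradients and optimizer state, while the frozen base weights can additionally be quantized (QLoRA) to fit consumer hardware.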
Evaluation Framework
```python
class FineTuneEvaluator:
    def evaluate(self, model, test_set: Dataset) -> EvaluationReport:
        metrics = {
            "accuracy": self.measure_accuracy(model, test_set),
            "format_compliance": self.check_output_format(model, test_set),
            "domain_accuracy": self.domain_specific_eval(model, test_set),
            "safety": self.safety_evaluation(model),
            "latency": self.measure_latency(model),
            "regression": self.check_regression(model, self.base_model),
        }
        return EvaluationReport(
            metrics=metrics,
            passed=all(
                metrics[name] > threshold
                for name, threshold in self.thresholds.items()
            ),
            recommendations=self.generate_recommendations(metrics),
        )
```
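The pass/fail gate in the evaluator can be made concrete. The metric names and threshold values below are hypothetical examples, not recommended targets:

```python
# Standalone version of the threshold gate used in FineTuneEvaluator.
def gate(metrics, thresholds):
    """Pass only if every thresholded metric clears its minimum."""
    return all(metrics[name] > minimum for name, minimum in thresholds.items())

metrics = {"accuracy": 0.91, "format_compliance": 0.98, "safety": 0.99}
thresholds = {"accuracy": 0.85, "format_compliance": 0.95, "safety": 0.98}
```

Keeping thresholds per-metric (rather than a single aggregate score) means a safety or regression failure blocks release even when average quality looks good.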
Technologies for Fine-tuning
- Frameworks: Hugging Face Transformers, TRL, Axolotl
- PEFT: LoRA, QLoRA, Prefix Tuning
- Managed: OpenAI Fine-tuning API, Vertex AI
- Infrastructure: Modal, RunPod, Lambda Labs
- Evaluation: LM Eval Harness, custom benchmarks
Frequently Asked Questions
What is LLM fine-tuning?
LLM fine-tuning trains a pre-trained language model on your specific data to improve performance for your use case. This creates a specialized model that understands your domain, tone, and requirements better than general-purpose models.
How much does LLM fine-tuning cost?
Fine-tuning development typically costs $120-180 per hour. A basic fine-tuning project starts around $15,000-30,000, while production fine-tuning with evaluation and iteration ranges from $40,000-100,000+. Compute costs for training are separate.
When should I fine-tune vs use prompting?
Use prompting for most applications: when you lack training data or need flexibility. Use fine-tuning for consistent output style, domain-specific terminology, reduced token usage, or when prompting hits its limits. Treat fine-tuning as a last resort, not a first choice.
What data do I need for fine-tuning?
Typically: 100-10,000 high-quality examples in prompt-completion format. Quality matters more than quantity. I help prepare datasets: cleaning, formatting, generating synthetic examples, and ensuring diversity.
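For illustration, "prompt-completion format" usually means one JSON object per line (JSONL). The records below are invented examples:

```python
# Build a small JSONL training file: one prompt-completion pair per line.
import json

examples = [
    {"prompt": "Summarize: The meeting covered Q3 targets and hiring plans.",
     "completion": "Q3 targets were reviewed; hiring plans discussed."},
    {"prompt": "Classify sentiment: 'Support was slow but helpful.'",
     "completion": "mixed"},
]

# One json.dumps per example, newline-separated; write this string to train.jsonl
jsonl = "\n".join(json.dumps(ex) for ex in examples)
```

Chat-style APIs use a `messages` list per line instead, but the principle is the same: one self-contained training example per record.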
Which models can be fine-tuned?
I work with: OpenAI GPT models (API fine-tuning), open models (Llama, Mistral via LoRA/QLoRA), and cloud platforms (Vertex AI, Amazon Bedrock). The choice depends on: cost, ownership requirements, and deployment constraints.
Related Technologies: OpenAI, Claude, LangChain, Python