09. Data Formatting
Formatting determines how the model interprets training examples. The model learns to produce outputs that match the formatting it observes in training data. Inconsistent or confusing formatting degrades the model's ability to follow instructions correctly.
For instruction-tuning, a common format includes system, user, and assistant message roles with clear delimiters. The model learns that content between specific markers represents different message types and should be handled differently. This segmentation teaches the model conversational structure.
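Before any delimiters are applied, it can help to check that the raw messages follow the role structure the format expects. The following is a minimal sketch, assuming a convention where an optional system message comes first and user and assistant turns then alternate; the function name validate_messages and the exact rules are illustrative assumptions, not requirements of any particular model.

```python
from typing import Dict, List

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_messages(messages: List[Dict[str, str]]) -> List[str]:
    """Return a list of problems found; an empty list means no issue was detected."""
    problems = []
    expected = "user"  # assumed convention: the first non-system turn is a user turn
    for i, msg in enumerate(messages):
        role = msg.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {i}: unknown role {role!r}")
            continue
        if role == "system":
            if i != 0:
                problems.append(f"message {i}: system message must come first")
            continue
        if role != expected:
            problems.append(f"message {i}: expected {expected}, got {role}")
        # After a user turn we expect an assistant turn, and vice versa.
        expected = "assistant" if role == "user" else "user"
    return problems


good = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]
bad = [
    {"role": "user", "content": "Hi"},
    {"role": "user", "content": "Are you there?"},
]
print(validate_messages(good))
print(validate_messages(bad))
```

Running a check like this over the dataset before formatting catches malformed examples early, rather than letting them surface as odd model behavior after training.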
The chosen format should match the format the model will encounter at inference time. Fine-tuning on one format and prompting with another creates a mismatch that degrades output quality. When deploying adapters, ensure the inference code uses exactly the same formatting conventions, including whitespace and newlines.
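One way to keep training and inference aligned is to build both from a single prompt function. The sketch below assumes a simple instruction/response layout; PROMPT_TEMPLATE, EOS, and the two function names are hypothetical and would be replaced by whatever template and end-of-sequence token the target model actually uses.

```python
EOS = "</s>"  # assumed end-of-sequence marker; use your tokenizer's eos_token

# Single source of truth for the prompt layout (hypothetical template).
PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n"


def build_inference_prompt(instruction: str) -> str:
    """Prompt sent to the model at inference time."""
    return PROMPT_TEMPLATE.format(instruction=instruction)


def build_training_text(instruction: str, response: str) -> str:
    """Training text: the exact inference prompt followed by the target response."""
    return build_inference_prompt(instruction) + response + EOS


prompt = build_inference_prompt("Name the capital of France.")
train_text = build_training_text("Name the capital of France.", "Paris.")
print(train_text.startswith(prompt))  # True by construction
```

Because the training text is defined as the inference prompt plus the response, the two cannot drift apart when the template is edited later.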
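Special tokens play a crucial role in formatting. Markers such as <s>, </s>, [INST], and [/INST] delimit different content types. Some of them (for example <s> and </s>) are true special tokens with their own vocabulary entries, while others (for example [INST] and [/INST] in Llama 2) are ordinary text that the tokenizer splits into several tokens. Either way, the tokenizer must encode each marker the same way during training and inference. Most instruction-tuned models already ship with the special tokens their template needs; if you add new ones, the model's embedding matrix must be resized to match the enlarged vocabulary.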
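A quick way to see how a tokenizer treats each marker is to count the token ids it produces for the marker on its own. The sketch below is tokenizer-agnostic: marker_token_counts is a hypothetical helper that accepts any encode callable, and toy_encode is a stand-in used only so the example runs without downloading a model. With a Hugging Face tokenizer, the callable could be something like lambda s: tokenizer.encode(s, add_special_tokens=False).

```python
from typing import Callable, Dict, List


def marker_token_counts(
    encode: Callable[[str], List[int]],
    markers: List[str],
) -> Dict[str, int]:
    """Return how many token ids each marker string encodes to."""
    return {marker: len(encode(marker)) for marker in markers}


def toy_encode(text: str) -> List[int]:
    """Stand-in tokenizer: known specials map to one id, anything else to one id per character."""
    specials = {"<s>": 1, "</s>": 2}
    if text in specials:
        return [specials[text]]
    return [ord(ch) for ch in text]


counts = marker_token_counts(toy_encode, ["<s>", "</s>", "[INST]"])
# Markers that do not map to exactly one token are treated as plain text.
split_markers = [marker for marker, n in counts.items() if n != 1]
print(split_markers)
```

A marker that splits into several tokens is not necessarily a problem, but it does mean the model sees it as ordinary text, so spacing around it has to be reproduced exactly.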
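Conversation formats vary across model families. Llama 2 chat models use a template built on [INST] and [/INST] markers, with the system prompt wrapped in <<SYS>> tags inside the first instruction block. Mistral instruct models use similar conventions with variations. Vicuna and related models use plain-text role prefixes such as USER: and ASSISTANT: instead. Training on the wrong format for a given model produces poor results, so check the model card or the tokenizer's bundled chat template before writing a formatter.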
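To keep family-specific formats from leaking into the rest of a pipeline, the templates can be isolated behind a small lookup table. This is a minimal single-turn sketch; the style names and the exact strings are simplified approximations chosen for illustration, and the authoritative template for any real model should be taken from its documentation.

```python
from typing import Callable, Dict


def inst_style(user: str) -> str:
    """Approximation of an [INST]-delimited prompt (Llama 2 / Mistral family)."""
    return f"[INST] {user} [/INST]"


def user_assistant_style(user: str) -> str:
    """Approximation of a role-prefix prompt (Vicuna family)."""
    return f"USER: {user} ASSISTANT:"


# Hypothetical registry: one entry per supported prompt style.
PROMPT_STYLES: Dict[str, Callable[[str], str]] = {
    "inst": inst_style,
    "user_assistant": user_assistant_style,
}


def render_prompt(style: str, user: str) -> str:
    """Render a single user turn in the requested style."""
    if style not in PROMPT_STYLES:
        raise ValueError(f"Unknown prompt style: {style!r}")
    return PROMPT_STYLES[style](user)


print(render_prompt("inst", "What is the capital of France?"))
print(render_prompt("user_assistant", "What is the capital of France?"))
```

Failing loudly on an unknown style is deliberate: silently falling back to a default template is exactly the kind of mismatch this section warns about.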
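Handling multi-turn conversations requires deciding how much context to keep. You can include the full conversation history (higher memory, better context) or only the current turn (lower memory, less context). Most fine-tuning pipelines also enforce a maximum sequence length. Note that plain tokenizer truncation typically cuts tokens from the end of the sequence by default, which removes the most recent content, so dropping older turns usually has to be done explicitly before tokenization.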
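Dropping the oldest turns first can be sketched as a pre-tokenization step. The helper below is an illustration under stated assumptions: keep_recent_turns is a hypothetical name, count_tokens is any callable that estimates the token cost of a string (here a whitespace split stands in for a real tokenizer), and the per-turn cost ignores template markers.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def keep_recent_turns(
    messages: List[Message],
    max_tokens: int,
    count_tokens: Callable[[str], int],
) -> List[Message]:
    """Keep system messages plus as many of the most recent turns as fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(count_tokens(m["content"]) for m in system)

    kept: List[Message] = []
    for msg in reversed(turns):  # walk from newest to oldest
        cost = count_tokens(msg["content"])
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    kept.reverse()

    # Avoid starting the history on an orphaned assistant reply.
    while kept and kept[0]["role"] != "user":
        kept.pop(0)
    return system + kept


conversation = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "first question here"},
    {"role": "assistant", "content": "first answer here"},
    {"role": "user", "content": "second question"},
    {"role": "assistant", "content": "second answer"},
]
trimmed = keep_recent_turns(conversation, max_tokens=7, count_tokens=lambda s: len(s.split()))
print([m["content"] for m in trimmed])
```

With a real tokenizer the budget should also leave room for the template's own markers, which this sketch does not count.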
Implement a formatter that converts raw conversation data into tokenized sequences for a specific model family. Verify the output matches expected special token placement.
# data_formatter.py
from typing import Dict, List


class ChatFormatter:
    """Format conversations for instruction-tuning."""

    def __init__(
        self,
        tokenizer,
        # Note: the {system} slot of this template receives the instruction text.
        system_template: str = "Below is an instruction that describes a task, paired with an input that provides further context.\n\n### Instruction:\n{system}\n\n### Input:\n{input}\n\n### Response:\n{response}",
        system_message: str = "You are a helpful assistant."
    ):
        self.tokenizer = tokenizer
        self.system_template = system_template
        self.system_message = system_message

    def format_single_turn(
        self,
        instruction: str,
        input_text: str,
        response: str
    ) -> Dict[str, str]:
        """Format a single instruction-input-response example."""
        formatted = self.system_template.format(
            system=instruction,
            # Use a placeholder when the example has no input
            input=input_text if input_text else "N/A",
            response=response
        )
        return {"text": formatted}

    def format_conversation(
        self,
        messages: List[Dict[str, str]],
        add_generation_prompt: bool = True
    ) -> str:
        """
        Format a multi-turn conversation using model-specific tokens.
        Example for Llama 2 chat style models.

        In this style the closing [/INST] of the last user turn already cues
        the assistant, so add_generation_prompt appends nothing; the flag is
        kept for interface parity with templates that need an explicit marker.
        """
        result = ""
        pending_system = ""
        for msg in messages:
            role = msg.get("role", "user")
            content = msg["content"]
            if role == "system":
                # The system block belongs inside the next [INST] ... [/INST]
                pending_system = f"<<SYS>>\n{content}\n<</SYS>>\n\n"
            elif role == "user":
                result += f"[INST] {pending_system}{content} [/INST]"
                pending_system = ""
            elif role == "assistant":
                # The end-of-sequence token closes each assistant reply
                result += f" {content}</s>"
        return result.strip()

    def tokenize_for_training(
        self,
        example: Dict[str, str],
        max_length: int = 2048
    ) -> Dict[str, List[int]]:
        """
        Tokenize formatted text for training.
        Returns input_ids with labels (prompt and padding tokens masked).
        """
        text = example["text"]
        # Tokenize the entire sequence. padding="max_length" requires the
        # tokenizer to have a pad token set.
        tokenized = self.tokenizer(
            text,
            truncation=True,
            max_length=max_length,
            padding="max_length",
            return_tensors=None
        )
        input_ids = tokenized["input_ids"]
        # Find where the response starts
        response_marker = "### Response:\n"
        response_start = text.find(response_marker)
        if response_start == -1:
            # Mask entire sequence if no response marker found
            tokenized["labels"] = [-100] * len(input_ids)
            return tokenized
        # Count the tokens in everything up to and including the marker.
        # This is approximate and tokenizer-dependent: it assumes the prefix
        # tokenizes the same way alone as inside the full text, that encode()
        # adds the same leading special tokens, and that padding is on the right.
        prefix = text[:response_start + len(response_marker)]
        prefix_tokens = len(self.tokenizer.encode(prefix))
        # Create labels: keep response tokens, mask prompt and padding with -100
        attention_mask = tokenized["attention_mask"]
        labels = [
            token_id if i >= prefix_tokens and attention_mask[i] == 1 else -100
            for i, token_id in enumerate(input_ids)
        ]
        tokenized["labels"] = labels
        return tokenized
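

# Verify formatting
def verify_formatting(formatter: ChatFormatter, tokenizer):
    """Verify special tokens are handled correctly."""
    example_messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."}
    ]
    formatted = formatter.format_conversation(example_messages)
    print("Formatted conversation:")
    print(formatted)
    print()
    # The markers are already in the text, so do not let the tokenizer add more
    tokens = tokenizer.encode(formatted, add_special_tokens=False)
    # Print the token strings so marker placement can be inspected by eye
    print("Tokens:", tokenizer.convert_ids_to_tokens(tokens))
    decoded = tokenizer.decode(tokens)
    # Some tokenizers normalize whitespace, so a mismatch here is a prompt
    # to inspect the output rather than proof of a formatting bug
    print("Re-decoded matches original:", formatted.strip() == decoded.strip())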