16. Reproducibility
Reproducibility ensures that research findings can be independently verified and built upon. Local AI supports reproducibility by keeping analysis pipelines consistent and by documenting the computational environment in which they ran.
Environment Specification
Capture complete computational context:
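# Environment documentation
import importlib.metadata
import platform
import sys

def generate_environment_specification(seed, cpu_info=None, gpu_info=None, ram_info=None):
    """Document computational environment for reproducibility."""
    # Installed distributions as name==version, comparable to `pip freeze`
    packages = sorted(
        f"{dist.metadata.get('Name')}=={dist.version}"
        for dist in importlib.metadata.distributions()
    )
    spec = {
        'python_version': sys.version,
        'platform': platform.platform(),
        'packages': packages,
        'hardware': {
            # Pass in values gathered with your own tooling; the CPU fallback
            # uses only what the standard library can report.
            'cpu': cpu_info or platform.processor() or platform.machine(),
            'gpu': gpu_info,
            'memory': ram_info
        },
        # Generator state (random.getstate(), numpy.random.get_state()) does
        # not contain the original seed, so record the seed that was set.
        'random_seeds': {
            'python': seed,
            'numpy': seed,
            'torch': seed
        }
    }
    return spec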
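A specification only helps if it is stored next to the results it describes. The following is a minimal sketch, using only the standard library, of writing such a snapshot to a JSON file. The names `collect_environment` and `write_environment_snapshot` and the output location are illustrative assumptions, not part of the chapter's code.

```python
import importlib.metadata
import json
import platform
import sys
import tempfile
from pathlib import Path


def collect_environment():
    """Collect a JSON-serializable description of the running interpreter."""
    packages = sorted(
        f"{dist.metadata.get('Name', 'unknown')}=={dist.version}"
        for dist in importlib.metadata.distributions()
    )
    return {
        "python_version": sys.version,
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": packages,
    }


def write_environment_snapshot(path):
    """Write the environment description to `path` and return the dict."""
    spec = collect_environment()
    Path(path).write_text(
        json.dumps(spec, indent=2, sort_keys=True), encoding="utf-8"
    )
    return spec


if __name__ == "__main__":
    # Hypothetical output location; a real project would keep this file
    # alongside the results it documents.
    out = Path(tempfile.gettempdir()) / "environment.json"
    spec = write_environment_snapshot(out)
    print(f"Recorded {len(spec['packages'])} packages to {out}")
```

Sorting the package list and the JSON keys keeps the file stable between runs, so a plain text diff shows when the environment has changed.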
Pipeline Documentation
Analysis pipelines require explicit documentation:
# Reproducible pipeline template
def create_reproducible_pipeline(steps, data_sources, outputs):
    """Generate documented analysis pipeline."""
    prompt = f"""Document this analysis pipeline for reproducibility:
Pipeline Steps:
{steps}
Data Sources:
{data_sources}
Expected Outputs:
{outputs}
Include: (1) overall workflow diagram description,
(2) dependencies between steps, (3) input/output specifications,
(4) intermediate file handling, (5) validation checkpoints."""
    # local_model is assumed to be the local inference client defined elsewhere
    documentation = local_model.generate(prompt)
    return documentation
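Generated prose documentation can drift from what actually ran, so it may help to pair it with a machine-checkable record. The sketch below hashes each step's input and output files into a manifest that a later run can be checked against, which covers the input/output specifications and validation checkpoints listed in the prompt. The helper names (`sha256_of_file`, `build_manifest`, `verify_manifest`) and the manifest layout are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(steps):
    """Map each step name to checksums of its input and output files.

    `steps` is a list of dicts with the keys 'name', 'inputs', 'outputs'.
    """
    manifest = {}
    for step in steps:
        manifest[step["name"]] = {
            "inputs": {str(p): sha256_of_file(p) for p in step["inputs"]},
            "outputs": {str(p): sha256_of_file(p) for p in step["outputs"]},
        }
    return manifest


def verify_manifest(manifest):
    """Return (step, path) pairs whose file is missing or whose hash changed."""
    mismatches = []
    for step_name, record in manifest.items():
        for section in ("inputs", "outputs"):
            for path, expected in record[section].items():
                if not Path(path).exists() or sha256_of_file(path) != expected:
                    mismatches.append((step_name, path))
    return mismatches
```

An empty list from `verify_manifest` indicates that every recorded file still matches its checksum; any entry points to the step whose data should be regenerated or investigated.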
Containerization Support
Docker containers encapsulate complete environments:
# Research environment container
FROM python:3.11-slim
WORKDIR /research

# Install system dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    git-lfs \
    && rm -rf /var/lib/apt/lists/*

# Pin package versions from pip freeze
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy research code
COPY . .

# Fix Python hash randomization and request deterministic cuBLAS workspaces
ENV PYTHONHASHSEED=42
ENV CUBLAS_WORKSPACE_CONFIG=:16:8
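A container is only as reproducible as its `requirements.txt`: an unpinned line lets `pip` resolve a different version on each build. As a hedged sketch, the check below flags requirement lines that lack an exact `==` pin. The function name `find_unpinned` is hypothetical, the versions in the example are placeholders, and the parsing deliberately handles only simple lines rather than the full requirements file syntax.

```python
def find_unpinned(requirements_text):
    """Return requirement lines that are not pinned to one exact version."""
    unpinned = []
    for raw_line in requirements_text.splitlines():
        # Drop inline comments and surrounding whitespace
        line = raw_line.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            # Skip blanks and pip options such as '-r other.txt'
            continue
        # Ignore environment markers after ';' when looking for the pin
        requirement = line.split(";", 1)[0].strip()
        # A wildcard such as '==2.2.*' still matches many versions
        if "==" not in requirement or requirement.endswith(".*"):
            unpinned.append(line)
    return unpinned


if __name__ == "__main__":
    example = """
    numpy==1.26.4
    pandas>=2.0        # floating lower bound
    scipy
    torch==2.2.*
    """
    for line in find_unpinned(example):
        print("not pinned:", line)
```

Running a check like this before `docker build` is one way to catch a floating dependency early, before it produces two images that differ silently.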
Random Seed Management
Ensure stochastic processes are reproducible:
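import os
import random

import numpy
import torch

def set_global_seeds(seed=42):
    """Set all random seeds for reproducibility."""
    random.seed(seed)
    numpy.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)      # current GPU; ignored when CUDA is unavailable
    torch.cuda.manual_seed_all(seed)  # every visible GPU
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Hash randomization is fixed at interpreter startup, so this assignment
    # only affects subprocesses. Export PYTHONHASHSEED before launching Python
    # (as the Dockerfile does) to cover the current process.
    os.environ['PYTHONHASHSEED'] = str(seed)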
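Global seeding makes a whole run repeatable, but any added or reordered call that consumes random numbers shifts every later draw. One possible complement, sketched below with the standard library only, is to derive an independent generator for each pipeline component from a single master seed. The functions `derive_seed` and `derive_rng` and the component names are illustrative assumptions.

```python
import hashlib
import random


def derive_seed(master_seed, component):
    """Derive a stable 64-bit integer seed from a master seed and a name."""
    payload = f"{master_seed}:{component}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(payload).digest()[:8], "big")


def derive_rng(master_seed, component):
    """Return a dedicated random.Random instance for one pipeline component."""
    return random.Random(derive_seed(master_seed, component))


if __name__ == "__main__":
    split_rng = derive_rng(42, "train_test_split")
    augment_rng = derive_rng(42, "augmentation")

    indices = list(range(10))
    split_rng.shuffle(indices)  # draws only from the split generator
    noise = [augment_rng.random() for _ in range(3)]

    # Re-deriving the generator reproduces the same shuffle, no matter
    # how many numbers the other components have drawn in the meantime.
    repeat = list(range(10))
    derive_rng(42, "train_test_split").shuffle(repeat)
    print(indices == repeat)
```

The sketch uses `hashlib` rather than the built-in `hash()` because string hashes vary between interpreter sessions unless `PYTHONHASHSEED` is fixed, which would make the derived seeds unstable.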
Local Verification Checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
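The record described above can also be captured programmatically. This is a minimal sketch, assuming the example under test is wrapped in a callable. The function `record_checkpoint`, its parameters, the field names, and the sample data path are all hypothetical.

```python
import importlib.metadata
import json
import platform
import time


def record_checkpoint(name, func, data_path=None, packages=(), notes=""):
    """Run `func()` and return a dict describing what was run and observed."""
    versions = {}
    for package in packages:
        try:
            versions[package] = importlib.metadata.version(package)
        except importlib.metadata.PackageNotFoundError:
            versions[package] = None  # not installed in this environment

    start = time.perf_counter()
    output = func()
    elapsed = time.perf_counter() - start

    return {
        "name": name,
        "python": platform.python_version(),
        "package_versions": versions,
        "data_path": str(data_path) if data_path is not None else None,
        "runtime_seconds": round(elapsed, 4),
        "observed_output": repr(output),
        "constraints": notes,  # e.g. model size, backend, or memory limits
    }


if __name__ == "__main__":
    record = record_checkpoint(
        "smallest_example",
        lambda: sum(range(100)),
        data_path="data/sample.csv",  # placeholder path
        packages=["pip"],
        notes="CPU only",
    )
    print(json.dumps(record, indent=2))
```

Storing the output as `repr(output)` keeps the record serializable for arbitrary return values, at the cost of comparing text rather than objects when two runs are checked against each other.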
Create a complete reproducibility package that documents the computational environment, generates a Dockerfile for the analysis pipeline, includes seed management for all random processes, and produces a supplementary reproducibility report suitable for journal submission.