COURSE · OPS · A020

Capstone: Research AI System

Learn how to build a capstone research AI system through RunLocalAI's practical lens: capstone planning, research design, paper writing and ablation, hardware fit, runtime settings, verification habits and local-vs-cloud tradeoffs.

18 chapters · 24h · Operator track · By Fredoline Eruo
PREREQUISITES
  • A002
  • A005
  • A006

Why this course matters

Capstone: Research AI System is for operators making local AI reliable, measurable and cheaper to run. It connects capstone planning, research design, paper writing, ablation and open-source release to the questions RunLocalAI wants every reader to answer before they install, upgrade or scale a model: will it run, what will it cost in memory, what setting changes the result, and how do you verify the answer instead of trusting a demo?

What you will be able to do

By the end, you should be able to explain the main tradeoffs in plain language, choose a safe next experiment, and use the chapter exercises as a repeatable operator checklist. The course favors local evidence, hardware fit, context limits, latency and failure modes over generic AI vocabulary.

How to use this course

Start at chapter one if the topic is new. If you already have a working stack, scan for chapters such as Capstone Overview, Research Question, Related Work and Novel Architecture and use those lessons as a quality-control pass before changing a workstation, team workflow or production-like local deployment.

CHAPTERS
  1. 01 Capstone Overview (10 min)

A successful capstone project requires aligning novelty, rigor, and reproducibility from day one, not as afterthoughts.

This course guides you through building a complete research AI system, from initial idea to published artifact. Unlike tutorial-based courses, the capstone demands genuine contribution: something that advances the state of knowledge in your chosen area.

The project lifecycle has four phases:

1. **Formulation** (Chapters 1-3): Define the research question, survey related work, and establish your contribution's novelty.
2. **Construction** (Chapters 4-5): Design the architecture and implement the system.
3. **Evaluation** (Chapters 6-9): Select baselines, design experiments, and analyze results.
4. **Communication** (Chapters 10-18): Write the paper, release code, and prepare for replication.

**Novelty vs. Increment:** Your contribution need not be paradigm-shifting. Modifying a transformer attention mechanism with a new positional encoding scheme counts as valid novelty. What matters is that the change is intentional, justified by theory or observation, and empirically validated.

**Common Pitfall:** Many operators spend months building an impressive system only to realize they cannot properly evaluate it because no established baselines exist for their task. Chapter 6 covers baseline selection, which you should settle before implementation begins.

**Deliverable:** By the end of this course, you will produce:

- A novel architecture with documented design decisions
- An open-source implementation
- An ablation study isolating contribution components
- A quantitative evaluation against strong baselines
- A qualitative analysis explaining behavioral differences
- A camera-ready paper draft
  2. 02 Research Question (10 min)

A well-scoped research question is specific enough to answer within your timeline but broad enough to generalize beyond your specific dataset.

The research question anchors everything. It determines which baselines you compare against, which metrics you optimize for, and how reviewers will evaluate your contribution's significance.

**Characteristics of Good Research Questions:**

**Specificity:** "Can attention heads be replaced with linear projections?" is answerable. "Will transformers work better?" is not.

**Measurability:** You must define a quantitative signal that indicates progress. This typically means a metric on a benchmark dataset.

**Scope Control:** "Improving accuracy by 2% on ImageNet" is achievable in 3 months. "Solving protein folding" is not.

**Example Transformation:**

Weak: "How can we make language models more efficient?"

Strong: "Can sparse mixture-of-expert routing with top-2 selection reduce FLOPs by 40% while maintaining within 1% accuracy on WMT'14 EN-DE?"

**Failure Mode:** Many operators pick research questions that require proprietary data or compute resources they cannot access. Validate feasibility before committing.

**Evaluating Question Quality:**

| Dimension | Poor Question | Strong Question |
|-----------|---------------|-----------------|
| Specificity | "improve performance" | "reduce inference latency by 2x" |
| Measurability | "intuitively better" | "BLEU score on standard split" |
| Feasibility | "requires 1000 GPUs" | "runs on single A100" |
| Novelty | "well-studied for 5 years" | "unexplored combination" |
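
The strong question above bundles two thresholds, a compute target and a quality floor, so it can be written down as a pass/fail check before any experiment runs. The following is a minimal sketch: the function name, parameter names and sample numbers are illustrative assumptions, and "within 1%" is read here as a drop of at most one absolute point of the task metric.

```python
def meets_target(baseline_flops, new_flops, baseline_score, new_score,
                 min_flops_reduction=0.40, max_score_drop=1.0):
    """Return (passed, flops_reduction, score_drop) for a two-threshold question."""
    flops_reduction = 1.0 - new_flops / baseline_flops
    score_drop = baseline_score - new_score
    passed = (flops_reduction >= min_flops_reduction
              and score_drop <= max_score_drop)
    return passed, flops_reduction, score_drop


# Hypothetical measurements, for illustration only.
passed, reduction, drop = meets_target(
    baseline_flops=10.0e9, new_flops=5.5e9,
    baseline_score=27.3, new_score=26.8,
)
print(f"FLOPs reduction: {reduction:.0%}, score drop: {drop:.1f}, passed: {passed}")
```

Writing the thresholds as defaults keeps the success criterion fixed in one place, which makes it harder to move the goalposts after results arrive.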
  3. 03 Related Work (10 min)

Related work sections fail when they merely list papers; they succeed when they position your contribution as the logical next step in a progression of ideas.

A well-written related work section accomplishes three goals: (1) establishes the current state of the art, (2) identifies gaps your work addresses, and (3) demonstrates familiarity with relevant literature.

**Structure:**

1. **Task Context:** What is the broader problem you're solving? Where does it appear in applications?
2. **Category-Based Survey:** Group prior work by approach or technique. This creates a taxonomy that helps readers understand the landscape.
3. **Gap Identification:** For each category, note limitations. Your contribution fills these gaps.
4. **Positioning:** Explain how your approach differs from the closest related work.

**Practical Guidelines:**

- Read 15-20 papers directly related to your question. Use Semantic Scholar or Google Scholar with citation tracking.
- Create a table comparing methods across dimensions relevant to your research question.
- Identify the single most related work: your primary baseline. State explicitly why you're building on or diverging from it.

**Example Paragraph:**

"Mixture-of-expert models (Shazeer et al., 2017) scale parameters without proportionally increasing compute. Subsequent work (Lepikhin et al., 2020) applied this to machine translation, achieving 3x inference speedup through conditional computation. However, these approaches require all experts to be active during training, limiting efficiency. Our work applies expert pruning at training initialization, reducing active parameters by 50% while maintaining convergence rates."

**Failure Mode:** The "laundry list" error: describing each paper in isolation without synthesizing common themes or showing the logical progression to your work.
  4. 04 Novel Architecture (15 min)

Architecture design should follow from theory or observation, never from arbitrary intuition. Document the reasoning path from problem to solution.

Your novel architecture is the concrete instantiation of your research question. It consists of modules, connections, and modifications to existing approaches. The design process must be principled.

**Design Methodology:**

1. **Identify Constraints:** What does your task require? What computational budget exists?
2. **Select Base Architecture:** Start from a proven foundation appropriate for your domain.
3. **Hypothesize Modifications:** What specific change addresses your research question?
4. **Validate Intuitions:** Run small-scale experiments before committing to full implementation.

**Example: Novel Architecture for Efficient Attention**

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


# Base: Standard multi-head attention
class StandardAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)

    def forward(self, x):
        Q, K, V = self.W_q(x), self.W_k(x), self.W_v(x)
        # O(n^2) attention computation (heads left unsplit to keep the sketch short)
        return F.scaled_dot_product_attention(Q, K, V)


# Novel: Linear attention approximation
class LinearAttention(nn.Module):
    """Replace O(n^2) with O(n) by approximating attention kernel."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        self.feature_dim = d_model // num_heads
        # Assumption: phi_W and W_v were not defined in the original sketch.
        self.phi_W = nn.Parameter(torch.randn(d_model, self.feature_dim) * d_model ** -0.5)
        self.W_v = nn.Linear(d_model, d_model)

    def forward(self, x):
        # Compute feature maps: O(n * d); the epsilon keeps the normalizer positive
        phi_x = F.relu(x @ self.phi_W) + 1e-6  # [batch, seq, feature_dim]
        # Softmax-free update rule: O(n * feature_dim)
        return self._linear_attention_update(phi_x, self.W_v(x))

    def _linear_attention_update(self, phi_x, v):
        # One possible update rule (an assumption): summarize keys and values
        # once, then reuse that summary for every query position.
        kv = torch.einsum("bnf,bnd->bfd", phi_x, v)
        z = phi_x.sum(dim=1)
        num = torch.einsum("bnf,bfd->bnd", phi_x, kv)
        den = torch.einsum("bnf,bf->bn", phi_x, z).unsqueeze(-1)
        return num / den
```

**Common Failure Modes:**

- **Overcomplication:** Adding components that don't directly serve the research question creates confounding variables for ablation studies.
- **Underdocumentation:** Without clear documentation of design decisions, later chapters become speculation rather than analysis.
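
The "validate intuitions" step can be as small as a script that checks the core identity behind a design before any training code exists. The sketch below uses only the standard library and is an illustration, not the course's implementation: it assumes the non-causal kernelized form of attention with the feature map elu(x) + 1, and confirms that the O(n) "summarize once" ordering gives the same output as the O(n^2) "every query visits every key" ordering. All function names are hypothetical.

```python
import math
import random


def phi(vec):
    """Feature map elu(x) + 1, which keeps every feature strictly positive."""
    return [x + 1.0 if x > 0 else math.exp(x) for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def quadratic_attention(Q, K, V):
    """O(n^2): every query visits every key."""
    out = []
    for q in Q:
        weights = [dot(phi(q), phi(k)) for k in K]
        total = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, V)) / total
                    for c in range(len(V[0]))])
    return out


def linear_attention(Q, K, V):
    """O(n): summarize keys and values once, then reuse the summary per query."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k_j) v_j^T
    z = [0.0] * d_k                        # running sum of phi(k_j)
    for k, v in zip(K, V):
        fk = phi(k)
        for r in range(d_k):
            z[r] += fk[r]
            for c in range(d_v):
                S[r][c] += fk[r] * v[c]
    out = []
    for q in Q:
        fq = phi(q)
        denom = dot(fq, z)
        out.append([sum(fq[r] * S[r][c] for r in range(d_k)) / denom
                    for c in range(d_v)])
    return out


rng = random.Random(0)
n, d = 6, 4
Q, K, V = ([[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
           for _ in range(3))
max_diff = max(
    abs(a - b)
    for row_q, row_l in zip(quadratic_attention(Q, K, V), linear_attention(Q, K, V))
    for a, b in zip(row_q, row_l)
)
print(f"max difference between the two orderings: {max_diff:.2e}")
```

The two functions agree up to floating-point error because matrix multiplication is associative; what the linear form gives up is the softmax kernel itself, which is the part a real experiment has to evaluate.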
  5. 05 Implementation (15 min)

Implement incrementally. Verify correctness at each step before proceeding. Debugging a fully written system is far harder than debugging modular components.

Implementation transforms architecture diagrams into reproducible artifacts. The codebase must be: (1) correct, (2) efficient, (3) documented, and (4) open-sourced.

**Incremental Implementation Strategy:**

```
Phase 1: Data Pipeline
├── Load dataset
├── Verify tokenization/masking
├── Check distribution statistics
└── Save verification artifacts

Phase 2: Architecture Modules
├── Implement single module in isolation
├── Unit test with known inputs
├── Verify gradient flow
└── Profile memory usage

Phase 3: Training Loop
├── Minimal training (100 steps)
├── Verify loss decreases
├── Checkpoint saving
└── Learning rate scheduling

Phase 4: Full Experiment
├── Reproduce baseline
├── Add novel components
└── Log all hyperparameters
```

**Critical Implementation Details:**

```python
import random

import numpy as np
import torch

# Logging configuration for reproducibility
experiment_config = {
    "seed": 42,  # Fixed for reproducibility
    "model_dim": 512,
    "lr": 1e-4,
    "batch_size": 32,
    "gradient_accumulation_steps": 4,
    "effective_batch_size": 128,  # batch_size * gradient_accumulation_steps
    "warmup_steps": 1000,
    "total_steps": 50000,
}


# Reproducibility boilerplate
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


# Config validation
def validate_config(config):
    required_keys = ["seed", "model_dim", "lr", "batch_size"]
    for key in required_keys:
        if key not in config:
            raise ValueError(f"Missing required config key: {key}")
```

**Documentation Requirements:**

- README.md with setup instructions, dependencies, and quick-start
- Docstrings for all classes and public functions
- Configuration schema documenting all hyperparameters
- Environment file (requirements.txt or environment.yml)
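
Presence checks like the one above do not catch values that contradict each other. A minimal sketch of a consistency check for the batch-size fields follows; the function name is hypothetical, and it assumes single-device training (with data parallelism the device count would be a third factor).

```python
def check_batch_config(config):
    """Raise if the logged effective batch size disagrees with its factors."""
    expected = config["batch_size"] * config["gradient_accumulation_steps"]
    if config["effective_batch_size"] != expected:
        raise ValueError(
            f"effective_batch_size={config['effective_batch_size']} but "
            f"batch_size * gradient_accumulation_steps = {expected}"
        )
    if config["warmup_steps"] >= config["total_steps"]:
        raise ValueError("warmup_steps must be smaller than total_steps")
    return expected


# Same values as the experiment configuration shown in this chapter.
experiment_config = {
    "batch_size": 32,
    "gradient_accumulation_steps": 4,
    "effective_batch_size": 128,
    "warmup_steps": 1000,
    "total_steps": 50000,
}
print(check_batch_config(experiment_config))  # 32 * 4 = 128
```

Running a check like this at startup costs nothing and prevents a paper from reporting a batch size that the training run never used.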
  6. 06 Baseline Selection (10 min)

Baseline selection is a strategic decision that shapes how reviewers perceive your contribution. Select baselines that are (a) strong, (b) relevant, and (c) reproducible.

Choosing baselines that are too weak makes your improvements appear artificially large. Choosing baselines that are too strong may require proprietary resources. The art lies in finding the right tier.

**Baseline Tiers:**

| Tier | Description | Selection Criteria |
|------|-------------|--------------------|
| Published State-of-Art | Highest-performing published method | Use when your contribution builds on this directly |
| Standard Baseline | Well-established method in the field | Use when your contribution is domain-adjacent |
| Ablation Anchor | Minimal viable version of your approach | Use when isolating contribution components |

**Selection Process:**

1. **Survey Literature:** Identify the top-5 performing methods on your benchmark.
2. **Assess Availability:** Check whether code is released and whether training is feasible within your compute budget.
3. **Verify Reproducibility:** Re-run each baseline and confirm that your implementation matches the numbers reported in the paper.
4. **Select 2-3 Baselines:** One published, one standard, one ablation anchor.

**Example Baseline Selection:**

Research Question: "Can linear attention replace softmax attention in NMT with <1% BLEU degradation?"

Selected Baselines:

- **Transformer (Vaswani et al., 2017):** Standard baseline, widely implemented
- **Linear Transformer (Katharopoulos et al., 2020):** Direct comparison to prior linear attention work
- **Simplified Transformer (no positional encoding):** Ablation anchor isolating attention mechanism

**Failure Mode:** "Paper baseline overfitting": tuning your method extensively while using reported numbers for baselines without verification. Always run baselines yourself.
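
The "verify reproducibility" step reduces to comparing two numbers against a tolerance chosen in advance. The sketch below is one way to record that comparison; the function name, the 2% relative tolerance and the sample scores are assumptions for illustration, not values from the course.

```python
def check_reproduction(reported, reproduced, rel_tol=0.02):
    """Compare a re-run baseline score with the score reported in the paper."""
    gap = reproduced - reported
    rel_gap = abs(gap) / abs(reported)
    return {"gap": gap, "rel_gap": rel_gap, "reproduced": rel_gap <= rel_tol}


# Hypothetical scores: the paper reports 27.3, our re-run reaches 26.9.
close = check_reproduction(reported=27.3, reproduced=26.9)
far = check_reproduction(reported=27.3, reproduced=25.0)
print(close["reproduced"], f"{close['rel_gap']:.1%}")
print(far["reproduced"], f"{far['rel_gap']:.1%}")
```

When the check fails, the gap itself is worth reporting: it tells readers whether your comparison uses the paper's number or your own re-run.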
  7. 07 Ablation Study Design (15 min)

Ablation studies are not optional: they are the primary mechanism for demonstrating that your contribution is responsible for observed improvements.

An ablation study systematically removes or modifies individual components to measure their contribution. Without ablation, you cannot distinguish genuine innovation from lucky hyperparameter selection.

**Ablation Categories:**

1. **Component Ablation:** Remove individual modules from your architecture.
2. **Configuration Ablation:** Vary hyperparameters of novel components.
3. **Architecture Ablation:** Replace novel modules with standard alternatives.

**Design Principles:**

- **Orthogonality:** Each ablation should test one variable at a time.
- **Coverage:** Ablation components should cover all novel elements.
- **Granularity:** Test both coarse-grained (module present/absent) and fine-grained (module with different configurations).

**Example Ablation Design:**

```python
import pandas as pd

# Full model configuration
baseline_config = {
    "novel_attention": True,            # Our contribution
    "positional_encoding": "roformer",  # Our contribution
    "layer_norm_style": "pre",          # Our contribution
    "dropout": 0.1,
    "lr": 1e-4,
}

# Ablation variants: each one changes a single component
ablation_configs = [
    {"novel_attention": False, "positional_encoding": "roformer", "layer_norm_style": "pre"},
    {"novel_attention": True, "positional_encoding": "sinusoidal", "layer_norm_style": "pre"},
    {"novel_attention": True, "positional_encoding": "roformer", "layer_norm_style": "post"},
    # ... full grid or random sampling depending on scale
]


def run_ablation_study(base_config, ablation_configs, num_seeds=3):
    # build_model and train_and_evaluate are project-specific functions
    # that you supply; they are not defined in this snippet.
    results = []
    for config in ablation_configs:
        for seed in range(num_seeds):
            merged_config = {**base_config, **config, "seed": seed}
            model = build_model(merged_config)
            metrics = train_and_evaluate(model, merged_config)
            results.append({"config": config, "seed": seed, **metrics})
    return pd.DataFrame(results)
```

**Common Pitfalls:**

- Testing ablation components jointly instead of independently (confounding effects)
- Running ablations on only one random seed (unreliable estimates)
- Skipping ablations because "the improvement is obvious" (reviewers will ask)
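
The orthogonality principle can be enforced mechanically by generating variants instead of typing them by hand. The following sketch derives one-variable-at-a-time configurations from the full model; the function name and the `alternatives` mapping are illustrative assumptions.

```python
def one_at_a_time(full_config, alternatives):
    """Yield (name, config) pairs that each change exactly one component."""
    for key, values in alternatives.items():
        for value in values:
            if value == full_config[key]:
                continue  # identical to the full model, nothing to ablate
            variant = dict(full_config)
            variant[key] = value
            yield f"{key}={value}", variant


full_config = {
    "novel_attention": True,
    "positional_encoding": "roformer",
    "layer_norm_style": "pre",
}
# Standard alternatives to swap in for each novel component.
alternatives = {
    "novel_attention": [False],
    "positional_encoding": ["sinusoidal"],
    "layer_norm_style": ["post"],
}

variants = list(one_at_a_time(full_config, alternatives))
for name, variant in variants:
    print(name, variant)
```

Because every variant is a copy of the full configuration with a single key changed, a difference in results can be attributed to that key alone.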
  8. 08 Quantitative Evaluation (15 min)

Quantitative evaluation is only meaningful when accompanied by statistical rigor. Report means, standard deviations, and significance tests, never single-run numbers.

Rigorous evaluation requires multiple seeds, proper statistical tests, and transparent reporting of both improvements and regressions.

**Statistical Framework:**

```python
import scipy.stats as stats
import numpy as np


def evaluate_significance(results_dict, alpha=0.05):
    """
    Compare treatment (our method) against baseline.
    Returns significance status and effect size.
    """
    baseline_scores = np.array(results_dict["baseline"])
    treatment_scores = np.array(results_dict["treatment"])

    # Welch's t-test (does not assume equal variances)
    t_stat, p_value = stats.ttest_ind(treatment_scores, baseline_scores, equal_var=False)

    # Cohen's d for effect size, using sample variances (ddof=1)
    pooled_std = np.sqrt(
        (np.var(baseline_scores, ddof=1) + np.var(treatment_scores, ddof=1)) / 2
    )
    cohens_d = (np.mean(treatment_scores) - np.mean(baseline_scores)) / pooled_std

    return {
        "p_value": p_value,
        "significant": p_value < alpha,
        "effect_size": cohens_d,
        "baseline_mean": np.mean(baseline_scores),
        "treatment_mean": np.mean(treatment_scores),
        "improvement_pct": (np.mean(treatment_scores) - np.mean(baseline_scores))
        / np.mean(baseline_scores) * 100,
    }
```

**Evaluation Reporting Template:**

| Metric | Baseline | Ours | Δ | p-value | Notes |
|--------|----------|------|---|---------|-------|
| BLEU | 28.4 | 29.8 | +1.4 | 0.003 | Significant |
| Params (M) | 250 | 245 | -2% | - | - |
| Latency (ms) | 45 | 38 | -16% | 0.012 | Significant |
| Memory (GB) | 12 | 11.5 | -4% | 0.08 | Not significant |

**Common Mistakes:**

- Reporting test set metrics without a validation set check (overfitting to test)
- Ignoring failed runs (survivorship bias in results)
- Comparing single runs that used different random seeds (the gap may be seed noise)
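
Before any significance test, the per-seed scores need to be summarized as mean and standard deviation. A minimal standard-library sketch of that step follows; the function name and the sample scores are assumptions for illustration.

```python
import statistics


def summarize_runs(runs):
    """Build a markdown table of mean and sample std for each system."""
    rows = ["| System | Mean | Std | Runs |", "|--------|------|-----|------|"]
    for name, scores in runs.items():
        if len(scores) < 2:
            raise ValueError(f"{name}: need at least 2 runs to report a std")
        mean = statistics.mean(scores)
        std = statistics.stdev(scores)  # sample std (n - 1 denominator)
        rows.append(f"| {name} | {mean:.2f} | {std:.2f} | {len(scores)} |")
    return "\n".join(rows)


# Hypothetical BLEU scores from three seeds per system.
runs = {"Baseline": [28.2, 28.4, 28.6], "Ours": [29.6, 29.8, 30.0]}
table = summarize_runs(runs)
print(table)
```

Refusing to summarize a single run is deliberate: it turns the "never single-run numbers" rule into something the tooling enforces.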
  9. 09 Qualitative Analysis (15 min)

Qualitative analysis explains the "why" behind quantitative results. Without it, you report observations but not understanding.

Quantitative metrics capture aggregate performance. Qualitative analysis reveals what your system actually does, and why it sometimes fails.

**Qualitative Analysis Methods:**

1. **Error Analysis:** Categorize and count failure modes.
2. **Qualitative Comparison:** Side-by-side examples of your output vs. baseline.
3. **Feature Visualization:** Examine what your model learned.
4. **Ablation Behavior:** Explain performance differences through architecture differences.

**Error Analysis Framework:**

```python
def categorize_errors(predictions, references, model_outputs):
    """
    Categorize errors to understand failure modes.

    Pass only examples already judged incorrect; a correct output
    would otherwise be counted under "other".
    """
    categories = {
        "fluency_error": 0,
        "accuracy_error": 0,
        "completeness_error": 0,
        "hallucination": 0,
        "other": 0,
    }
    for pred, ref, output in zip(predictions, references, model_outputs):
        error_type = classify_error(pred, ref, output)
        categories[error_type] += 1

    # Report proportions (left as zero counts when there are no examples)
    total = sum(categories.values())
    if total == 0:
        return categories
    return {k: v / total for k, v in categories.items()}


def classify_error(pred, ref, output):
    """Heuristic classification of error type.

    The four predicate helpers are task-specific and must be supplied
    by you; they are not defined in this snippet.
    """
    if contains_factual_error(output):
        return "accuracy_error"
    elif is_incomplete(output, ref):
        return "completeness_error"
    elif has_grammar_error(output):
        return "fluency_error"
    elif is_hallucinated(output, ref):
        return "hallucination"
    return "other"
```

**Example Qualitative Finding:**

"While our method achieves 1.4 BLEU improvement overall, error analysis reveals the improvement concentrates in long-sequence translation (+3.2 BLEU) while short sequences show marginal degradation (-0.3 BLEU). This aligns with our design hypothesis: linear attention maintains information better in long-range dependencies."

**Documentation Practice:**

- Include 3-5 representative examples for each major finding
- Use consistent formatting for all examples
- Annotate examples with explanatory comments
- Report confidence when qualitative judgments are subjective
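
A finding like the long-versus-short split in the example above comes from slicing per-example scores by an input property. The sketch below shows one way to do that; it assumes you have a per-example score for both systems (for instance a sentence-level metric), and the function name, the length boundary and the sample data are illustrative.

```python
from collections import defaultdict
from statistics import mean


def delta_by_length(examples, boundary=30):
    """Mean (ours - baseline) score per length bucket.

    `examples` holds (source_length, baseline_score, our_score) tuples.
    """
    buckets = defaultdict(list)
    for length, base, ours in examples:
        name = "long" if length >= boundary else "short"
        buckets[name].append(ours - base)
    return {name: mean(deltas) for name, deltas in buckets.items()}


# Hypothetical per-example scores, for illustration only.
examples = [
    (12, 0.50, 0.25),
    (18, 0.75, 0.75),
    (45, 0.25, 0.50),
    (60, 0.50, 1.00),
]
print(delta_by_length(examples))
```

The same pattern works for any slicing variable (domain, rare-word count, input language); the point is to check whether an aggregate gain hides a regression in one slice.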
  10. 10 Benchmarking (15 min)

Effective benchmarking separates reproducible research from wishful thinking. Without rigorous evaluation, claims about system performance are anecdotes, not evidence.

Benchmarking an AI research system requires measuring behavior across multiple dimensions: accuracy, latency, throughput, memory footprint, and failure modes. Each dimension matters for different deployment contexts.

### Setting Up Evaluation Infrastructure

A reliable benchmark harness needs three components: standardized datasets, automated measurement, and result persistence.

```python
# benchmark_runner.py
import json
import time
import psutil
from dataclasses import dataclass, asdict
from typing import Callable
from pathlib import Path


@dataclass
class BenchmarkResult:
    name: str
    latency_p50_ms: float
    latency_p95_ms: float
    latency_p99_ms: float
    throughput_tokens_per_sec: float
    peak_memory_mb: float
    error_rate: float
    total_requests: int


class BenchmarkRunner:
    def __init__(self, output_dir: Path):
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self.results: list[BenchmarkResult] = []

    def run(self, name: str, fn: Callable, test_cases: list, iterations: int = 100):
        latencies = []
        errors = 0
        peak_rss = 0
        process = psutil.Process()

        for _ in range(iterations):
            for case in test_cases:
                start = time.perf_counter()
                try:
                    fn(case)
                    latencies.append((time.perf_counter() - start) * 1000)
                except Exception:
                    errors += 1  # count failures instead of hiding them
                # Sample resident memory after each request to approximate the peak
                peak_rss = max(peak_rss, process.memory_info().rss)

        if not latencies:
            raise RuntimeError(f"{name}: every request failed, nothing to report")
        latencies.sort()
        n = len(latencies)
        total = iterations * len(test_cases)
        result = BenchmarkResult(
            name=name,
            latency_p50_ms=latencies[n // 2],
            latency_p95_ms=latencies[int(n * 0.95)],
            latency_p99_ms=latencies[int(n * 0.99)],
            throughput_tokens_per_sec=self._calculate_throughput(
                test_cases, iterations, sum(latencies) / 1000
            ),
            peak_memory_mb=peak_rss / 1024 / 1024,
            error_rate=errors / total,
            total_requests=total,
        )
        self.results.append(result)
        self._persist(result)
        return result

    def _calculate_throughput(self, test_cases, iterations, total_seconds):
        # Assumption: whitespace-split length stands in for the token count.
        # Swap in your tokenizer for real measurements.
        tokens = sum(len(str(case).split()) for case in test_cases) * iterations
        return tokens / total_seconds if total_seconds > 0 else 0.0

    def _persist(self, result: BenchmarkResult):
        path = self.output_dir / f"{result.name}.json"
        with open(path, 'w') as f:
            json.dump(asdict(result), f)
```

### Common Benchmarking Failures

**Survivorship bias** occurs when evaluating only successful outputs. Track error rates explicitly: a system that fails 5% of the time can look fine in a demo, but a 5% failure rate in production creates user frustration.

**Benchmark leakage** happens when training data overlaps with evaluation data. Always maintain strict separation. Use holdout datasets that the system has never seen.

**Warm-up omission** skews latency measurements. GPUs and CPU caches need initialization time. Discard the first 10-20 requests before measuring.

**Small sample sizes** produce unreliable metrics. Aim for at least 100 measurements per metric. Variance in AI system outputs requires larger samples than deterministic systems.

### Benchmark Suite Design

Create tiered benchmarks:

- **Unit benchmarks**: Single operations (tokenization, embedding lookup)
- **Integration benchmarks**: Multi-step pipelines
- **End-to-end benchmarks**: Complete user workflows

Track regressions by maintaining historical baselines. A 2% regression on a benchmark you run weekly is visible; one you run once per project is invisible.
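
The warm-up and sample-size advice above can be folded into the measurement helper itself. This is a minimal standard-library sketch, not part of the course's harness: the function names and default counts are assumptions, and percentiles use the nearest-rank definition.

```python
import math
import time


def percentile(sorted_values, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    if not sorted_values:
        raise ValueError("no measurements")
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def measure_latency_ms(fn, case, warmup=10, samples=100):
    """Time fn(case) after discarding `warmup` untimed calls."""
    for _ in range(warmup):
        fn(case)  # warms caches and lazy initialization; result is not timed
    latencies = []
    for _ in range(samples):
        start = time.perf_counter()
        fn(case)
        latencies.append((time.perf_counter() - start) * 1000)
    latencies.sort()
    return {f"p{p}": percentile(latencies, p) for p in (50, 95, 99)}


# Usage with a stand-in workload.
print(measure_latency_ms(lambda text: text.upper(), "hello world"))
```

Keeping warm-up calls out of the latency list entirely, instead of trimming them afterwards, avoids accidentally discarding real measurements when the sample count changes.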
  11. 11 Technical Paper Writing (15 min)

A technical paper is a communication artifact, not a proof of work. The goal is helping readers understand, efficiently, what you built, why it matters, and how to evaluate it.

Writing for technical venues requires balancing precision with accessibility. Researchers reviewing your paper have limited time and many submissions to evaluate. Structure your writing so the core contribution is visible within the first two pages.

### Writing Principles for Systems Papers

**Lead with the problem, not the solution.** Readers need to understand why anyone should care before investing time in understanding your approach. Start with a concrete pain point that practitioners face.

**Quantify claims whenever possible.** "Our system is faster" is weak. "Our system processes queries at 2.3× the throughput of the baseline while maintaining equivalent accuracy" is strong. Specific numbers enable direct comparison.

**Name your contributions explicitly.** Use numbered contributions: "We make the following contributions: (1) A novel approach to X that achieves Y... (2) An open-source implementation..." Reviewers look for this section first.

### Handling Uncertainty

AI systems involve inherent variability. Write about performance honestly:

```python
# Don't write:
"Our system achieves state-of-the-art performance."

# Write:
"Our system achieves 94.2% accuracy on the benchmark (baseline: 91.8%, std dev across 5 runs: ±0.6%). The improvement is statistically significant (p < 0.01, paired t-test)."

# Don't write:
"Our approach works well for most cases."

# Write:
"Our approach succeeds on 89% of test cases. The remaining 11% typically involve [specific characteristics]. We discuss limitations in Section 5."
```

### Common Weaknesses in Systems Papers

**Missing baselines** make claims uninterpretable. Always compare against the current best approach and at least one simple alternative.

**Missing ablation studies** make it unclear which components matter. If you claim component X is essential, show performance without it.

**Missing error analysis** wastes the opportunity to learn. Show specific examples where your system fails and analyze why.

**Missing reproducibility information** frustrates readers and reviewers. Include model sizes, training durations, hyperparameters, and compute requirements.

### Revision Workflow

Technical writing improves through iteration:

1. **First draft**: Get everything down, focus on content over polish
2. **Structure review**: Ensure logical flow, each section has clear purpose
3. **Sentence-level editing**: Cut unnecessary words, clarify ambiguous phrasing
4. **Technical review**: Verify all claims match experiments
5. **External review**: Have someone unfamiliar with the work read it
  12. 12 Paper Structure (20 min)

Standard paper structure exists because it works. Deviations should be deliberate, not from ignorance. Readers have mental models for where information lives, and fighting those models creates friction.

Academic and technical paper structure has evolved to optimize reader comprehension. Respect conventions unless you have specific reasons to deviate.

### The Classic Structure

**Abstract (150-250 words)**: Compressed summary. One sentence each on motivation, approach, results, and implications. Write this last, since it should reflect everything else.

**Introduction (1-2 pages)**: Establish the problem, motivate why it matters, preview contributions, roadmap the paper. End with a summary of contributions, often bulleted.

**Background/Related Work (1-2 pages)**: Context readers need. Define terminology, explain prior approaches, identify gaps your work fills. This section prevents reinventing explanations and shows scholarly awareness.

**Approach (2-4 pages)**: Core technical content. Explain what you built with enough detail for replication but not so much that the core gets lost. Use figures to convey architecture.

**Experiments (2-4 pages)**: Evaluation setup, metrics, baselines, results. Present results before interpreting them; the interpretation comes next.

**Discussion (1-2 pages)**: What the results mean. Why they turned out this way. Generalizability and limitations.

**Conclusion (1 paragraph)**: Summary and future directions. Don't introduce new claims.

### Section-Level Organization

Within sections, use the "topic sentence + support" pattern:

```markdown
## 3.2 Semantic Indexing

The semantic index stores compressed representations of document content
to enable approximate similarity search. [Topic sentence]

Our index consists of two components: [supporting detail 1]
The index is built offline during document ingestion... [supporting detail 2]
Updates to the index are batched to amortize overhead... [supporting detail 3]
```

Each paragraph should have one main idea expressed in the first sentence. Supporting sentences provide evidence, examples, or elaboration. If a paragraph's first sentence doesn't summarize it, restructure.

### Figures and Tables as Arguments

A figure showing performance comparison makes a claim. The caption must state the claim:

```markdown
# Bad caption:
"Figure 3: Performance results"

# Good caption:
"Figure 3: Our system achieves 2.3× higher throughput than the baseline across all batch sizes while maintaining <1% accuracy degradation (see Section 4.2 for accuracy breakdown)."
```

### Length Management

Most venues have strict page limits. Allocate space proportionally:

- Problem and motivation: 10-15% of length
- Approach: 30-40% of length
- Experiments: 30-40% of length
- Related work and conclusion: 10-20% of length

Cut by removing redundant explanations, not by shrinking text. A tight, complete paper beats an expanded, padded one.
  13. 13 Experimental Results (20 min)

Experimental results are evidence, not decoration. Every number in a results section should support a specific claim, and every claim should be backed by evidence.

Present results to build an argument: these experiments, when interpreted correctly, support these conclusions. Readers should be able to trace from claim to evidence.

### Organizing Experiments

Group experiments by the claim they support:

```
## 4. Experimental Results
### 4.1 End-to-End Performance (supports main claim)
### 4.2 Ablation Study (supports claim that components matter)
### 4.3 Error Analysis (supports claim about failure modes)
### 4.4 Scaling Behavior (supports claim about generalization)
```

This structure makes it easy for readers to find evidence for specific claims.

### Quantitative Presentation

Use tables for comparisons with baselines:

```python
# Results table for paper
results_table = """
| System | Accuracy | Latency (ms) | Memory (GB) |
|--------|----------|--------------|-------------|
| Baseline | 91.8% | 234 | 8.2 |
| Ours | 94.2% | 198 | 7.1 |
| +Compression | 94.0% | 145 | 4.3 |
| +Pruning | 93.1% | 89 | 2.8 |
"""
```

Use figures for trends and distributions:

```python
from pathlib import Path

import matplotlib.pyplot as plt


# Scaling behavior visualization; the measurements are passed in explicitly
def plot_scaling_results(dataset_sizes, accuracies, baseline_accuracies,
                         batch_sizes, latencies, baseline_latencies):
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))

    # Left: accuracy vs dataset size
    axes[0].plot(dataset_sizes, accuracies, 'o-', label='Ours')
    axes[0].plot(dataset_sizes, baseline_accuracies, 's--', label='Baseline')
    axes[0].set_xlabel('Training Data Size')
    axes[0].set_ylabel('Accuracy')
    axes[0].legend()

    # Right: latency vs batch size (log-log)
    axes[1].loglog(batch_sizes, latencies, 'o-', label='Ours')
    axes[1].loglog(batch_sizes, baseline_latencies, 's--', label='Baseline')
    axes[1].set_xlabel('Batch Size')
    axes[1].set_ylabel('Latency (ms)')
    axes[1].legend()

    plt.tight_layout()
    Path('figures').mkdir(exist_ok=True)  # savefig does not create directories
    plt.savefig('figures/scaling_results.pdf')
```

### Statistical Rigor

AI systems have inherent variance. Report uncertainty:

- **Standard deviation**: For repeated runs with different random seeds
- **Confidence intervals**: For estimated population parameters
- **Statistical tests**: For comparing systems (t-test, bootstrap)

```python
import numpy as np
from scipy import stats


# Calculate and report confidence intervals
def report_with_ci(values, confidence=0.95):
    mean = np.mean(values)
    sem = stats.sem(values)  # Standard error of mean
    # Student's t interval, which suits the small samples typical of seed runs
    low, high = stats.t.interval(confidence, len(values) - 1, loc=mean, scale=sem)
    return f"{mean:.2f} ± {high - mean:.2f} ({confidence:.0%} CI)"
```

### Handling Negative Results

Don't hide experiments where your approach didn't win. Negative results are valuable information:

> "Surprisingly, adding retrieval augmentation decreased performance on tasks requiring precise factual recall (Table 4, rows 3-4). Analysis reveals that retrieved passages occasionally contained contradictory information that confused the model. We address this in Section 5.2."
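
The bootstrap mentioned under statistical tests needs nothing beyond the standard library. The sketch below computes a percentile bootstrap confidence interval for the difference in mean score between two systems; the function name, resample count and sample scores are assumptions, and with only a handful of seeds the interval is coarse.

```python
import random
from statistics import mean


def bootstrap_diff_ci(treatment, baseline, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap CI for mean(treatment) - mean(baseline)."""
    rng = random.Random(seed)  # seeded so the reported interval is repeatable
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping its original size.
        t = [rng.choice(treatment) for _ in treatment]
        b = [rng.choice(baseline) for _ in baseline]
        diffs.append(mean(t) - mean(b))
    diffs.sort()
    tail = (1.0 - confidence) / 2.0
    low = diffs[int(tail * n_resamples)]
    high = diffs[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return low, high


# Hypothetical scores from five seeds per system.
baseline_scores = [28.1, 28.4, 28.6, 28.3, 28.5]
treatment_scores = [29.6, 29.8, 30.0, 29.7, 29.9]
low, high = bootstrap_diff_ci(treatment_scores, baseline_scores)
print(f"difference in means, 95% CI: [{low:.2f}, {high:.2f}]")
```

An interval that excludes zero supports the claim that the systems differ; an interval that straddles zero is a negative result worth reporting as such.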
  14. 14VisualizationA good figure communicates a finding in seconds. A bad figure requires minutes of explanation. Invest in visualization—the effort pays back every time someone reads your work. Figures are often the first thing readers examine and the most memorable element of technical papers. Poor visualization undermines otherwise strong technical work. ### Principles of Effective Scientific Figures **Reduce cognitive load.** Every visual element should help communicate the finding. Remove decorative elements, grid lines, and chart junk. **Use appropriate encodings.** Position (x, y coordinates) is most accurately perceived. Color should be reserved for categorical distinction, not quantitative value—readers perceive color differences non-linearly. **Maximize data-ink ratio.** The Data-Ink Ratio is the proportion of ink used for actual data. Eliminate decorative borders, backgrounds, and redundant labels. ### Common Visualization Types for AI Systems **Bar charts**: Compare discrete quantities. Use when comparing 2-10 items. ```python import matplotlib.pyplot as plt import numpy as np def plot_comparison(): systems = ['Baseline', 'Ours', '+Tuning', '+Augmentation'] accuracy = [91.8, 94.2, 95.1, 95.8] errors = [0.6, 0.5, 0.4, 0.4] # Standard deviation x = np.arange(len(systems)) bars = plt.bar(x, accuracy, yerr=errors, capsize=5, color='steelblue') plt.ylabel('Accuracy (%)') plt.xticks(x, systems) plt.ylim(88, 98) # Add value labels on bars for bar, val in zip(bars, accuracy): plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.3, f'{val:.1f}', ha='center', va='bottom', fontsize=10) plt.tight_layout() ``` **Line plots**: Show continuous relationships. Use when x-axis has meaningful order (dataset size, model size, iteration). **Scatter plots**: Show correlations or distributions. Essential for error analysis. 
### Architecture Diagrams

System architecture figures need clarity:

```mermaid
graph TB
    A[Input Query] --> B[Retrieval Module]
    B --> C[Context Builder]
    D[Document Store] --> B
    C --> E[Inference Engine]
    E --> F[Response Formatter]
    F --> G[Output]
    style B fill:#e1f5fe
    style E fill:#e8f5e8
```

Include enough detail that readers understand data flow, but not so much that the main structure is obscured. Show the happy path first; error handling and edge cases can be described in text.

### Color Palette Considerations

Choose palettes that work for colorblind readers. The "jet" colormap, still common in Python plots, is not perceptually uniform and is hard to read for the roughly 8% of men with some form of color vision deficiency. Use viridis, cividis, or a custom palette designed for accessibility. If you must use color to encode information, add redundant encoding (shape, pattern) so the figure is interpretable in grayscale.

20 min
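The redundant-encoding advice above can be sketched as a line plot in which each series differs in color, marker, and line style at once. This is an illustrative sketch, assuming matplotlib is installed. The function name `plot_accessible_lines` and the axis labels are hypothetical placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

def plot_accessible_lines(series, x_values):
    """Plot each series with a distinct color, marker, and line style."""
    markers = ["o", "s", "^", "D"]
    linestyles = ["-", "--", "-.", ":"]
    cmap = plt.get_cmap("viridis")
    fig, ax = plt.subplots(figsize=(5, 3))
    n = len(series)
    for i, (label, y_values) in enumerate(series.items()):
        # Sample viridis but stop short of its pale yellow end,
        # which is hard to see on a white background.
        color = cmap(0.85 * i / max(n - 1, 1))
        ax.plot(x_values, y_values, label=label, color=color,
                marker=markers[i % len(markers)],
                linestyle=linestyles[i % len(linestyles)])
    ax.set_xlabel("Training examples")  # placeholder label
    ax.set_ylabel("Accuracy (%)")       # placeholder label
    ax.legend(frameon=False)
    fig.tight_layout()
    return fig
```

Because marker and line style carry the same distinction as color, the series stay distinguishable when the figure is printed in grayscale.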
  15. 15Open-Source ReleaseReleasing code is not the same as enabling reproduction. Code without environment specification, usage examples, and maintenance is a liability for both users and authors. Open-sourcing your research system extends its impact beyond readers who can reconstruct your work from paper descriptions. However, sloppy releases create more problems than they prevent. ### What to Release **Core implementation**: The code that produces your results. Clean it enough for others to read, but don't over-abstract—readability matters more than elegance in research code. **Evaluation scripts**: The exact commands used to generate your benchmark numbers. Reproducibility requires more than "run the evaluation script"—include the exact parameters, seeds, and environment. **Trained models/weights**: If feasible. Large models may be too big for standard hosting, but consider model cards and pointers to external hosting. **Data splits**: The exact train/validation/test splits, especially if you're releasing a new benchmark. Without the splits, others cannot reproduce your evaluation. 
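One way to capture the "exact parameters, seeds, and environment" and the data splits described above is to write a small manifest next to each set of results. The sketch below uses only the standard library. The names `split_fingerprint` and `build_manifest`, and the manifest layout, are assumptions for illustration, not a format the course prescribes.

```python
import hashlib
import json
import platform
import sys
from importlib import metadata

def split_fingerprint(example_ids):
    """Hash a split's example IDs so others can confirm they hold the same split."""
    digest = hashlib.sha256()
    for example_id in sorted(str(i) for i in example_ids):
        digest.update(example_id.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()

def build_manifest(seed, splits, packages=()):
    """Collect the details someone needs to rerun an evaluation."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "seed": seed,
        "packages": versions,
        "splits": {name: split_fingerprint(ids) for name, ids in splits.items()},
    }

if __name__ == "__main__":
    manifest = build_manifest(42, {"train": [1, 2, 3], "test": [4, 5]}, packages=("numpy",))
    print(json.dumps(manifest, indent=2))
```

Sorting the IDs before hashing makes the fingerprint independent of file order, so it changes only when the split's membership changes.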
```python # release_checklist.py RELEASE_CHECKLIST = """ [ ] Code runs without modifications on a fresh environment [ ] All dependencies specified with versions [ ] README with: [ ] Installation instructions [ ] Quick start example [ ] Expected hardware requirements [ ] License [ ] Evaluation reproduces claimed results [ ] Known limitations documented [ ] Issues or PRs monitored and responded to [ ] Model card (if releasing models) """ ``` ### Repository Structure ``` research-system/ ├── README.md # Overview, quick start ├── LICENSE # AGPL-3.0, MIT, Apache 2.0 ├── requirements.txt # Exact dependencies ├── setup.py # Installation ├── docs/ # Full documentation │ ├── getting_started.md │ ├── api_reference.md │ └── evaluation.md ├── src/ # Source code │ └── research_system/ ├── examples/ # Usage examples ├── scripts/ # Evaluation and training scripts ├── configs/ # Default configurations └── tests/ # Unit tests ``` ### License Selection Choose a license appropriate for your goals: - **MIT/Apache 2.0**: Permissive, maximizes adoption. Good for tools others will incorporate into larger projects. - **GPL-3.0**: Requires derivative works to be open source. Good if you want to prevent proprietary competition. - **CC BY 4.0**: For documentation and model cards, not code. AGPL-3.0 is increasingly common for AI systems—it's GPL but covers network use, important for hosted services. ### Maintenance Expectations Open-source comes with obligations. Before releasing, decide: - Will you respond to issues and PRs? On what timeline? - Will you accept external contributions? - Will you maintain compatibility across versions? Being clear about limitations ("this is a research release, not production-ready") manages expectations and reduces support burden.20 min
  16. 16 Documentation
Documentation is a user interface. Just as you would test user-facing interfaces, you should test documentation through the eyes of someone who doesn't know your system.

Documentation transforms code from "exists" to "usable." A well-documented system with mediocre code outlasts a poorly documented system with excellent code.

### Documentation Tiers

**API reference**: Complete, accurate descriptions of every public function, class, and module. Generated from docstrings using tools like Sphinx or pdoc.

**Tutorial**: Step-by-step guide walking new users through a complete use case. Assumes minimal prior knowledge. Should be completable in 15-30 minutes.

**How-to guides**: Solutions for specific common tasks. "How to evaluate on a custom dataset." "How to add a new model architecture." Assumes some familiarity.

**Explanation**: Conceptual discussion of why the system works as it does. Architecture decisions, design trade-offs, background concepts.

```python
# docstring_example.py
# Document is the project's own document type; the future import defers
# annotation evaluation so this file imports even before Document is defined.
from __future__ import annotations

def retrieve(query: str, top_k: int = 10, filters: dict | None = None) -> list[Document]:
    """Retrieve documents relevant to a query.

    Args:
        query: Natural language search query.
        top_k: Maximum number of documents to return. Default 10.
        filters: Optional metadata filters. Keys are field names,
            values are acceptable values. Default None.

    Returns:
        List of Document objects sorted by relevance score (descending).
        Empty list if no documents match.

    Raises:
        ValueError: If top_k < 1 or query is empty.
        ConnectionError: If the document store is unreachable.

    Example:
        >>> docs = retrieve("transformer architecture", top_k=5)
        >>> print(f"Found {len(docs)} documents")
        Found 5 documents
    """
```

### Writing for Users

Technical documentation serves users with different goals:

- **Evaluators**: Want to understand if the system meets their needs before committing time. Provide high-level overview, benchmark results, known limitations.
- **Integrators**: Want to build on your system. Provide API reference, architecture overview, extension points.
- **Debuggers**: Have a broken system. Provide troubleshooting guides, common error messages and solutions, logging configuration.

### Documentation Testing

Documentation rots without maintenance. Test it:

```python
# docs/test_docs.py
import subprocess
import sys
from pathlib import Path

def test_readme_install():
    """Verify README installation commands work."""
    # Run pip through the interpreter executing the tests so the install
    # targets the same environment the tests run in.
    result = subprocess.run(
        [sys.executable, "-m", "pip", "install", "-e", "."],
        capture_output=True, text=True, timeout=300
    )
    assert result.returncode == 0, f"Installation failed: {result.stderr}"

def test_examples_run():
    """Verify all example scripts execute without error."""
    examples_dir = Path("examples")
    for example in sorted(examples_dir.glob("*.py")):
        # A timeout keeps a hanging example from stalling the whole suite
        result = subprocess.run(
            [sys.executable, str(example)],
            capture_output=True, text=True, timeout=300
        )
        assert result.returncode == 0, f"{example} failed: {result.stderr}"
```

### Maintenance Practices

- Treat documentation updates like code changes: require PRs and review them
- Add "documentation needed" labels to issues when code lacks docs
- Review docs during feature development, not after

20 min
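Docstring examples can rot just like README commands. As a sketch of how they might be checked, the standard-library `doctest` module can execute the `>>>` examples in a docstring and compare their output. The helper names `normalize_query` and `run_docstring_examples` below are hypothetical and exist only to illustrate the idea.

```python
import doctest

def normalize_query(query: str) -> str:
    """Lowercase a query and collapse repeated whitespace.

    Hypothetical helper, used here only to carry a doctest.

    >>> normalize_query("  Transformer   Architecture ")
    'transformer architecture'
    """
    return " ".join(query.lower().split())

def run_docstring_examples(func):
    """Run the examples in one function's docstring; return (failed, attempted)."""
    parser = doctest.DocTestParser()
    # The examples only see the names placed in this globals dict
    test = parser.get_doctest(func.__doc__ or "", {func.__name__: func},
                              func.__name__, None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    results = runner.run(test)
    return results.failed, results.attempted

if __name__ == "__main__":
    failed, attempted = run_docstring_examples(normalize_query)
    print(f"{attempted} example(s) run, {failed} failed")
```

For a whole package, pytest's `--doctest-modules` option collects the same examples without a custom helper. The sketch above only shows what such a check does.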
  17. 17Community PresentationA conference talk is not a paper read aloud. The best talks show what papers cannot—enthusiasm, intuition, and the human story of discovery. Presenting to an audience requires different skills than writing for readers. People cannot re-read a confusing sentence; they can only ask confused questions or zone out. ### Talk Structure **Hook (30 seconds):** Why should the audience care? Start with a problem, demo, or provocative question—not with "I'm going to talk about X." **Context (2-3 minutes):** What did people do before your work? What are the limitations? This establishes why your work matters. **Approach (5-8 minutes):** What did you build? Show architecture, key innovations, design decisions. Use visuals, not bullet points. **Results (3-5 minutes):** What did it achieve? Show numbers, comparisons, demos. Let the data speak. **Implications (2-3 minutes):** What does this mean for the field? What can others learn or build on? **Q&A preparation (implicit):** Anticipate questions and prepare answers. ### Visual Design for Talks Slides should support the spoken content, not repeat it: ```python # BAD: Bullet points that repeat what's said """ - Our system uses a novel retrieval mechanism - It achieves 94% accuracy - It's 2x faster than the baseline """ # GOOD: Visual that the speaker elaborates on """ [Architecture diagram] "The key innovation is the indexing layer here— instead of brute-force search, we use..." """ ``` Use high-contrast colors, large fonts (minimum 24pt for body text), and minimal text. One idea per slide, maximum. ### Handling Nerves Presentation anxiety is universal. 
Mitigation strategies: - **Practice until boring**: Know the material so deeply that nerves don't affect delivery - **Video record yourself**: Identify nervous habits (filler words, pacing) that you don't notice in the moment - **Arrive early**: Familiarity with the room reduces anxiety - **Breathe**: Slow, deliberate breathing before starting resets the nervous system ### Live Demos Demos are high-risk, high-reward. A working demo creates memorable impact. A broken demo creates memorable failure. ```python # Demo safety checklist DEMO_CHECKLIST = """ [ ] Demo runs correctly in the presentation environment [ ] Backup recording exists if live demo fails [ ] Demo data is appropriate for presentation context [ ] All necessary resources are accessible offline [ ] Timer set to know when to advance without counting """ ```20 min
  18. 18 Research System Project
Building a research system synthesizes everything in this course: problem definition, system design, implementation, evaluation, and communication. The process reveals gaps that isolated exercises cannot.

This final chapter provides a structured project that applies the course material holistically. The project scope is deliberately bounded: sufficient for demonstration, not publication.

### Project Specification

**Objective**: Build a research system that answers questions using a retrieval-augmented approach over a domain-specific corpus.

**Core Components**:

```python
# project_architecture.py
"""
research_system/
├── src/
│   ├── __init__.py
│   ├── retrieval/          # Document retrieval module
│   │   ├── __init__.py
│   │   ├── indexer.py      # Build document index
│   │   └── searcher.py     # Query index
│   ├── generation/         # Answer synthesis module
│   │   ├── __init__.py
│   │   └── synthesizer.py  # Combine retrieved context
│   └── evaluation/         # Assessment module
│       ├── __init__.py
│       └── metrics.py      # Accuracy, latency, coverage
├── tests/
│   ├── test_retrieval.py
│   ├── test_generation.py
│   └── test_integration.py
├── docs/
│   ├── README.md
│   ├── architecture.md
│   └── evaluation.md
├── scripts/
│   ├── index_corpus.py
│   └── run_benchmark.py
├── data/
│   └── sample_corpus/      # Domain-specific data
└── requirements.txt
"""
# Document, ResearchSystem, TestCase, and EvaluationResult are types you define
# in the project; the future import keeps their annotations unevaluated so this
# skeleton imports cleanly before they exist.
from __future__ import annotations

# Key interfaces
class DocumentIndex:
    def build(self, documents: list[Document]) -> None:
        """Build index from documents."""
        ...

    def search(self, query: str, top_k: int) -> list[tuple[Document, float]]:
        """Search index for relevant documents."""
        ...

class AnswerSynthesizer:
    def __init__(self, model_path: str):
        """Initialize with specified model."""
        ...

    def generate(self, question: str, context: list[Document]) -> str:
        """Generate answer given question and context."""
        ...

class EvaluationSuite:
    def run(self, system: ResearchSystem, test_set: list[TestCase]) -> EvaluationResult:
        """Run full evaluation over every test case."""
        ...
```

### Requirements

1. **Retrieval**: Index a corpus of at least 1,000 documents and retrieve relevant documents for arbitrary queries, with precision above 70% in the top 5 results
2. **Generation**: Generate coherent answers that incorporate retrieved context, with no claims that the retrieved context does not support
3. **Evaluation**: Produce quantitative metrics including accuracy, latency, and retrieval precision; compare against a simple baseline (e.g., TF-IDF retrieval)
4. **Documentation**: README with installation, usage, and architecture description; inline documentation for all public interfaces
5. **Benchmarking**: Measure performance across at least 100 queries; report latency distribution and accuracy metrics

### Evaluation Criteria

| Component | Criteria | Weight |
|-----------|----------|--------|
| Retrieval | Accuracy, relevance quality | 25% |
| Generation | Answer quality, faithfulness to context | 25% |
| Code Quality | Structure, documentation, tests | 20% |
| Evaluation | Rigorous benchmarking, statistical reporting | 15% |
| Communication | README clarity, presentation | 15% |

### Common Pitfalls

**Over-engineering the index**: Start simple. A working TF-IDF baseline with 60% accuracy is better than a broken dense retriever with a theoretical 90% accuracy.

**Skipping the baseline**: Without comparison, results are uninterpretable. Always have a simple baseline to beat.

**Ignoring latency**: A system that works but takes 30 seconds per query won't be used. Measure and optimize.

**Undocumented limitations**: Be explicit about what your system cannot do. This is not weakness; it's honest engineering.

15 min
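The retrieval and benchmarking requirements above (precision in the top 5, latency distribution over at least 100 queries) can be measured with a few lines of standard-library code. This is a minimal sketch under stated assumptions: `search_fn(query, k)` is a hypothetical callable returning a ranked list of document IDs, and `relevant_by_query` is a hypothetical mapping from each query to its set of relevant IDs.

```python
import statistics
import time

def precision_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the top-k slots filled by relevant documents.

    Dividing by k (not by the number returned) penalizes a system
    that returns fewer than k results.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / k

def benchmark(search_fn, queries, relevant_by_query, k=5):
    """Run every query once; report mean precision@k and latency percentiles."""
    precisions, latencies_ms = [], []
    for query in queries:
        start = time.perf_counter()
        retrieved = search_fn(query, k)
        latencies_ms.append((time.perf_counter() - start) * 1000)
        precisions.append(precision_at_k(retrieved, relevant_by_query[query], k))
    # n=100 yields 99 cut points; index 49 is the median, index 94 the 95th percentile.
    # statistics.quantiles needs at least two queries.
    cuts = statistics.quantiles(latencies_ms, n=100)
    return {
        "precision_at_k": statistics.mean(precisions),
        "latency_p50_ms": cuts[49],
        "latency_p95_ms": cuts[94],
    }
```

Running the same `benchmark` call on the TF-IDF baseline and on the full system gives directly comparable numbers. Reporting the 95th percentile alongside the median exposes slow outlier queries that a mean would hide.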