Scientific
genomics ai
protein design

Biology & Genomics

Genomic sequence analysis, protein design, single-cell analysis. ESM, RoseTTAFold, scGPT.

Setup walkthrough

  1. Biology AI spans: protein structure prediction, genomic sequence analysis, single-cell RNA-seq, and literature reasoning.
  2. For protein structure: pip install colabfold (local AlphaFold2). Feed a FASTA sequence → predicted PDB structure. A 400-residue protein: 30-60 minutes on 12 GB GPU.
  3. For protein design: pip install proteinmpnn (ProteinMPNN — inverse folding: given a backbone structure, design an amino acid sequence that folds to it). Runs in seconds on GPU.
  4. For genomic analysis: pip install biopython + pip install esm (ESM-2, Meta's protein language model). ESM-2 computes embeddings for protein sequences used for variant effect prediction, structure prediction, and functional annotation.
  5. For single-cell RNA-seq: pip install scvi-tools (scVI — deep generative model for scRNA-seq). Trains on 100K cells in 30-60 minutes on GPU. Handles batch correction, clustering, differential expression.
  6. For biological literature Q&A: general LLMs handle PubMed-level biology questions. For specialized genomics knowledge, fine-tune or RAG over domain corpora.

The cheap setup

Used RTX 3060 12 GB (~$200-250, see /hardware/rtx-3060-12gb). Runs ColabFold for proteins up to 500 residues in 30-90 minutes — enough for most single-domain proteins. ESM-2 embeddings for 100K sequences in ~30 minutes. scVI training on 100K cells in 30-60 minutes. For a molecular biology lab automating routine analysis: $400-500 handles protein structure prediction, variant effect scoring, and single-cell analysis. Pair with Ryzen 5 5600 + 32 GB DDR4 + 1TB NVMe (genomic datasets are large). Total: ~$400-480. Biology AI at $400 replaces cloud compute costs for routine tasks.

The serious setup

Used RTX 3090 24 GB (~$700-900, see /hardware/rtx-3090). Runs ColabFold for large multi-domain proteins (800+ residues) in 60-120 minutes. ESM-2 15B (largest variant) embeddings improve downstream prediction accuracy for variant effects. For genomics-scale analysis: training scVI on 1M+ cells in 2-4 hours. For a computational biology lab: RTX 3090 handles 80% of routine workloads. The remaining 20% (genome-wide association studies, metagenomics assembly) need CPU clusters, not GPU. Total: ~$1,800-2,200. For AlphaFold-class structure prediction at scale: dual RTX 3090 for parallel predictions (predict multiple proteins simultaneously).

Common beginner mistake

The mistake: Running ColabFold on a protein sequence, getting a predicted structure, and treating it as experimentally determined ground truth. Why it fails: AlphaFold/ColabFold predictions are statistical — they predict the most likely structure given the training data. For well-studied protein families: accuracy approaches experiment (~1-2 Å RMSD). For novel proteins, disordered regions, or proteins with rare folds: predictions can be wildly wrong (10+ Å RMSD) with high confidence (high pLDDT scores on wrong structures). The confidence metric (pLDDT) correlates with accuracy on average but is unreliable for individual predictions. The fix: Always check the predicted aligned error (PAE) plot — it shows which regions are reliably positioned relative to each other. Low PAE between domains = high confidence in relative orientation. High PAE = the domains could be anywhere. For publication-quality structures, validate with experimental methods (X-ray crystallography, cryo-EM) or multiple orthogonal predictions (AlphaFold2 + ESMFold + RoseTTAFold — agreement between methods increases confidence). Predicted ≠ determined.

Recommended setup for biology & genomics

Recommended runtimes

Browse all tools for runtimes that fit this workload.

Reality check

Local AI workloads have real hardware constraints that vary by task type. VRAM ceiling decides what model fits; bandwidth decides decode speed; compute decides prefill speed. Pick the GPU tier that fits your actual workload, not the spec sheet.

Common mistakes

  • Buying for spec-sheet VRAM without modeling KV cache + activation overhead
  • Underestimating quantization quality loss below Q4
  • Skipping flash-attention support (real perf gap on long context)
  • Ignoring sustained-load thermals (laptops thermal-throttle within 30 min)

What breaks first

The errors most operators hit when running biology & genomics locally. Each links to a diagnose+fix walkthrough.

Before you buy

Verify your specific hardware can handle biology & genomics before committing money.

Specialized buyer guides
Updated 2026 roundup