HOW-TO · SUP

How to Set Up Batch Processing for Large Document Sets

Intermediate · 30 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
Prerequisites

Python, document processing libraries, LLM endpoint

What this does

Batch processing handles thousands of documents by splitting them into parallel batches, enforcing rate limits to avoid API throttling, tracking progress for resumability, and recovering gracefully from individual document failures without restarting the entire job.

Steps

Step 1 — Chunk the document list into batches.

Load the full list of document identifiers or file paths, then split it into batches of N items, where N is tuned to the API's rate limit (e.g., 50 documents per batch for an LLM endpoint with a 1,000 requests-per-minute ceiling). Store the total batch count as total_batches.
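A minimal sketch of the chunking step, assuming the documents are identified by plain strings. The helper name `chunk` and the `doc-0000` style IDs are illustrative, not part of any library.

```python
from typing import Iterator, Sequence


def chunk(items: Sequence[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Hypothetical document IDs; in practice these would be file paths or keys.
doc_ids = [f"doc-{i:04d}" for i in range(120)]

batches = list(chunk(doc_ids, batch_size=50))
total_batches = len(batches)
```

The last batch is simply shorter when the corpus size is not a multiple of the batch size, so no document is dropped.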

Step 2 — Implement a worker function for a single batch.

Create a function that takes a batch of documents, calls the LLM endpoint for each document (or constructs a single batch prompt if the API supports batch input), and returns a list of results paired with document IDs. Catch all exceptions at the document level so a single failure does not abort the batch.
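One way the worker could look, assuming one endpoint call per document. `call_llm` is a hypothetical callable standing in for whatever client the pipeline uses, and `fake_llm` exists only to make the sketch runnable without a server.

```python
from typing import Callable


def process_batch(batch: list[str], call_llm: Callable[[str], str]) -> list[dict]:
    """Process every document in a batch; record failures instead of raising."""
    results = []
    for doc_id in batch:
        try:
            results.append({"id": doc_id, "ok": True, "output": call_llm(doc_id)})
        except Exception as exc:  # document-level catch keeps the batch alive
            results.append({"id": doc_id, "ok": False, "error": repr(exc)})
    return results


def fake_llm(doc_id: str) -> str:
    """Stand-in for a real endpoint call; fails on one specific ID."""
    if doc_id == "doc-2":
        raise TimeoutError("simulated timeout")
    return f"summary of {doc_id}"


results = process_batch(["doc-1", "doc-2", "doc-3"], fake_llm)
```

Every result carries its document ID, so a failure in the middle of the batch leaves the entries before and after it intact.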

Step 3 — Add a semaphore for rate limiting.

Use a semaphore (or equivalent concurrency primitive) to limit the number of concurrent workers to a safe maximum, such as 5 concurrent calls. This prevents the process from exhausting connection limits or hitting server-side throttling.
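The sketch below uses `asyncio.Semaphore` as the concurrency primitive, which is one option among several (a thread pool with a fixed worker count would serve the same purpose). `fake_llm` is a stand-in for a real async client call, and the `peak` counter is there only to show that the limit holds.

```python
import asyncio

MAX_CONCURRENT = 5  # the example ceiling mentioned in this step


async def run_limited(doc_ids, call_llm, limit=MAX_CONCURRENT):
    """Run call_llm for every document, never more than `limit` at once."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def guarded(doc_id):
        nonlocal in_flight, peak
        async with semaphore:  # waits here while `limit` calls are active
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return doc_id, await call_llm(doc_id)
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(guarded(d) for d in doc_ids))
    return results, peak


async def fake_llm(doc_id):
    """Stand-in for a real async endpoint call."""
    await asyncio.sleep(0.01)
    return f"summary of {doc_id}"


results, peak = asyncio.run(run_limited([f"doc-{i}" for i in range(20)], fake_llm))
```

`asyncio.gather` returns results in the order the documents were submitted, which keeps the pairing with document IDs straightforward.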

Step 4 — Track progress state.

Maintain a lightweight state file (JSON) that records the current batch index, completed document IDs, and a failed-document list. Update this file after each batch. On restart, read the state file and resume from the last completed batch.
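A possible shape for the state file, assuming the keys `next_batch`, `completed`, and `failed` (these names are illustrative). Writing to a temporary file and renaming it with `os.replace` is an added precaution so that a crash mid-write does not leave a truncated state file.

```python
import json
import os
import tempfile
from pathlib import Path


def load_state(path: Path) -> dict:
    """Return saved state, or a fresh state if no file exists yet."""
    if path.exists():
        return json.loads(path.read_text())
    return {"next_batch": 0, "completed": [], "failed": []}


def save_state(path: Path, state: dict) -> None:
    """Write to a temp file, then rename it over the real one."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    os.replace(tmp, path)


state_path = Path(tempfile.mkdtemp()) / "state.json"
batches = [["a", "b"], ["c", "d"], ["e"]]

# First run: finish two batches, then stop as if the process had been killed.
state = load_state(state_path)
for index in range(state["next_batch"], 2):
    state["completed"].extend(batches[index])
    state["next_batch"] = index + 1
    save_state(state_path, state)

# Restart: a fresh load points at the first unfinished batch.
resumed = load_state(state_path)
```

Because the state is saved only after a batch finishes, a batch that was interrupted halfway is processed again in full on restart, so downstream writes should tolerate or deduplicate repeated IDs.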

Step 5 — Handle individual document failures.

If a document fails (exception, timeout, or non-200 response), append it to a retry queue with a retry count. After three failed attempts, move it to a dead-letter file and continue. Do not let failures propagate to other documents in the same batch.
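A sketch of the retry queue, assuming failures surface as exceptions. For brevity the dead-letter entries are collected in a list; a real run would append them to a file such as JSON Lines. `flaky_llm` is a hypothetical endpoint that fails in controlled ways.

```python
from collections import deque

MAX_ATTEMPTS = 3  # the three-attempt ceiling described in this step


def run_with_retries(doc_ids, call_llm):
    """Retry failed documents up to MAX_ATTEMPTS, then dead-letter them."""
    queue = deque((doc_id, 0) for doc_id in doc_ids)
    succeeded, dead_letter = {}, []
    while queue:
        doc_id, attempts = queue.popleft()
        try:
            succeeded[doc_id] = call_llm(doc_id)
        except Exception as exc:
            attempts += 1
            if attempts >= MAX_ATTEMPTS:
                dead_letter.append(
                    {"id": doc_id, "attempts": attempts, "error": repr(exc)}
                )
            else:
                queue.append((doc_id, attempts))  # retry after the others
    return succeeded, dead_letter


calls = {}


def flaky_llm(doc_id):
    """Stand-in endpoint: 'bad' always fails, 'flaky' fails once."""
    calls[doc_id] = calls.get(doc_id, 0) + 1
    if doc_id == "bad":
        raise RuntimeError("always fails")
    if doc_id == "flaky" and calls[doc_id] < 2:
        raise TimeoutError("fails once")
    return f"summary of {doc_id}"


succeeded, dead_letter = run_with_retries(["good", "flaky", "bad"], flaky_llm)
```

Re-queuing at the back means a failing document is retried only after the healthy ones ahead of it have had their turn.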

Step 6 — Aggregate and store results.

After all batches complete, merge the per-document results with their original metadata. Write the output to a structured file (JSON Lines or Parquet). Log a summary: total processed, total succeeded, total failed, total elapsed time, and average latency per document.
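The aggregation step might look like the following, assuming each result dict carries `id`, `ok`, and a `latency_s` field measured by the worker (these field names are assumptions). The sketch writes JSON Lines with the standard library; Parquet would need an additional package.

```python
import json
import tempfile
from pathlib import Path


def write_results(results, metadata, out_path):
    """Merge results with metadata by ID, write JSON Lines, return a summary."""
    succeeded = failed = 0
    total_latency = 0.0
    with open(out_path, "w", encoding="utf-8") as handle:
        for result in results:
            record = {**metadata.get(result["id"], {}), **result}
            handle.write(json.dumps(record) + "\n")
            if result["ok"]:
                succeeded += 1
            else:
                failed += 1
            total_latency += result["latency_s"]
    total = succeeded + failed
    return {
        "processed": total,
        "succeeded": succeeded,
        "failed": failed,
        "avg_latency_s": total_latency / total if total else 0.0,
    }


# Hypothetical results and metadata for two documents.
results = [
    {"id": "doc-1", "ok": True, "output": "summary", "latency_s": 0.5},
    {"id": "doc-2", "ok": False, "error": "timeout", "latency_s": 1.5},
]
metadata = {"doc-1": {"source": "a.pdf"}, "doc-2": {"source": "b.pdf"}}

out_path = Path(tempfile.mkdtemp()) / "results.jsonl"
summary = write_results(results, metadata, out_path)
```

Total elapsed time is left out here; wrapping the whole run in two `time.monotonic()` readings is one simple way to capture it.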

Step 7 — Run with dry-run validation first.

Before processing the full corpus, run the pipeline on a 10-document sample. Verify output schema, latency, and error handling. Adjust batch size and concurrency settings based on observed throttling behavior.
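A small sketch of the dry-run setup: pick a reproducible 10-document sample and check each result against the expected keys. `REQUIRED_KEYS` is an assumed schema; replace it with whatever fields the real pipeline emits.

```python
import random

REQUIRED_KEYS = {"id", "ok"}  # assumed output schema for this sketch


def pick_sample(doc_ids, size=10, seed=0):
    """Return a reproducible random sample of document IDs."""
    rng = random.Random(seed)
    return rng.sample(doc_ids, min(size, len(doc_ids)))


def schema_errors(results):
    """Return the results that are missing any required key."""
    return [r for r in results if not REQUIRED_KEYS.issubset(r)]


doc_ids = [f"doc-{i}" for i in range(500)]
sample = pick_sample(doc_ids)

# Hypothetical dry-run output: one well-formed result, one malformed.
bad_results = schema_errors([{"id": "doc-1", "ok": True}, {"id": "doc-2"}])
```

Fixing the seed means the same ten documents are selected on every dry run, which makes before-and-after comparisons of latency and output meaningful.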

  • Confirm the local starting state. Print the active binary, package version, model name, or configuration path before changing the workflow.

  • Run the smallest complete path. Execute the minimum command or script that proves the guide works end to end on the local machine.

  • Compare against expected output. Check the final line, status code, generated artifact, or model response against the verification section before expanding the setup.

  • Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.

Verification

  • Process a corpus of 500 documents. Confirm the output file contains exactly 500 result entries with valid document IDs.
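  • Kill the process after 200 documents. Restart. Confirm processing resumes from the first batch not recorded as complete in the state file (document 201 if the kill landed on a batch boundary; batch 5 with 50-document batches) and the output file grows to 500 entries with no duplicates.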
  • Introduce a 5% failure rate in the LLM endpoint. Confirm failures are captured in the dead-letter file and do not stop processing of healthy documents.
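For the failure-injection check, one option is to wrap the endpoint call in a function that raises at a fixed rate. The wrapper name `with_injected_failures` and the stand-in `fake_llm` are hypothetical; the seeded generator keeps the run repeatable.

```python
import random


def with_injected_failures(call_llm, rate, seed=0):
    """Wrap call_llm so that roughly `rate` of calls raise an error."""
    rng = random.Random(seed)

    def wrapped(doc_id):
        if rng.random() < rate:
            raise RuntimeError(f"injected failure for {doc_id}")
        return call_llm(doc_id)

    return wrapped


def fake_llm(doc_id):
    """Stand-in for a real endpoint call."""
    return f"summary of {doc_id}"


flaky = with_injected_failures(fake_llm, rate=0.05)

succeeded, failed = [], []
for i in range(500):
    doc_id = f"doc-{i}"
    try:
        flaky(doc_id)
        succeeded.append(doc_id)
    except RuntimeError:
        failed.append(doc_id)
```

Every document ends up in exactly one of the two lists, which is the property the verification step asks for: failures are recorded and healthy documents keep flowing.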

Common failures

  • No progress tracking: Without a state file, a restart reprocesses every document from the beginning. Always persist state between batches.
  • Burst rate limiting: Even if the average rate is within limits, a burst of concurrent calls can trigger server-side 429 errors. Lower the semaphore limit and add a short jitter delay between batches.
  • Result mismatches: If the output order does not match the input order or document IDs are dropped, the final dataset is corrupt. Validate ID coverage with a set comparison after the run.
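The ID coverage check for result mismatches can be sketched as a set comparison plus a duplicate count. The function name `id_coverage` and the sample IDs are illustrative.

```python
from collections import Counter


def id_coverage(input_ids, output_ids):
    """Report IDs that are missing, unexpected, or duplicated in the output."""
    expected, seen = set(input_ids), set(output_ids)
    counts = Counter(output_ids)
    return {
        "missing": sorted(expected - seen),
        "unexpected": sorted(seen - expected),
        "duplicates": sorted(i for i, n in counts.items() if n > 1),
    }


# Hypothetical run: doc-3 was dropped, doc-2 was written twice.
report = id_coverage(
    ["doc-1", "doc-2", "doc-3"],
    ["doc-1", "doc-2", "doc-2", "doc-9"],
)
```

A clean run produces three empty lists; anything else points at the stage that dropped, invented, or repeated a document.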

Related guides

  • How to Implement Vector Search with Metadata Filtering — the output from batch processing is often fed into a vector index
  • How to Set Up Model Fallback Chains (Local to Cloud) — provides a reliable LLM endpoint for batch workers