Voice to Data: LLM Pipeline for Lab Notebooks

Python
LLM
NLP
MLOps
Lab Automation
A voice-driven pipeline that turns laboratory bench recordings into structured, validated records — ASR, LLM extraction, domain validation, and a GxP-compliant audit trail. Built for pharma and biotech R&D.
Published February 22, 2026

Overview

Scientists in formulation, analytical, and cell culture labs spend 20–40 minutes per bench session transcribing observations into ELN/LIMS systems — after the experiment, from memory, under time pressure. This pipeline removes the keyboard from the critical path.

The pipeline takes an audio file recorded at the bench and produces a structured, validated JSON record ready for ELN/LIMS handoff, with an immutable audit trail that satisfies 21 CFR Part 11 requirements.

All processing runs locally — no audio or text leaves your network, which is the only defensible configuration for regulated R&D environments.


Architecture

Voice (audio file)
     │
     ▼
┌──────────────────────────────┐
│  Stage 1 — ASR               │
│  faster-whisper (local)      │
└──────────────┬───────────────┘
               │
               ▼
┌──────────────────────────────┐
│  Stage 2 — Extraction        │
│  Instructor + Pydantic       │
│  + Ollama (local LLM)        │
└──────────────┬───────────────┘
               │
               ▼
┌──────────────────────────────┐
│  Stage 3 — Validation        │
│  DomainValidator             │
│  (plausibility + mandatory)  │
└──────────────┬───────────────┘
               │
               ▼
┌──────────────────────────────┐
│  Stage 4 — Serialization     │
│  JSON + AuditRecord          │
└──────────────────────────────┘
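The four stages above compose into a simple linear pipeline. A minimal orchestration sketch, with hypothetical stub stages standing in for the real wrappers (in production, `transcribe` would call faster-whisper, `extract` would call instructor + Ollama, and so on):

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical stage stubs: in the real pipeline these wrap faster-whisper
# (ASR), instructor + ollama (extraction), the DomainValidator, and JSON
# serialization respectively.
def transcribe(audio_path: str) -> str:
    return "pH 7.4, viscosity approximately 4200 mPa·s"  # stub transcript

def extract(transcript: str) -> dict[str, Any]:
    return {"ph": 7.4, "viscosity_mpas": 4200, "hedged": True}  # stub record

def validate(record: dict[str, Any]) -> list[str]:
    return ["viscosity hedged -> route to review"]  # stub domain flags

@dataclass
class PipelineResult:
    transcript: str
    record: dict[str, Any]
    flags: list[str] = field(default_factory=list)

def run_pipeline(audio_path: str) -> PipelineResult:
    transcript = transcribe(audio_path)   # Stage 1: ASR
    record = extract(transcript)          # Stage 2: LLM extraction
    flags = validate(record)              # Stage 3: domain validation
    return PipelineResult(transcript, record, flags)  # Stage 4 serializes this
```

Keeping the stages as plain functions with typed inputs and outputs makes each one independently testable and swappable.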

Stack

Layer                       Tool                           Rationale
ASR                         faster-whisper (CTranslate2)   Open-weights, 2–4× faster than PyTorch Whisper, CPU-capable
LLM structured extraction   instructor + ollama            Guaranteed schema-conformant output via tool-calling — no JSON-parsing brittleness
Schema & validation         pydantic v2                    Type enforcement + domain plausibility checks in one place
Dependency management       uv                             Fast, reproducible, lockfile-based
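The ASR stage is a thin wrapper around faster-whisper. A minimal sketch, with illustrative model size and quantization settings (the import is deferred so the rest of the snippet stands alone without the library installed):

```python
def transcribe_bench_audio(audio_path: str) -> str:
    """Stage 1: local ASR with faster-whisper (CPU, int8 quantization)."""
    from faster_whisper import WhisperModel  # deferred import; model runs locally
    model = WhisperModel("small", device="cpu", compute_type="int8")
    segments, _info = model.transcribe(audio_path)
    # Segments are streamed lazily; join them into one transcript string.
    return " ".join(seg.text.strip() for seg in segments)
```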

Key design decisions

Why Instructor over raw LLM JSON? Asking an LLM to “return JSON” produces unreliable free-form output. Instructor uses the model’s tool-calling mode to force every field of the Pydantic schema to be explicitly populated. Missing fields become null rather than hallucinated values — which matters a great deal in a compliance context.

Two-tier validation: Pydantic enforces structural correctness (types, enums). A separate DomainValidator class enforces scientific plausibility (pH ∈ [0, 14], viscosity ranges, unit whitelists). These are kept separate intentionally: structural failures are hard blocks; domain flags are soft and route the record to human review.
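The hard/soft split can be sketched as follows. The class shape, field names, and the plausibility ranges are illustrative, not the project's actual values:

```python
from dataclasses import dataclass

@dataclass
class ValidationResult:
    hard_errors: list[str]  # impossible values: block the record outright
    soft_flags: list[str]   # implausible values: route to human review

class DomainValidator:
    """Illustrative sketch of the plausibility tier. Pydantic has already
    guaranteed types and enums; this layer checks scientific ranges."""
    PH_RANGE = (0.0, 14.0)
    VISCOSITY_MPAS_PLAUSIBLE = (0.1, 1_000_000.0)   # example range
    ALLOWED_UNITS = {"mPa·s", "mg/mL", "°C"}        # example whitelist

    def validate(self, record: dict) -> ValidationResult:
        hard, soft = [], []
        ph = record.get("ph")
        if ph is not None and not (self.PH_RANGE[0] <= ph <= self.PH_RANGE[1]):
            hard.append(f"pH {ph} outside [0, 14]")  # physically impossible
        visc = record.get("viscosity_mpas")
        if visc is not None:
            lo, hi = self.VISCOSITY_MPAS_PLAUSIBLE
            if not (lo <= visc <= hi):
                soft.append(f"viscosity {visc} mPa·s implausible")  # review
        unit = record.get("unit")
        if unit is not None and unit not in self.ALLOWED_UNITS:
            hard.append(f"unit {unit!r} not in whitelist")
        return ValidationResult(hard, soft)
```

Returning both lists, rather than raising on the first failure, lets the caller decide per-record whether to block, review, or commit.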

Hedged values as a first-class concept: When a scientist says “approximately 4,200 millipascal seconds”, the hedged=True flag is set in the schema and the record goes to the review queue — not committed blindly. This is a deliberate design choice, not a limitation.
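One way to set that flag is a transcript-side check in addition to the LLM's own judgment. A minimal sketch with an illustrative cue list (a production list would be curated per lab and per language):

```python
import re

# Illustrative hedge cues; extend per lab vocabulary.
HEDGE_CUES = re.compile(
    r"\b(approximately|about|around|roughly|circa|nearly)\b",
    re.IGNORECASE,
)

def mark_hedged(transcript: str, record: dict) -> dict:
    """Set hedged=True when the transcript contains uncertainty language,
    routing the record to the review queue instead of auto-commit."""
    record = dict(record)  # do not mutate the caller's record
    record["hedged"] = bool(HEDGE_CUES.search(transcript))
    return record
```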


Compliance notes

  • Audit trail: every record carries a SHA-256 hash of the raw transcript, model version strings, an extraction timestamp, and a review status — everything a 21 CFR Part 11 audit needs
  • Model version-locking: models are pinned; updates are treated as change control events requiring requalification
  • Temperature = 0: extraction calls are as deterministic as the model allows
  • Local-only: no cloud API calls, no data processing agreement required
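The audit-trail fields above reduce to a small immutable record. A minimal sketch (the `AuditRecord` name and field layout are illustrative):

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)  # frozen: the audit record is immutable once built
class AuditRecord:
    transcript_sha256: str  # hash of the raw transcript, not the extraction
    asr_model: str          # pinned version string (change-control event to update)
    llm_model: str
    extracted_at: str       # UTC timestamp
    review_status: str

def build_audit_record(raw_transcript: str, asr_model: str,
                       llm_model: str) -> AuditRecord:
    return AuditRecord(
        transcript_sha256=hashlib.sha256(
            raw_transcript.encode("utf-8")).hexdigest(),
        asr_model=asr_model,
        llm_model=llm_model,
        extracted_at=datetime.now(timezone.utc).isoformat(),
        review_status="pending_review",  # flips only after human sign-off
    )
```

Hashing the raw transcript (rather than the extracted record) means any later dispute can be resolved against the exact text the LLM saw.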
