Problem
ICD (International Classification of Diseases) coding is the process of translating clinical narratives into standardised diagnosis codes. Every hospital encounter, every radiology report, every discharge summary must be coded before a claim can be filed. This is expensive, slow, and error-prone — trained medical coders work manually, and miscoding directly costs hospitals money and distorts health statistics.
The task is hard to automate for three reasons: clinical text is noisy and abbreviated; a single report can trigger several codes at once (multi-label); and the ICD taxonomy has thousands of fine-grained codes separated by subtle distinctions.
Dataset
I used MIMIC-CXR, a large publicly available dataset of de-identified chest X-ray radiology reports from Beth Israel Deaconess Medical Center. Each report is paired with one or more ICD codes describing the findings.
The dataset pipeline expands multi-label rows into anchor-positive pairs: a report with several codes becomes one pair per code, with the clinical text as the anchor (query) and that code’s ICD semantic description as the positive. These pairs set up the contrastive bi-encoder training objective.
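The expansion step can be sketched roughly as follows. This is a minimal illustration, not the project’s actual pipeline: the row layout (`text`, `codes`), the `expand_to_pairs` helper, and the tiny description table are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Illustrative description table (assumption); a real pipeline would load
# the full ICD description file instead of three hand-picked entries.
ICD_DESCRIPTIONS: Dict[str, str] = {
    "J98.11": "Atelectasis",
    "J90": "Pleural effusion, not elsewhere classified",
    "J18.9": "Pneumonia, unspecified organism",
}


def expand_to_pairs(
    rows: List[dict], descriptions: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Turn each multi-label row into one (anchor, positive) pair per code."""
    pairs = []
    for row in rows:
        for code in row["codes"]:
            description = descriptions.get(code)
            if description is None:
                continue  # skip codes that have no description to encode
            pairs.append((row["text"], description))
    return pairs


# Hypothetical report rows, written for this example only.
rows = [
    {
        "text": "Bibasilar atelectasis with small left pleural effusion.",
        "codes": ["J98.11", "J90"],
    },
    {
        "text": "Right lower lobe consolidation concerning for pneumonia.",
        "codes": ["J18.9"],
    },
]

pairs = expand_to_pairs(rows, ICD_DESCRIPTIONS)
for anchor, positive in pairs:
    print(f"{positive!r} <- {anchor!r}")
```

The first report carries two codes, so it contributes two pairs that share the same anchor text.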
Approach
The core architecture is a contrastive bi-encoder:
- Query encoder: encodes the clinical report
- Key encoder: encodes the ICD code’s semantic description
- At inference, retrieval is pure dot product similarity over precomputed key embeddings — fast and scalable
This separates encoding from retrieval cleanly. The model doesn’t need to see all codes at training time — it learns a shared embedding space where clinical text and code descriptions are geometrically close when semantically related.
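A small sketch of the inference step, assuming the key embeddings have already been computed and stored in a dictionary. The 4-dimensional vectors and the `retrieve` helper are made up for illustration; a real index would hold full-size embeddings and would likely use a matrix multiply or an ANN library rather than a Python loop.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(
    query_vec: Sequence[float],
    key_index: Dict[str, Sequence[float]],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Score every precomputed key against the query and keep the top k."""
    scored = ((dot(query_vec, vec), code) for code, vec in key_index.items())
    return [(code, score) for score, code in heapq.nlargest(k, scored)]


# Toy precomputed key embeddings (assumption: one vector per ICD code).
key_index = {
    "J98.11": [0.9, 0.1, 0.0, 0.1],
    "J90": [0.1, 0.9, 0.1, 0.0],
    "J18.9": [0.0, 0.1, 0.9, 0.2],
}

# Stand-in for the query encoder's output on one report.
query_vec = [0.8, 0.3, 0.1, 0.0]

results = retrieve(query_vec, key_index, k=2)
print(results)
```

Only the query is encoded at request time; the key side is a lookup, which is what keeps retrieval cheap as the code set grows.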
Key decisions
Why BioClinical-ModernBERT?
Standard BERT was trained on Wikipedia and BookCorpus — no clinical language. BioClinicalBERT improved this with MIMIC-III pretraining, but its 512-token limit is a hard constraint for longer reports.
ModernBERT is a 2024 re-architecture of BERT with Flash Attention 2, a much longer context window (8192 tokens), and stronger results than the original BERT on standard benchmarks. BioClinical-ModernBERT brings that architecture to the clinical domain. Reports of up to 8192 tokens no longer need to be truncated, and attention is computed more efficiently.
Why Matryoshka Representation Learning?
Standard embeddings have a fixed dimension. If you train a 768d model, you’re stuck using 768d at inference. Matryoshka Representation Learning (MRL) trains a single model to produce embeddings that are simultaneously meaningful at multiple nested dimensions — [64, 128, 256, 768] in this project.
The key insight: the first 64 dimensions encode a coarse representation; the first 128 encode a finer one; and so on. You can truncate to any prefix and still get a usable embedding.
For ICD coding this matters because:
- Fast first-pass retrieval can use 64d to filter candidates
- Re-ranking uses the full 768d for precision
- No retraining needed when switching dimension — one model, multiple operating points
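The two operating points above can be sketched as a coarse-then-fine search. This is a toy illustration under stated assumptions: 8-dimensional vectors stand in for 768d, a 2-dimensional prefix stands in for 64d, and `two_stage_retrieve` is a hypothetical helper. Re-normalising after truncation is a common choice, not something the project description specifies.

```python
import math
from typing import Dict, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec: Sequence[float]) -> List[float]:
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def truncate(vec: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` coordinates (the Matryoshka prefix), re-normalised."""
    return l2_normalize(vec[:dim])


def two_stage_retrieve(
    query: Sequence[float],
    keys: Dict[str, Sequence[float]],
    coarse_dim: int,
    shortlist_size: int,
    k: int,
) -> List[str]:
    # Stage 1: cheap filter on the short prefix. In practice the truncated
    # keys would be precomputed once rather than rebuilt per query.
    q_coarse = truncate(query, coarse_dim)
    shortlist = sorted(
        keys,
        key=lambda code: dot(q_coarse, truncate(keys[code], coarse_dim)),
        reverse=True,
    )[:shortlist_size]

    # Stage 2: re-rank only the shortlist with the full-width embedding.
    q_full = l2_normalize(query)
    reranked = sorted(
        shortlist,
        key=lambda code: dot(q_full, l2_normalize(keys[code])),
        reverse=True,
    )
    return reranked[:k]


# Toy embeddings with placeholder code names (assumption).
keys = {
    "code_a": [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "code_b": [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    "code_c": [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
}
query = [1.0, 0.1, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]

ranked = two_stage_retrieve(query, keys, coarse_dim=2, shortlist_size=2, k=2)
print(ranked)
```

In this toy case the prefix cannot tell `code_a` from `code_b`, but it is enough to drop `code_c`; the full-width pass then separates the two survivors.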
The MRL loss is a weighted sum of contrastive losses at each nesting dimension, computed simultaneously during a single forward pass.
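As a framework-free sketch of that weighted sum: the code below assumes the per-dimension contrastive loss is InfoNCE with in-batch negatives, a temperature of 0.05, and caller-supplied weights. None of those specifics come from the project itself, and a real training loop would compute this on tensors with autograd rather than Python lists.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec: Sequence[float]) -> List[float]:
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def info_nce(
    queries: List[List[float]], keys: List[List[float]], temperature: float
) -> float:
    """Mean cross-entropy where keys[i] is the positive for queries[i]
    and every other key in the batch acts as a negative."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(queries)


def mrl_loss(
    queries: List[List[float]],
    keys: List[List[float]],
    dims: Sequence[int],
    weights: Sequence[float],
    temperature: float = 0.05,
) -> float:
    """Weighted sum of contrastive losses over nested prefixes of the
    same embeddings, so one encoder pass serves every dimension."""
    loss = 0.0
    for dim, weight in zip(dims, weights):
        q_d = [l2_normalize(q[:dim]) for q in queries]
        k_d = [l2_normalize(k[:dim]) for k in keys]
        loss += weight * info_nce(q_d, k_d, temperature)
    return loss


# Toy batch: 4-d embeddings with nested dims [2, 4] standing in for
# [64, 128, 256, 768]; uniform weights are an assumption.
queries = [[1.0, 0.0, 0.2, 0.0], [0.0, 1.0, 0.0, 0.2]]
keys = [[0.9, 0.1, 0.1, 0.0], [0.1, 0.9, 0.0, 0.1]]

aligned = mrl_loss(queries, keys, dims=[2, 4], weights=[1.0, 1.0])
mismatched = mrl_loss(queries, keys[::-1], dims=[2, 4], weights=[1.0, 1.0])
print(aligned, mismatched)
```

Swapping the keys so that each query’s positive is the wrong description drives the loss up at every nested dimension, which is the signal that pushes even the short prefixes to be discriminative.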
Label-Aware Attention (LAA)
Standard mean-pooling aggregates the full sequence into one vector. For multi-label coding, different parts of the report are evidence for different codes. LAA adds a per-label attention mechanism that learns to focus on the clinical tokens most relevant to each candidate code — extracting targeted evidence rather than blending everything.
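One common way to formulate per-label attention is to give each label its own query vector, score every token against it, and pool tokens with the resulting softmax weights. The sketch below follows that formulation as an assumption; the project’s exact LAA layer may differ, and the token states and label queries here are toy values.

```python
import math
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Sequence[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(token_states: List[List[float]]) -> List[float]:
    """Baseline: every token contributes equally, for every label."""
    n = len(token_states)
    return [sum(h[j] for h in token_states) / n for j in range(len(token_states[0]))]


def label_aware_pool(
    token_states: List[List[float]], label_queries: List[List[float]]
) -> Tuple[List[List[float]], List[List[float]]]:
    """Return one pooled vector per label plus the attention weights used."""
    pooled, attention = [], []
    width = len(token_states[0])
    for label_query in label_queries:
        # How relevant is each token to this particular label?
        weights = softmax([dot(label_query, h) for h in token_states])
        pooled.append(
            [sum(w * h[j] for w, h in zip(weights, token_states)) for j in range(width)]
        )
        attention.append(weights)
    return pooled, attention


# Toy input: 3 token states of width 2 and 2 label queries (assumption).
token_states = [[4.0, 0.0], [0.0, 4.0], [1.0, 1.0]]
label_queries = [[1.0, 0.0], [0.0, 1.0]]

pooled, attention = label_aware_pool(token_states, label_queries)
print("mean pool:", mean_pool(token_states))
print("per-label:", pooled)
```

Mean pooling returns the same blended vector regardless of the label, while the per-label version yields a different summary for each code, dominated by the tokens that label attends to.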
Ablation study
Three variants were evaluated to isolate the contribution of each component:
| Variant | Description |
|---|---|
| LAA | Full model with Label-Aware Attention |
| Standard Attention | Same architecture, mean-pool instead of LAA |
| Retrieval Bi-Encoder | No LAA, standard contrastive training |
Evaluation metrics: Micro-F1, ROC-AUC, Precision@5. Experiment tracking via Weights & Biases.
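For reference, two of these metrics are simple enough to write out by hand. The sketch below uses toy label sets and hypothetical helper names; ROC-AUC is left out because it needs per-code scores and is better taken from a library such as scikit-learn’s `roc_auc_score`.

```python
from typing import List, Sequence, Set


def micro_f1(true_sets: List[Set[str]], pred_sets: List[Set[str]]) -> float:
    """Pool true/false positives and false negatives over all reports and
    all codes, then compute a single F1 from the pooled counts."""
    tp = fp = fn = 0
    for truth, pred in zip(true_sets, pred_sets):
        tp += len(truth & pred)
        fp += len(pred - truth)
        fn += len(truth - pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def precision_at_k(
    true_sets: List[Set[str]], ranked_preds: List[Sequence[str]], k: int = 5
) -> float:
    """Fraction of the top-k retrieved codes that are correct, averaged
    over reports."""
    per_report = [
        len(set(ranked[:k]) & truth) / k
        for truth, ranked in zip(true_sets, ranked_preds)
    ]
    return sum(per_report) / len(per_report)


# Toy labels with placeholder code names (assumption).
true_sets = [{"a", "b"}, {"c"}]
pred_sets = [{"a", "d"}, {"c"}]
ranked_preds = [["a", "d", "b"], ["c", "a", "b"]]

f1 = micro_f1(true_sets, pred_sets)
p_at_2 = precision_at_k(true_sets, ranked_preds, k=2)
print(f1, p_at_2)
```

Note that Precision@5 divides by k even when a report has fewer than five true codes, so its ceiling is below 1.0 for sparsely labelled reports.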
Reflection
The most interesting tension in this project was between retrieval efficiency and clinical precision. MRL resolves it elegantly: you don’t have to commit to a fixed operating point at training time; you choose it at inference. That’s a genuine engineering win for production medical systems, where latency and accuracy requirements shift by deployment context.
LAA adds complexity but addresses a real structural property of clinical text: a chest X-ray report’s finding about atelectasis is localised to a specific sentence, not spread across the whole document. Making the model aware of that locality is the right inductive bias.
