Matryoshka-ICD

Problem

ICD (International Classification of Diseases) coding is the process of translating clinical narratives into standardised diagnosis codes. Hospital encounters, radiology reports, and discharge summaries all have to be coded before a claim can be filed. The work is expensive, slow, and error-prone: trained medical coders do it manually, and miscoding directly costs hospitals money and distorts health statistics.

The task is hard to automate for three reasons: clinical text is noisy and heavily abbreviated, a single report can carry several codes at once (multi-label), and the ICD taxonomy contains thousands of fine-grained codes separated by subtle distinctions.

Dataset

I used MIMIC-CXR, a large publicly available dataset of de-identified chest X-ray radiology reports from Beth Israel Deaconess Medical Center. Each report is paired with one or more ICD codes describing the findings.

The dataset pipeline expands each multi-label row into anchor-positive pairs, one per code: the report text is the anchor (query) and the code's ICD semantic description is the positive. These pairs feed a contrastive bi-encoder training objective.
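A minimal sketch of that expansion step, assuming each row is a (report text, list of codes) tuple and that descriptions live in a simple lookup table. The function name, the row layout, and the two example codes are illustrative assumptions, not the project's actual pipeline.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical code -> description lookup, used only for illustration.
# A real pipeline would load the full ICD description table.
ICD_DESCRIPTIONS: Dict[str, str] = {
    "J98.11": "Atelectasis",
    "J90": "Pleural effusion, not elsewhere classified",
}


def expand_to_pairs(
    rows: Sequence[Tuple[str, List[str]]],
    descriptions: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Turn each (report, [codes]) row into one (anchor, positive) pair per code."""
    pairs: List[Tuple[str, str]] = []
    for report_text, codes in rows:
        for code in codes:
            description = descriptions.get(code)
            if description is None:
                continue  # skip codes that have no description to encode
            pairs.append((report_text, description))
    return pairs


rows = [
    ("Bibasilar atelectasis with small left pleural effusion.", ["J98.11", "J90"]),
    ("Left basilar opacity, likely atelectasis.", ["J98.11"]),
]
pairs = expand_to_pairs(rows, ICD_DESCRIPTIONS)
for anchor, positive in pairs:
    print(f"{positive!r} <- {anchor!r}")
```

A report with two codes contributes two pairs that share the same anchor, so the same text appears in the training set once per label.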

Approach

The core architecture is a contrastive bi-encoder:

Query encoder: encodes the clinical report
Key encoder: encodes the ICD code’s semantic description
At inference, retrieval is pure dot product similarity over precomputed key embeddings — fast and scalable

This cleanly separates encoding from retrieval. Because codes are represented by their text descriptions, the model doesn’t need to see every code at training time: it learns a shared embedding space in which clinical text and code descriptions lie close together when they are semantically related.
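The retrieval step can be sketched in a few lines of NumPy. This assumes the key embeddings have already been computed and L2-normalised; random vectors stand in for real encoder outputs, and `retrieve_top_k` is a hypothetical helper name.

```python
import numpy as np


def retrieve_top_k(query_emb: np.ndarray, key_embs: np.ndarray, k: int = 5):
    """Score every code with one matrix-vector product and return the k best."""
    scores = key_embs @ query_emb            # shape: (num_codes,)
    k = min(k, scores.shape[0])
    top = np.argsort(-scores)[:k]            # indices sorted by descending score
    return top, scores[top]


# Stand-in data: 1000 "code description" embeddings of width 768.
rng = np.random.default_rng(0)
key_embs = rng.normal(size=(1000, 768)).astype(np.float32)
key_embs /= np.linalg.norm(key_embs, axis=1, keepdims=True)

# A "report" embedding that sits close to code 42 in the shared space.
query_emb = key_embs[42] + 0.01 * rng.normal(size=768).astype(np.float32)
query_emb /= np.linalg.norm(query_emb)

idx, scores = retrieve_top_k(query_emb, key_embs, k=5)
print(idx, scores)
```

Only the query has to pass through the encoder at request time; the key matrix is built once and reused, and new codes can be added by encoding their descriptions and appending rows.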

Key decisions

Why BioClinical-ModernBERT?

Standard BERT was pretrained on Wikipedia and BookCorpus, which contain essentially no clinical language. BioClinicalBERT improved on this with further pretraining on MIMIC-III clinical notes, but its 512-token limit is a hard constraint for longer reports.

ModernBERT is a 2024 re-architecture of BERT with Flash Attention 2, a much longer context window (8192 tokens), and significantly better benchmark performance. BioClinical-ModernBERT brings that architecture to the clinical domain. Longer reports no longer need to be truncated, and attention is computed more efficiently.

Why Matryoshka Representation Learning?

Standard embeddings have a fixed dimension. If you train a 768d model, you’re stuck using 768d at inference. Matryoshka Representation Learning (MRL) trains a single model to produce embeddings that are simultaneously meaningful at multiple nested dimensions — [64, 128, 256, 768] in this project.

The key insight: the first 64 dimensions encode a coarse representation; the first 128 encode a finer one; and so on. You can truncate to any prefix and still get a usable embedding.

For ICD coding this matters because:

Fast first-pass retrieval can use 64d to filter candidates
Re-ranking uses the full 768d for precision
No retraining needed when switching dimension — one model, multiple operating points
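The coarse-to-fine pattern above can be sketched as follows. The shortlist size of 50 and the helper names are assumptions chosen for illustration, and random vectors stand in for trained Matryoshka embeddings. Truncated prefixes are re-normalised so that dot products remain cosine similarities.

```python
import numpy as np


def truncate_and_normalise(embs: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` coordinates and rescale each vector to unit length."""
    prefix = embs[..., :dim]
    norms = np.linalg.norm(prefix, axis=-1, keepdims=True)
    return prefix / np.clip(norms, 1e-12, None)


def two_stage_retrieve(query, keys, coarse_dim=64, shortlist=50, k=5):
    """Filter with a short prefix, then re-rank the survivors at full width."""
    # Stage 1: cheap scoring over all codes using only the first coarse_dim dims.
    # In a deployed system the truncated key matrix would be precomputed.
    q_coarse = truncate_and_normalise(query, coarse_dim)
    k_coarse = truncate_and_normalise(keys, coarse_dim)
    shortlist = min(shortlist, keys.shape[0])
    candidates = np.argsort(-(k_coarse @ q_coarse))[:shortlist]

    # Stage 2: full-dimension scoring restricted to the shortlist.
    full_dim = keys.shape[-1]
    q_full = truncate_and_normalise(query, full_dim)
    k_full = truncate_and_normalise(keys[candidates], full_dim)
    fine_scores = k_full @ q_full
    order = np.argsort(-fine_scores)[:k]
    return candidates[order], fine_scores[order]


rng = np.random.default_rng(1)
keys = rng.normal(size=(500, 768))
query = keys[7] + 0.05 * rng.normal(size=768)   # a query near code 7

top_idx, top_scores = two_stage_retrieve(query, keys)
print(top_idx, top_scores)
```

The first stage touches every code but only 64 numbers per code; the second stage uses all 768 dimensions but only for the shortlist.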

The MRL loss is a weighted sum of contrastive losses at each nesting dimension, computed simultaneously during a single forward pass.
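A forward-pass-only sketch of such a loss in NumPy. It assumes an InfoNCE objective with in-batch negatives, a temperature of 0.05, and uniform weights; the text above only says "contrastive" and "weighted", so these specifics are assumptions. Actual training would use an autograd framework so that gradients flow back to the encoders.

```python
import numpy as np


def info_nce(q: np.ndarray, k: np.ndarray, temperature: float = 0.05) -> float:
    """In-batch contrastive loss: row i of q should match row i of k."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    k = k / np.linalg.norm(k, axis=1, keepdims=True)
    logits = (q @ k.T) / temperature                      # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # positives on the diagonal


def matryoshka_loss(q, k, dims=(64, 128, 256, 768), weights=None, temperature=0.05):
    """Weighted sum of the same contrastive loss applied to each nested prefix."""
    if weights is None:
        weights = [1.0] * len(dims)                       # assumed uniform weighting
    return sum(
        w * info_nce(q[:, :d], k[:, :d], temperature)
        for d, w in zip(dims, weights)
    )


rng = np.random.default_rng(2)
q = rng.normal(size=(8, 768))                      # stand-in query embeddings
k_aligned = q + 0.1 * rng.normal(size=(8, 768))    # keys that match their queries
k_shuffled = np.roll(k_aligned, 1, axis=0)         # keys paired with the wrong query

loss_aligned = matryoshka_loss(q, k_aligned)
loss_shuffled = matryoshka_loss(q, k_shuffled)
print(loss_aligned, loss_shuffled)
```

Both encoders run once per batch; each nesting dimension only slices the already-computed embeddings, so the extra cost of the nested losses is a handful of small matrix products.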

Label-Aware Attention (LAA)

Standard mean-pooling aggregates the full sequence into one vector. For multi-label coding, different parts of the report are evidence for different codes. LAA adds a per-label attention mechanism that learns to focus on the clinical tokens most relevant to each candidate code — extracting targeted evidence rather than blending everything.
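The contrast with mean-pooling can be sketched as below. This is a generic per-label attention pooling, assuming one learned query vector per label and scaled dot-product scoring; the exact LAA formulation used in this project is not specified above, so treat the shapes and scoring function as assumptions.

```python
import numpy as np


def mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """One vector per report: the average of all non-padding token states."""
    m = mask[:, None].astype(hidden.dtype)
    return (hidden * m).sum(axis=0) / np.clip(m.sum(), 1.0, None)


def label_aware_attention(hidden, mask, label_queries):
    """One vector per label: each label attends over the tokens separately."""
    d = hidden.shape[-1]
    scores = (label_queries @ hidden.T) / np.sqrt(d)      # (num_labels, seq_len)
    scores = np.where(mask[None, :] > 0, scores, -1e9)    # ignore padding tokens
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ hidden, weights                      # (num_labels, d), (num_labels, seq_len)


rng = np.random.default_rng(3)
hidden = rng.normal(size=(6, 16))          # 6 token states of width 16 (stand-in values)
mask = np.array([1, 1, 1, 1, 0, 0])        # last two positions are padding
label_queries = rng.normal(size=(3, 16))   # one hypothetical learned query per label

pooled = mean_pool(hidden, mask)
per_label, attn = label_aware_attention(hidden, mask, label_queries)
print(pooled.shape, per_label.shape)
```

Mean-pooling returns a single vector of width 16, while the label-aware variant returns three, each weighted toward the tokens its label query scores highest.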

Ablation study

Three variants were evaluated to isolate the contribution of each component:

Variant	Description
LAA	Full model with Label-Aware Attention
Standard Attention	Same architecture, mean-pool instead of LAA
Retrieval Bi-Encoder	No LAA, standard contrastive training

Evaluation metrics: Micro-F1, ROC-AUC, Precision@5. Experiment tracking via Weights & Biases.
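For reference, two of these metrics can be computed directly from a multi-hot label matrix and a score matrix. The sketch below uses toy values that are not results from this project, and the 0.5 decision threshold is an assumption. ROC-AUC is omitted here; a library routine such as scikit-learn's `roc_auc_score` would be the usual choice.

```python
import numpy as np


def micro_f1(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """F1 computed from true/false positives pooled over all reports and codes."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


def precision_at_k(y_true: np.ndarray, scores: np.ndarray, k: int = 5) -> float:
    """Fraction of the k highest-scoring codes per report that are correct."""
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = np.take_along_axis(y_true, top_k, axis=1)
    return float(hits.mean())


# Toy example: 2 reports, 6 candidate codes (values are made up).
y_true = np.array([[1, 0, 1, 0, 0, 0],
                   [0, 1, 0, 0, 0, 1]])
scores = np.array([[0.90, 0.20, 0.40, 0.10, 0.30, 0.05],
                   [0.10, 0.80, 0.60, 0.20, 0.05, 0.70]])
y_pred = (scores >= 0.5).astype(int)       # assumed decision threshold

f1 = micro_f1(y_true, y_pred)              # 0.75 for this toy data
p_at_5 = precision_at_k(y_true, scores, k=5)   # 0.4 for this toy data
print(f1, p_at_5)
```

Precision@5 is bounded by the number of true codes per report: a report with two correct codes can score at most 0.4, which is worth keeping in mind when comparing variants.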

Reflection

The most interesting tension in this project was between retrieval efficiency and clinical precision. MRL resolves it elegantly: instead of fixing an operating point at training time, you choose it at inference. That’s a genuine engineering win for production medical systems, where latency and accuracy requirements shift with the deployment context.

LAA adds complexity, but it addresses a real structural property of clinical text: in a chest X-ray report, a finding such as atelectasis is typically localised to a specific sentence rather than spread across the whole document. Making the model aware of that locality is a sensible inductive bias.