CWoMP

Morpheme Representation Learning for Interlinear Glossing

Morris Alper*1 Enora Rice*2 Bhargav Shandilya*2 Alexis Palmer2 Lori Levin1

*Equal contribution  ·  1Carnegie Mellon University  ·  2University of Colorado Boulder

arXiv 2026

CWoMP teaser figure

TL;DR: We propose CWoMP, which treats morphemes as atomic form-meaning units with learned representations, enabling interpretable and efficient interlinear glossing that users can improve at inference time by expanding a lexicon — without retraining.


Abstract

Interlinear glossed text (IGT) is a standard notation for language documentation which is linguistically rich but laborious to produce manually. Recent automated IGT methods treat glosses as character sequences, neglecting their compositional structure…


Method Overview

CWoMP uses a two-stage architecture. The first component is a BoM encoder: a contrastively trained dual encoder (built on a pretrained multilingual encoder) that maps words-in-context and morpheme-gloss pairs into a shared embedding space. It is trained with a multi-positive InfoNCE objective to recognize which morphemes are contained in a given word, producing a discrete codebook of morpheme segments and their glosses.
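The multi-positive InfoNCE objective can be sketched as follows. This is an illustrative numpy implementation, not the paper's code: the function name is hypothetical, and averaging the per-positive InfoNCE terms is one common multi-positive variant that we assume here.

```python
import numpy as np

def multi_positive_infonce(word_emb, morpheme_embs, positive_idx, temperature=0.07):
    """Multi-positive InfoNCE sketch: pull a word-in-context embedding toward
    all of its constituent morphemes, push it away from the other candidates.

    word_emb:      (d,)   L2-normalized word-in-context embedding
    morpheme_embs: (n, d) L2-normalized candidate morpheme-gloss embeddings
    positive_idx:  indices of morphemes actually contained in the word
    """
    logits = morpheme_embs @ word_emb / temperature      # similarity scores, (n,)
    log_denom = np.log(np.sum(np.exp(logits)))           # log-sum-exp over all candidates
    # average the per-positive InfoNCE terms (one common multi-positive variant)
    return float(np.mean(log_denom - logits[positive_idx]))
```

A word with several morphemes thus contributes one softmax-style term per constituent morpheme, so the encoder learns to place every contained morpheme close to the word.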

The second component is an IGT decoder: a lightweight autoregressive transformer that generates the morpheme sequence for each input word by retrieving entries from this codebook. At each step, the decoder's hidden state is scored against all entries in a precomputed lexicon of morpheme embeddings (produced by the frozen BoM encoder), and the nearest neighbor is selected. Because predictions are constrained to codebook entries, the model cannot hallucinate unseen morpheme types. Crucially, users can expand this lexicon at any time without retraining — immediately improving coverage.
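A single constrained decoding step might look like the sketch below, again a minimal numpy illustration with hypothetical names rather than the released implementation. The key property it demonstrates is that the output is always an existing lexicon entry, so appending rows to the lexicon arrays extends coverage with no retraining.

```python
import numpy as np

def retrieve_morpheme(hidden_state, lexicon_embs, lexicon_entries):
    """One constrained decoding step: score the decoder's hidden state against
    a precomputed lexicon of morpheme embeddings and emit the nearest entry.

    hidden_state:    (d,)   decoder output at the current step
    lexicon_embs:    (n, d) frozen BoM-encoder embeddings, one per lexicon entry
    lexicon_entries: list of (segment, gloss) pairs aligned with lexicon_embs
    """
    # cosine similarity (assumes rows of lexicon_embs are L2-normalized)
    scores = lexicon_embs @ hidden_state / np.linalg.norm(hidden_state)
    return lexicon_entries[int(np.argmax(scores))]
```

Because the argmax ranges only over lexicon rows, the decoder can never emit a morpheme type absent from the lexicon; a user-added entry becomes retrievable the moment its embedding is appended.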

BoM Encoder diagram
BoM Encoder. The dual encoder learns to embed words and morphemes in a shared space via contrastive learning, minimizing distance between a word and its constituent morphemes.
IGT Decoder diagram
IGT Decoder. The autoregressive decoder predicts morphemes as form-meaning units; at each step, the output embedding is matched to its nearest neighbor in the lexicon.

Interactive Glossing Browser

Explore interlinear glossing examples across seven languages. Each example shows the original transcript, translation, morpheme segmentation, and per-method gloss predictions (GT, CWoMP, GlossLM). Prediction cells are color-coded: red = incorrect, unshaded = correct.


Results

We evaluate using Morpheme Error Rate (MER) across seven typologically diverse low-resource languages. CWoMP achieves competitive or superior performance compared to GlossLM across most languages, with consistent improvements in the extended-lexicon setting. Gains are particularly pronounced for mid- to low-resource languages: on Lezgi, CWoMP reduces MER from 0.26 to 0.14–0.19, and on Gitksan, with only 31 training examples, the extended lexicon reduces MER from 0.68 to 0.54. The extended-lexicon setting consistently outperforms the train-lexicon setting, validating the practical workflow of expanding a lexicon without retraining.
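For readers unfamiliar with the metric, MER can be computed as a WER-style edit distance over morpheme sequences. The sketch below is a generic illustration of that definition; the paper's exact metric may differ in details such as normalization or punctuation handling.

```python
def morpheme_error_rate(ref, hyp):
    """Morpheme Error Rate as morpheme-level Levenshtein distance,
    normalized by the reference length (a standard WER-style definition).

    ref, hyp: lists of morpheme glosses, e.g. ["walk", "PST"]
    """
    n, m = len(ref), len(hyp)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i                       # deletions
    for j in range(m + 1):
        dp[0][j] = j                       # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[n][m] / max(n, 1)
```

For example, predicting `["walk", "PL"]` against the reference `["walk", "PST"]` is one substitution over two reference morphemes, giving an MER of 0.5.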

MER comparison bar chart

Citation

If you use CWoMP in your work, please cite:

@misc{alper2026cwomp,
  title         = {CWoMP: Morpheme Representation Learning for Interlinear Glossing},
  author        = {Morris Alper and Enora Rice and Bhargav Shandilya and Alexis Palmer and Lori Levin},
  year          = {2026},
  eprint        = {XXXX.XXXXX},
  archivePrefix = {arXiv}
}