Linguistics & Decipherment

The Decipherment Frontier

Five ancient writing systems. Machine learning meets epigraphy. From Herculaneum scrolls to Maya glyphs — tracking the live edge of human knowledge recovery.

𐝂𐝄𐝀𐝋 𓃀𓈖𓌀𓅓 𐦲𐦳𐦴𐦵 𛀀𛀁𛀂𛀃 ΠΟΡΦΥΡΑ

Undeciphered

Partially decoded

Substantially decoded

Active challenge

Herculaneum Papyri

Vesuvius Challenge

🏛️ First Scroll Read · Ancient Greek · 79 CE burial

PHerc. 1667 — read in full

100%

First complete scroll reading · June 25, 2026

In 79 CE, Mount Vesuvius buried the Villa of the Papyri at Herculaneum under 20 meters of pyroclastic flow. Inside was the only intact library from the classical world — some 1,800 papyrus scrolls, carbonized but preserved in the absence of oxygen. For 275 years after their 18th-century rediscovery, the scrolls were considered unreadable without physical unrolling that would destroy them.

The Vesuvius Challenge, launched in March 2023 by computer scientist Brent Seales and entrepreneurs Nat Friedman and Daniel Gross, removed that constraint. By releasing high-resolution CT scans and an AI-based ink detection model, the challenge invited the world to read the scrolls without ever touching them, a technique known as virtual unwrapping.

In February 2024, three students (Luke Farritor, Youssef Nader, and Julian Schilliger) claimed the $700,000 Grand Prize for deciphering over 2,000 Greek characters. The first word, read by Farritor in October 2023, was ΠΟΡΦΥΡΑΣ ("purple"). The text is believed to be Philodemus's philosophical writing on pleasure and music.

In February 2025, Oxford's Bodleian scroll PHerc. 172 was imaged at the Diamond Light Source synchrotron, yielding more recoverable text than any previously scanned scroll. The first word decoded: διατροπή — "disgust."

June 25, 2026 · Breakthrough: PHerc. 1667 (Scroll 4) has been completely virtually unwrapped and read from end to end, the first Herculaneum scroll read in full without ever being opened, across roughly 22 columns of Greek text. The recovered content is a Stoic philosophical treatise on ethics (human nature, impulse, and moral progress) whose final preserved column names Aristocreon, nephew and disciple of Chrysippus, placing it in the 2nd century BCE. The scroll was scanned at the ESRF beamline BM18 in Grenoble. All data and code are open: scrollprize.org/firstscroll · preprint (PDF)

Competition Milestones

First ink detection on unopened scroll · 2023

First word: ΠΟΡΦΥΡΑΣ ("purple"), read by Farritor · Oct 2023

Grand Prize: 2,000+ characters decoded, $700k · Feb 2024

Oxford PHerc. 172 scanned, "disgust" decoded · Feb 2025

PHerc. 1667 read in full: first complete scroll, Stoic ethics, 2nd c. BCE · Jun 2026

Remaining 300+ scanned scrolls: method now proven to scale · active

Library size

~1,800 papyrus scrolls recovered. Unexcavated sections of the villa may contain thousands more.

Content recovered

PHerc. 1667: Stoic treatise on ethics naming Aristocreon (disciple of Chrysippus), 2nd c. BCE. PHerc. Paris 4: Epicurean Philodemus on pleasure. PHerc. 139: identified as Philodemus, On Gods, Book 8. Scholars hope the broader library holds lost works of Aristotle, Sappho, and Sophocles.

Technology

Phase-contrast X-ray microtomography at ESRF BM18 beamline (Grenoble); TimeSformer-based ML ink detection; automated surface segmentation (ThaumatoAnakalyptor); 3D ink segmentation directly in CT volume.

Open data

All tomographic data, unwrapped surfaces, transcriptions, and code released under Creative Commons at scrollprize.org/data and archived at ESRF. Preprint (June 2026).

AI Developments

🏛️

First full scroll reading — ESRF synchrotron + ML

PHerc. 1667 read end-to-end via phase-contrast X-ray microtomography at ESRF BM18, Grenoble. ML models detected ink on the carbonized surface; papyrologists transcribed ~22 columns. For the first time, ink is directly visible inside the 3D CT volume of PHerc. Paris 4 (Scroll 1) — independently confirming the 2023 Grand Prize reading.

Jun 2026

🤖

TimeSformer model dominates Grand Prize

Video transformer architecture applied to scroll cross-sections outperforms classical CNNs for carbon ink detection on carbonized papyrus.

2024

🔬

ThaumatoAnakalyptor: automated segmentation

New auto-segmentation pipeline replaces manual surface tracing, which cost roughly $100/cm² in labor, with fully automated segmentation, breaking a critical bottleneck.

2025

📜

Egyptian scroll crossover

Nader reports Herculaneum-trained models producing promising results on Egyptian papyri from Berlin — unexpected generalization across ancient collections.

2025

Minoan Civilization · Aegean Bronze Age

Linear A

Undeciphered · ~1,500 inscriptions · 1800–1450 BCE

Deciphered

language unknown · signs partially mapped

Linear A is the undeciphered writing system of the Minoan civilization, used across Crete and the Aegean from roughly 1800 to 1450 BCE. It is the direct ancestor of Linear B, the Mycenaean Greek script cracked by Michael Ventris in 1952, but the underlying Minoan language remains unknown and has not been convincingly linked to any known language family.

Approximately 1,500 inscriptions survive, mostly administrative tablets from palatial sites like Haghia Triada, Zakros, and Akrotiri on Santorini. The signs can be phonetically read using Linear B correspondences, but the words produced are meaningless because the language itself is unknown — a tantalizing half-reading.
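The half-reading described above can be sketched in a few lines: apply Linear B sound values to Linear A signs and see what comes out. This is a minimal illustration, not a research tool. The table holds only a handful of sound values keyed by conventional sign numbers, and the function name and the `*` marker for unmapped signs are assumptions made for this example.

```python
# Small illustrative subset of Linear B sound values, keyed by sign number.
# A real table would cover the full shared syllabary.
LINEAR_B_VALUES = {
    "AB01": "da",
    "AB02": "ro",
    "AB03": "pa",
    "AB08": "a",
    "AB81": "ku",
}


def transliterate(sign_ids, values=LINEAR_B_VALUES):
    """Render a sign sequence phonetically; unmapped signs keep their ID."""
    return "-".join(values.get(sign, f"*{sign}") for sign in sign_ids)


# Hypothetical usage: the output is pronounceable, but nothing in the
# procedure says what the word means.
print(transliterate(["AB81", "AB02"]))
print(transliterate(["AB08", "A301"]))
```

The sketch produces sound sequences only. Assigning meaning would require knowledge of the underlying language, which is exactly what is missing.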

The core obstacle is the small corpus. Statistical and ML approaches require thousands of examples to extract patterns; Linear A's 1,500 short administrative texts offer too little signal. Without a bilingual text — a Minoan Rosetta Stone — purely computational approaches face a fundamental ceiling.

Recent work by researchers including Brent Davis and Silvia Ferrara has mapped sign functions and proposed structural analyses. ML models trained on related Bronze Age scripts have attempted phonetic mapping, but no breakthrough has emerged. Linear A remains one of the last great puzzles of the ancient Mediterranean.

Linear A signs (Haghia Triada tablet HT 31)

𐝀 𐝁 𐝂 𐝃 𐝄 𐝅 𐝆 𐝇

Administrative record — content readable phonetically but semantically opaque

Corpus size

~1,500 inscriptions. Mostly short administrative tallies. No long texts. No bilingual anchor.

Key sites

Haghia Triada (largest archive), Zakros, Knossos, Akrotiri (Thera). Akrotiri was buried by the Thera eruption, which predates the destruction of the Cretan palatial sites around 1450 BCE.

Relationship to Linear B

Linear B borrowed ~80% of Linear A's signs. Ventris's 1952 decipherment of Linear B as Mycenaean Greek made Linear A phonetically readable but semantically silent.

AI challenge

Corpus too small for unsupervised learning. Approaches using cross-script transfer from Linear B show structural insights but no semantic breakthrough.

AI Approaches

🧬

Cross-script transfer learning

Models trained on Linear B, Cypro-Minoan, and Egyptian hieroglyphs applied to Linear A sign classification. Identifies functional categories but cannot assign meaning.

📊

DĀMOS database + computational analysis

The University of Oslo's DĀMOS project (Database of Mycenaean at Oslo) provides a machine-readable Linear B corpus. Statistical distribution analyses of Linear A suggest an administrative structure similar to that of Linear B.

🔗

The Rosetta Stone problem

Without a bilingual text, decipherment requires identifying a known language beneath the signs. Proposals for Luwian, Semitic, and Proto-Greek substrates all remain unverified.

Sinai Peninsula · Levant · c. 1900–1500 BCE

Proto-Sinaitic

Partially Decoded · Earliest alphabet · ~40 inscriptions

Deciphered

~40%

contested readings · ancestor of all alphabets

Proto-Sinaitic is the oldest known alphabetic writing system and the ultimate ancestor of nearly every alphabet in use today, from Latin and Greek to Arabic and Hebrew. Developed by Semitic workers in Egyptian turquoise mines at Serabit el-Khadim in the Sinai around 1900–1800 BCE, it borrowed Egyptian hieroglyphic forms but redeployed them as consonantal sound signs for a Semitic language.

The principle — the acrophonic principle — is elegant: the sign for "ox" (ʾaleph in Semitic) represents the sound /ʾ/; the sign for "house" (bayt) represents /b/. This is the origin of our letters A (inverted ox head) and B (floor plan of a house).
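The acrophonic principle is simple enough to express as a lookup: each sign pictures an object, and its sound value is the first consonant of the Semitic word for that object. The sketch below covers only the two examples given above. The table and function names are illustrative assumptions, and the transliterations are simplified.

```python
# Pictured object -> Semitic word for it (simplified transliteration).
# "\u02be" is the modifier letter used to write the glottal stop (aleph).
SIGN_WORDS = {
    "ox": "\u02bealeph",
    "house": "bayt",
}


def acrophonic_value(pictured_object):
    """Return the first consonant of the word for the pictured object."""
    return SIGN_WORDS[pictured_object][0]


def spell(pictured_objects):
    """Read a sequence of picture signs as a string of consonants."""
    return "".join(acrophonic_value(obj) for obj in pictured_objects)


print(acrophonic_value("house"))
print(spell(["house", "ox"]))
```

Because the mapping is one sign to one consonant, a couple of dozen signs are enough to write a whole language, which is what set the alphabet apart from the hieroglyphic system it borrowed from.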

Roughly 40 short inscriptions survive, mostly from Sinai but with related examples at sites across the Levant. The readings are contested — scholars agree on the basic phonetic values but disagree sharply on word identification, language affiliation, and content. Recent work by Thomas Schneider, Brian Colless, and others has proposed new readings that challenge the 20th-century consensus.

The tiny corpus and lack of long-form text mean that ML approaches are largely inapplicable here. The frontier is classical philological and epigraphic work, now accelerated by digital imaging: RTI (Reflectance Transformation Imaging) and photogrammetric 3D models reveal previously invisible signs on deteriorated surfaces.

Discovery

Found by Flinders Petrie at Serabit el-Khadim, Sinai, in 1904. First decipherment proposals by Alan Gardiner in 1916.

Descendants

Proto-Canaanite → Phoenician → Greek, Latin, Cyrillic, Arabic, Hebrew, Ethiopic — essentially every alphabet in use today.

Key inscription

Sphinx inscription from Serabit reads (possibly) "to Baalat" — a Semitic goddess. The most agreed-upon reading in the corpus.

AI role

RTI and photogrammetric imaging recover signs invisible to naked eye. ML used for sign classification and comparison across the 40-inscription corpus.

Current Research

📷

RTI + photogrammetry revival

New digital imaging at Serabit el-Khadim and the British Museum is recovering previously illegible signs, effectively expanding the corpus for the first time in decades.

🔤

Wadi el-Hol inscriptions

John and Deborah Darnell's 1993 discovery of early alphabetic inscriptions in Upper Egypt may push the script's origin back to around 1900 BCE, possibly earlier than the Sinai examples.

Indus Valley Civilization · 2600–1900 BCE

Indus Valley Script

Undeciphered · ~4,000 inscriptions · 417 distinct signs

Deciphered

language unknown · largest undeciphered corpus

The Indus Valley script is the most widely studied undeciphered writing system on Earth. Used by the Harappan civilization — which at its height encompassed more territory than Mesopotamia and Egypt combined — it appears on roughly 4,000 objects, primarily small stamp seals and pottery sherds. The inscriptions are frustratingly short, averaging just 5 signs.

The language beneath the script is unknown and unattested. The major candidates are an ancestor of the Dravidian language family (Tamil, Telugu, Kannada) or an undocumented language isolate. A minority view holds the "script" is not language at all but a logo-administrative system of pure symbols — though most scholars reject this.

The 417 distinct signs suggest a logosyllabic or logo-consonantal system. Statistical analyses by Rajesh Rao (2009) found that Indus sign sequences have conditional entropy values consistent with linguistic systems — arguing against the "non-linguistic" hypothesis. This work sparked renewed ML interest in the script.
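For readers unfamiliar with the measure, conditional entropy asks how uncertain the next sign is once the current sign is known. Rigidly ordered systems score near zero, random ones score high, and linguistic systems tend to fall in between. The sketch below is a plain maximum-likelihood estimate over adjacent sign pairs. It is an illustration of the quantity, not a reproduction of the published analysis, and the toy sequences are invented.

```python
import math
from collections import Counter


def conditional_entropy(sequences):
    """Estimate H(next sign | current sign) in bits from adjacent pairs."""
    pair_counts = Counter()
    first_counts = Counter()
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            pair_counts[(current, following)] += 1
            first_counts[current] += 1

    total_pairs = sum(pair_counts.values())
    if total_pairs == 0:
        return 0.0

    entropy = 0.0
    for (current, _following), count in pair_counts.items():
        p_pair = count / total_pairs                # P(current, next)
        p_next_given = count / first_counts[current]  # P(next | current)
        entropy -= p_pair * math.log2(p_next_given)
    return entropy


# Invented toy data: a rigid alternation versus a less predictable sequence.
rigid = [["A", "B", "A", "B"]]
mixed = [["A", "A", "A", "B", "B", "A", "B"]]
print(conditional_entropy(rigid))
print(conditional_entropy(mixed))
```

With inscriptions averaging about five signs, each text contributes only a few adjacent pairs, so estimates of this kind are sensitive to corpus size and to how rare signs are handled.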

Recent deep learning approaches, including work by Nisha Yadav and teams at TIFR, have mapped sign distributions and proposed syntactic rules. The 2021 South Asian inscription from Khirbat Hamra Ifdan in Jordan, referencing trade goods, hints at Indus-Mesopotamian contact — but no bilingual anchor has emerged.

Corpus

~4,000 inscriptions. Average length: 5 signs. Longest known: 26 signs. All short — no "Rosetta Stone" equivalent exists.

Major sites

Mohenjo-daro, Harappa, Dholavira: three of the largest Harappan cities. Dholavira's signboard inscription (10 large signs) is the most monumental.

Language candidates

Proto-Dravidian (the most widely favored hypothesis), a Munda language ancestor, or an isolate. No consensus. No living descendant language confirmed.

AI frontier

Rajesh Rao's entropy analysis (2009). TIFR computational studies. Most recently, transformer models trained on Dravidian scripts for cross-script feature comparison.

AI Approaches

📈

Entropy analysis supports linguistic nature

Rao et al. (2009) in Science reported that the conditional entropy of Indus sign sequences matches that of known languages and differs from non-linguistic systems, the strongest statistical argument so far that the script encodes language.

🔗

Dravidian computer model (Mahadevan)

Iravatham Mahadevan's concordance of 3,000+ inscriptions forms the primary computational dataset. Tamil-Brahmi comparisons identify possible loanword correspondences.

🧠

LLM attempts (2023–2024)

Multiple teams have fine-tuned LLMs on proposed Dravidian phonetic mappings. Results remain speculative — without a bilingual text, any "reading" is unfalsifiable.

Mesoamerica · 300 BCE – 1697 CE

Maya Hieroglyphs

85–90% Decoded · 🔴 AI Frontier · Logosyllabic

Deciphered

85–90%

reading dynastic histories · AI segmentation active

Maya hieroglyphic writing is the only pre-Columbian writing system substantially deciphered — and the decipherment story is one of the great intellectual dramas of the 20th century. From complete mystery in the 1950s to reading dynastic histories, political betrayals, and philosophical texts by the 1990s, the breakthrough took 40 years and required integrating art history, linguistics, epigraphy, and anthropology in ways that no single discipline could achieve alone.

The key insight came from Soviet scholar Yuri Knorozov in 1952: the Maya script is logosyllabic, combining logograms (word signs) and syllabograms (sound signs), similar in structure to Egyptian hieroglyphs or Sumerian cuneiform. Combined with Tatiana Proskouriakoff's 1959 discovery that inscriptions record historical events, not astronomy or prophecy, the script yielded its secrets rapidly.
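A small sketch shows how syllabograms spell a word. Maya scribes wrote consonant-final words with consonant-vowel signs, and the vowel of the last sign was generally left unpronounced, so ba-la-ma reads balam ("jaguar"). The function below applies only that simplified rule. Its name is an assumption for this example, and it ignores vowel-final words and the finer spelling conventions epigraphers use to infer vowel length.

```python
VOWELS = "aeiou"


def read_syllabic(syllabograms):
    """Join CV syllabograms and drop the silent final vowel (simplified)."""
    word = "".join(syllabograms)
    if word and word[-1] in VOWELS:
        return word[:-1]
    return word


# Well-known syllabic spellings, transliterated.
print(read_syllabic(["ba", "la", "ma"]))   # jaguar
print(read_syllabic(["pa", "ka", "la"]))   # the royal name Pakal
```

The same word could also be written with a single logogram, or with a logogram plus syllabic complements, which is part of why the sign inventory is so large.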

Today, roughly 85–90% of surviving glyphs can be read. Scholars like David Stuart, Stephen Houston, and Simon Martin continue to refine readings. The corpus spans stelae at Tikal, Palenque, Copan, and Yaxchilán; the four surviving codices (Dresden, Madrid, Paris, Grolier); and vast quantities of ceramic vessel texts. Each newly excavated site adds inscriptions, and Palenque's Temple of the Inscriptions alone carries one of the longest known Maya hieroglyphic texts.

The remaining 10–15% includes signs still poorly understood, regional scribal variants, and the challenge of the 1,400+ sign inventory — making this the AI frontier: not decipherment from scratch, but computational assistance with the still-opaque remainder, automated glyph segmentation, and cross-corpus pattern analysis at scale.

Active AI Projects

🎨

WVU Artful Algorithms Project

West Virginia University collaboration between Art History and Computer Science. Fine-tuning SAM (Segment Anything Model) on the Justin Kerr MayaVase database — 5,000+ rollout photographs of Maya ceramic vessels with expert-annotated glyph blocks.

2024 — ongoing

🏛️

TWKM / IDIOM Digital Corpus

Bonn-Bochum Maya epigraphy project building a TEI/XML machine-readable corpus of Classic Maya texts. The IDIOM research environment hosts a catalog of 1,400+ signs, transliterations, and a dynamic RDF database incorporating Mathews's 40-year Maya History Project dataset.

2024 — ongoing · classicmayan.org

🔤

Unicode encoding proposal

The UC Berkeley Script Encoding Initiative is working to encode Maya hieroglyphs in Unicode (tentative range U+15500–U+159FF). As of 2024 the proposal is still under development: the script's logosyllabic flexibility (infixes inside signs) challenges standard Unicode encoding models.

2016 grant — 2024 in progress

📐

Living epigraphy: new glyphs invented

Yucatec epigraphers like Eduardo Puga are creating new Maya glyphs for modern concepts — laptop, giraffe, democracy — using authentic acrophonic and phonetic conventions. The script is being extended by living practitioners for the first time in 500 years.

2024

Script type

Logosyllabic: combines logograms (word signs) with syllabograms (CV syllables). Similar structure to Egyptian hieroglyphs and Sumerian cuneiform.

Corpus size

Thousands of monumental inscriptions. Four surviving codices. Vast ceramic vessel corpus. New texts emerge with every excavation season.

Key decipherment

Knorozov (1952) proves logosyllabic structure. Proskouriakoff (1959) proves historical content. Schele, Houston, Stuart (1970s–90s) crack phonetic readings at scale.

What remains

~10–15% of signs still uncertain. Regional variants. The full Dresden Codex astronomical tables. New LiDAR discoveries at sites like Aguada Fénix contain unread texts.

mayadecipherment.com

David Stuart's active blog — the primary venue for new glyph readings. New posts appear regularly with fresh decipherments from fieldwork and museum collections.

The "Next Frontier" Framing

Logosyllabic structure proven (Knorozov) · 1952

Historical content confirmed (Proskouriakoff) · 1959

~85% readable: dynastic histories recovered · 1990s

MayaVase database digitized (Kerr / Dumbarton Oaks) · 2022

AI glyph segmentation (WVU, TWKM) · active, 2024–

Unicode encoding proposal · active, 2024–

Full corpus machine-readable, last 10–15% cracked · future

Linguistic Research Tools

Annotation

ELAN (MPI Nijmegen)

Time-aligned multi-tier annotation of audio/video. Gold standard for endangered language documentation. Exports to JSON, FLEx, Praat TextGrid. Free/open source.

Lexicon & Morphology

FieldWorks Language Explorer (FLEx)

SIL tool for lexicon, morphology, and interlinear text. Exports .flextext for web display via LingView pipeline.

Corpus Query

KorAP / Khepri

Corpus query platforms for large-scale linguistic analysis. KorAP handles billion-word corpora with complex morphosyntactic queries.
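To make the ELAN entry above concrete: its time-aligned tiers are stored as XML, with a table of time slots and annotations that point into it. The sketch below reads a heavily trimmed sample using only the Python standard library. The element and attribute names follow the EAF layout as commonly documented, but the sample omits the header and type definitions a real file carries, and the function name is an assumption for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed, illustrative sample; a real .eaf file has more sections.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
  </TIME_ORDER>
  <TIER TIER_ID="transcription">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
          TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>example utterance</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_tiers(eaf_xml):
    """Return {tier_id: [(start_ms, end_ms, text), ...]} for aligned tiers."""
    root = ET.fromstring(eaf_xml)
    # Map each time slot ID to its millisecond offset.
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
        if slot.get("TIME_VALUE") is not None
    }
    tiers = {}
    for tier in root.iter("TIER"):
        rows = []
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            text = ann.findtext("ANNOTATION_VALUE", default="")
            rows.append((start, end, text))
        tiers[tier.get("TIER_ID")] = rows
    return tiers


print(read_tiers(SAMPLE_EAF))
```

Keeping times in a separate slot table is what lets several tiers (transcription, translation, gloss) share the same boundaries in the recording.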
