A tool to clean up text generated by OCR using individual words as well as their context.

301 commits | Last update: August 11, 2019

What ochre can do for you

  • Train character-based language models/LSTMs for OCR post-correction
  • Ready-to-use workflows for data preprocessing, training correction models, performing the post-correction, and analyzing (remaining) errors
  • Compare (corrected) OCR text to the gold standard based on character error rate (CER), word error rate (WER), and order-independent word error rate
  • Analyze OCR errors on the word level
  • Discover OCR post-correction data sets
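
To make the evaluation metrics above concrete, here is a minimal sketch of how CER and WER are commonly computed: the Levenshtein (edit) distance between the OCR text and the gold standard, divided by the length of the gold standard. The function names `edit_distance`, `cer`, and `wer` are hypothetical and are not taken from ochre; ochre's own workflows may tokenize or normalize the text differently. The order-independent variant is left out because its exact definition is not given here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete an element of the reference
                curr[j - 1] + 1,     # insert an element of the hypothesis
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def cer(gold, ocr):
    """Character error rate: character edits divided by gold-standard length."""
    if not gold:
        raise ValueError("gold standard text must not be empty")
    return edit_distance(gold, ocr) / len(gold)


def wer(gold, ocr):
    """Word error rate on whitespace-separated tokens (an assumed tokenization)."""
    gold_words = gold.split()
    if not gold_words:
        raise ValueError("gold standard text must contain at least one word")
    return edit_distance(gold_words, ocr.split()) / len(gold_words)


if __name__ == "__main__":
    gold = "the quick brown fox"
    ocr = "tbe quick brovvn fox"
    print(f"CER: {cer(gold, ocr):.3f}")
    print(f"WER: {wer(gold, ocr):.3f}")
```

In this example two of the four words contain a mistake, so the WER is 0.5, while the CER is much lower because only three character edits are needed across 19 gold-standard characters.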

Ochre is experimental software for cleaning up text that contains OCR mistakes. It was developed to investigate whether character-based language models can be used to correct such mistakes. In addition, ochre provides functionality for analyzing the kinds of OCR mistakes that occur in a corpus. This enables researchers to compare different OCR post-correction methods and find out which kinds of mistakes each method is good at correcting.
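
As an illustration of what analyzing the kinds of OCR mistakes can look like, the sketch below aligns an OCR string with its gold standard and counts each differing span. It uses `difflib.SequenceMatcher` from the Python standard library, which is an assumption made for brevity: it is not ochre's own analysis code, and its alignment is heuristic rather than guaranteed to be a minimal edit script. The function name `count_char_errors` is hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def count_char_errors(gold, ocr):
    """Count differing spans between gold and OCR text.

    Keys are (operation, gold_span, ocr_span) tuples, where operation is
    'replace', 'delete', or 'insert' as reported by difflib.
    """
    errors = Counter()
    # autojunk=False disables difflib's heuristic for very frequent characters.
    matcher = SequenceMatcher(None, gold, ocr, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            errors[(tag, gold[i1:i2], ocr[j1:j2])] += 1
    return errors


if __name__ == "__main__":
    pairs = [("modern", "modem"), ("the", "tbe"), ("then", "tben")]
    totals = Counter()
    for gold, ocr in pairs:
        totals.update(count_char_errors(gold, ocr))
    for (tag, gold_span, ocr_span), n in totals.most_common():
        print(f"{tag}: {gold_span!r} -> {ocr_span!r} ({n}x)")
```

Tallying such spans over a whole corpus, before and after correction, gives a rough picture of which confusions (for example "rn" read as "m") a post-correction method removes and which remain.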

  • Text analysis & natural language processing
  • Machine learning
Programming Language
  • Python
License
  • Apache-2.0
Source code

Participating organizations


  • Janneke van der Zwaan
    Netherlands eScience Center
Contact person
Janneke van der Zwaan
Netherlands eScience Center