Part of the process to train tesseract for a new language. When the character features of all the training pages have been extracted, we need to cluster them to create the prototypes. The character shape features can be clustered using the mftraining and cntraining programs:
mftraining fontfile_1.tr fontfile_2.tr ...
This will output two data files: inttemp (the shape prototypes) and pffmtable (the number of expected features for each character). (A third file called Microfeat is also written by this program, but it is not used.)
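The clustering step above can be scripted. The following is a minimal sketch, not part of the tesseract distribution: the helper names (build_mftraining_command, missing_outputs, run_mftraining) are hypothetical, and it assumes that mftraining is on the PATH and accepts the .tr files as plain positional arguments, as in the command shown above. Other tesseract versions may require additional options.

```python
import os
import subprocess

# Output files that the text above says mftraining writes.
EXPECTED_OUTPUTS = ("inttemp", "pffmtable")


def build_mftraining_command(tr_files):
    """Return the argument list for one mftraining run (hypothetical helper).

    Assumes the plain positional form: mftraining file_1.tr file_2.tr ...
    """
    if not tr_files:
        raise ValueError("at least one .tr file is required")
    # Sort so the command is the same regardless of directory listing order.
    return ["mftraining", *sorted(tr_files)]


def missing_outputs(directory="."):
    """List the expected output files that are not present in directory."""
    return [
        name
        for name in EXPECTED_OUTPUTS
        if not os.path.isfile(os.path.join(directory, name))
    ]


def run_mftraining(directory="."):
    """Cluster every .tr file in directory and report missing outputs.

    Assumption: mftraining writes its outputs to the working directory.
    """
    tr_files = [f for f in os.listdir(directory) if f.endswith(".tr")]
    subprocess.run(build_mftraining_command(tr_files), cwd=directory, check=True)
    return missing_outputs(directory)
```

Calling run_mftraining() in the directory that holds the .tr files would run the clustering once and return an empty list if both inttemp and pffmtable were produced.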
This manual page documents briefly the
tesseract is a commercial quality OCR engine originally developed at
HP between 1985 and 1995. In 1995, this engine was among the top 3 evaluated
by UNLV. It was open-sourced by HP and UNLV in 2005.