DE102018119908A1

DE102018119908A1 - Optical Character Recognition (OCR) system

Info

Publication number: DE102018119908A1
Application number: DE102018119908.2A
Authority: DE
Inventors: Gerald Schreiber; Joachim Bauer; Ciprian Dinu; Richard Kolodziej; Thomas Lissowski
Original assignee: CCS Content Conversion Specialists GmbH
Current assignee: CCS Content Conversion Specialists GmbH
Priority date: 2018-08-16
Filing date: 2018-08-16
Publication date: 2020-02-20

Abstract

A system (1) for optical character recognition (OCR) from an image containing text is intended to minimize transcription errors and improve the quality of the resulting machine-coded text. To this end, it comprises:
- a text area dispatcher module (6), which is connected to at least one OCR engine (aTR1, aTR2, ... aTRn) and is configured to assign the image as input data to the OCR engine (aTR1, aTR2, ... aTRn) and to receive machine-coded texts as output data from the OCR engine (aTR1, aTR2, ... aTRn), and
- a text information management system module (14), which is configured to segment the image and the machine-coded text into pairs of corresponding image segments and text segments, to assign the pairs to a plurality of users (mTR) for review, and to receive corrected text segments from the users (mTR).

Description

The invention relates to a system for optical character recognition (OCR) from an image that contains text.

Optical character recognition (also optical character reader, OCR) is the mechanical or electronic conversion of images of typed, handwritten or printed text into machine-coded text, whether from a scanned document, a photo of a document, a scene photo (e.g. the text on signs and posters in a landscape photo) or from subtitle text superimposed on an image (e.g. from a television program).

OCR is a common method of digitizing printed texts so that they can be edited electronically, searched, stored more compactly, displayed online, and used in machine processes such as cognitive computing, machine translation, (extracted) text-to-speech, key data and text mining. OCR is a field of research in pattern recognition, artificial intelligence and computer vision.

Current OCR systems are able to achieve high recognition accuracy for most fonts and support a large number of digital image file formats. Some systems are able to reproduce formatted output that comes very close to the original page, including images, columns and other non-textual components.

Although the technology has advanced considerably in recent years, transcription errors are still unavoidable. This applies in particular to exotic fonts or poor scan quality. Handwriting can be processed by OCR only to a limited extent, with a high training effort and a high error rate. Transcription errors are sometimes simply accepted because manual correction is a tedious, time-consuming and expensive task.

The object of the present invention is therefore to provide a system for optical character recognition from an image containing text which minimizes transcription errors and improves the quality of the resulting machine-coded text.

According to the invention, this object is achieved in that the system comprises:

- a text area dispatcher module, which is connected to at least one OCR engine and is configured to assign the image as input data to the OCR engine and to receive machine-coded texts as output data from the OCR engine, and
- a text information management system module, which is configured to segment the image and the machine-coded text into pairs of corresponding image segments and text segments, to assign the pairs to a plurality of users for review, and to receive corrected text segments from the users.

The invention is based on the consideration that automatically recognized OCR results should be post-processed manually. The problem, however, is that manual post-processing of a complete document yields high quality but is laborious and expensive. Relying on volunteer users is cheaper, but the quality will be lower. In addition, problems arise regarding copyright and confidentiality. To solve these problems, the image and the text are segmented, i.e. divided into parts that belong together, with each pair containing a section of the text as it appears in the image and the machine-readable transcription of that section. These segmented pairs are then distributed among a large number of users for manual review and corrected by them.

In an advantageous embodiment, the segmentation takes place at sentence level. Such a segmentation is, on the one hand, technically easy to implement and, on the other hand, makes the review easier for the distributed users.
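Purely as an illustration, sentence-level segmentation into image/text pairs could be sketched as follows in Python. The sketch assumes that the OCR engine delivers word-level bounding boxes; the names `Word`, `SegmentPair` and `segment_by_sentence` are hypothetical and do not appear in the disclosure, and sentence ends are detected naively by terminal punctuation.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass(frozen=True)
class Word:
    text: str
    box: Box  # assumed to be provided by the OCR engine


@dataclass(frozen=True)
class SegmentPair:
    text: str  # machine-coded text of one sentence
    box: Box   # image region that shows this sentence


def union(boxes: List[Box]) -> Box:
    """Smallest box that encloses all given boxes."""
    lefts, tops, rights, bottoms = zip(*boxes)
    return (min(lefts), min(tops), max(rights), max(bottoms))


def make_pair(words: List[Word]) -> SegmentPair:
    return SegmentPair(" ".join(w.text for w in words),
                       union([w.box for w in words]))


def segment_by_sentence(words: List[Word]) -> List[SegmentPair]:
    """Group words into sentences and pair each sentence with its image region."""
    pairs, current = [], []
    for word in words:
        current.append(word)
        if word.text.endswith((".", "!", "?")):  # naive sentence boundary
            pairs.append(make_pair(current))
            current = []
    if current:  # trailing words without terminal punctuation
        pairs.append(make_pair(current))
    return pairs
```

A sentence that wraps across several lines yields one enclosing rectangle here, which may include parts of neighbouring sentences; a real implementation would probably keep one box per line.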

In a further advantageous embodiment, the system further comprises a dictionary module, the text information management system module being further configured to check the corrected text segments against the dictionary module. This achieves a double check of the text recognition: the segments automatically recognized by the OCR engine are checked manually by the users, and the users' corrections are checked again against a comprehensive dictionary, i.e. it is verified whether the words determined by the users exist in the dictionary. If this is not the case, a further check can be triggered if necessary, e.g. by another, different user.
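The dictionary check of a corrected text segment could be sketched as follows. This is a minimal sketch under the assumption that the dictionary is available as a set of lower-case words; the function names are hypothetical.

```python
import re


def unknown_words(corrected_text, dictionary):
    """Return the words of a corrected segment that are not in the dictionary."""
    # [^\W\d_]+ matches runs of letters, including non-ASCII letters such as umlauts
    words = re.findall(r"[^\W\d_]+", corrected_text)
    return [w for w in words if w.lower() not in dictionary]


def needs_second_review(corrected_text, dictionary):
    """A segment with at least one unknown word is passed on to a further user."""
    return bool(unknown_words(corrected_text, dictionary))
```

For example, with the dictionary `{"the", "quick", "fox"}`, the corrected segment "The quikc fox." would be flagged because "quikc" is not found.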

In a first embodiment, the system comprises an interface module which provides a human-machine interface on which the users can display the pairs assigned to them and enter corrections. Such an interface module can be installed on a desktop computer, for example; it receives the image and text segments and makes them available to the user on the locally generated human-machine interface. Particularly advantageously, however, the interface module is a web server module and the human-machine interface is a web interface. The system is thus designed for online access, so that users can log on to the web interface, e.g. with a name and password, and display the image and text segments there online. Corresponding input fields are provided for the correction.

In a second, alternative or additional embodiment, the system comprises an interface module which is configured to transmit a plurality of pairs, each assigned to a user, to a mobile terminal of that user and to receive corrected text segments generated on the mobile terminal. In this embodiment, the correction can be carried out offline: a suitable app is installed on the user's terminal, i.e. a smartphone, tablet, laptop or similar, which receives and stores a plurality of text segments from the interface module. The user can then carry out the functions described above for the human-machine interface offline in the app, with the app generating the human-machine interface and the interface module in this case handling the communication with the app. In this case, too, the communication advantageously takes place via the Internet, so that here the interface module is likewise advantageously designed as a web server module. The corrections are stored in the app and transmitted collectively to the interface module at a later time. This allows users to work even when there is no continuous online connectivity.
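The offline workflow on the terminal could look roughly like the following sketch. The class `OfflineQueue` and the `send` callable, which stands in for the transmission to the interface module, are assumptions made for illustration only.

```python
class OfflineQueue:
    """Holds assigned pairs locally and collects corrections until upload."""

    def __init__(self, pairs):
        self.pending = dict(pairs)  # pair id -> machine-coded text
        self.corrections = {}       # pair id -> corrected text

    def correct(self, pair_id, text):
        """Record a correction while the terminal is offline."""
        if pair_id not in self.pending:
            raise KeyError(pair_id)
        self.corrections[pair_id] = text

    def flush(self, send):
        """Upload all stored corrections in one batch; return how many were sent.

        `send` is a hypothetical callable that transmits a list of
        (pair id, corrected text) tuples to the interface module.
        """
        batch = sorted(self.corrections.items())
        if batch:
            send(batch)
            for pair_id, _ in batch:
                del self.pending[pair_id]
            self.corrections.clear()
        return len(batch)
```

Calling `flush` once connectivity is available sends everything collected so far in a single transfer, which matches the batch-wise transmission described above.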

Accordingly, the object stated at the outset is further achieved by a computer program product, for example an app for a smartphone, tablet or laptop, which can be loaded into the memory of at least one mobile terminal and contains software code sections that cause the mobile terminal to receive, from an interface module, a plurality of pairs of image segments and text segments assigned to the user of the terminal and to display them on a display element of the terminal, to receive corrections of the text segment through a user interface of the terminal, and to send corrected text segments to the interface module.

In a further advantageous embodiment, the system according to one of the preceding claims further comprises:

- a text area preprocessing module configured to assign a number of attributes to the image,
- a user management module in which profiles of a plurality of users are stored,

wherein the text area preprocessing module is further configured to assign the pairs on the basis of a matching of the attributes with the profiles. That is, a pre-analysis of the text is carried out which assigns attributes relating, for example, to the type, style, language, layout, etc. of the text in the image. In addition, personal information about each user is stored in profiles, which allows an assessment of the type, style, language, layout, etc. for which the respective user is particularly suited. By matching this information, a user who is particularly suited to correcting the text can then be determined.

The profiles advantageously contain one or more items of information from the following group: language level, ability to read special fonts, preferred working hours, and quality rating. This makes it possible to identify users who are proficient in the language of the text (and at a suitable level), who can read the font of the text well, whose working hours ensure a prompt review, and so on. In addition, users who deliver transcriptions of better quality can be given preference in the assignment.
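The matching of attributes with profiles could be sketched as follows. The field names, the language level scale from 0 to 5 and the quality rating between 0 and 1 are assumptions made for this example; preferred working hours are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    name: str
    languages: dict        # language code -> level, assumed scale 0..5
    scripts: set           # fonts or scripts the user can read
    quality: float = 0.5   # quality rating, assumed range 0..1


def rank_users(attributes, profiles, min_level=3):
    """Return suitable users for a pair, best quality rating first."""
    eligible = [
        p for p in profiles
        if p.languages.get(attributes["language"], 0) >= min_level
        and attributes["script"] in p.scripts
    ]
    return sorted(eligible, key=lambda p: p.quality, reverse=True)
```

A pair with the attributes `{"language": "de", "script": "fraktur"}` would thus be offered first to the user with the highest quality rating among those who read German at the required level and can read Fraktur.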

In a further advantageous embodiment of the system, the text information management system module is further configured to assign a pair to a plurality of users and to change a quality rating in the profile on the basis of a comparison of the corrected text segments received from those users. Assigning a pair to several users makes it possible, on the one hand, to further increase the confidence in the transcription: if several users confirm that a transcription is correct, the probability that this is actually the case increases accordingly. On the other hand, distributing a pair multiple times allows the quality of the users to be checked: if users have confirmed the transcription determined to be correct in this way, or have correctly corrected an incorrect transcription, their quality level in the profile can be increased. Conversely, the quality level of a user can be lowered if that user has failed to recognize an incorrect transcription.
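One possible way to compare the corrections of several users and adjust their quality ratings is a simple majority vote, sketched below. The majority rule, the step size of 0.05 and the rating range from 0 to 1 are assumptions; the disclosure does not prescribe a particular comparison method.

```python
from collections import Counter


def consensus(corrections):
    """corrections maps user -> corrected text; return (majority text, share of votes)."""
    counts = Counter(corrections.values())
    text, votes = counts.most_common(1)[0]
    return text, votes / len(corrections)


def update_quality(quality, corrections, step=0.05):
    """Raise the rating of users who agree with the majority, lower the others."""
    agreed, _ = consensus(corrections)
    updated = dict(quality)
    for user, text in corrections.items():
        delta = step if text == agreed else -step
        # users without a rating start at an assumed neutral value of 0.5
        updated[user] = min(1.0, max(0.0, updated.get(user, 0.5) + delta))
    return updated
```

The share of votes returned by `consensus` can serve as a rough confidence value for the transcription: the more users agree, the higher it is.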

A computer program product can advantageously be loaded into the memory of at least one computer and contains software code sections that cause the computer to perform the function of a system as described above.

The advantages of the invention are, in particular, that by distributing an automated task, namely the OCR transcription, across several systems and/or persons, errors of individual systems can be eliminated and a confidence level can be derived from the resulting vote. The quality of the transcription is increased.

Embodiments of the invention are described in more detail with reference to the drawing, in which the single FIG shows a schematic representation of an exemplary system for optical character recognition.

The diagram shown in the FIG shows an optical character recognition system 1 which applies a dynamic mix of automatic and manual transcription. The modules of the system 1 are implemented by computer processing, i.e. they are modules of a computer program that is loaded into the memory of one or more hardware devices so that the hardware device performs the automated steps described for each module of the system 1. The hardware can be local or distributed, stationary or mobile, or even implemented as a cloud-based system.

The system 1 receives its input in the form of a scanned text document. Common file formats are pdf, jpg, tiff, jp2, etc. The scanning of texts is an industry-standard procedure and is outside the scope of this description. The person skilled in the art knows how to perform scans in order to obtain digital graphic representations of documents in the above or other graphic file formats.

In a first step of the text recognition, a pre-analysis is carried out in a text area analyzer module 2 (Text Area Analyzer (TAA) module). The TAA module 2 generates attributes that relate both to the entire document and to parts of the document. Attributes that relate to the entire document are, for example, the document type (book, newspaper, etc.). In addition, the TAA module 2 performs a document segmentation, i.e. it divides the image of the page into different areas with different properties, such as text areas, illustration areas, mathematical/chemical formula areas, handwritten text areas, etc. These characteristics are represented as attributes and assigned to the individual areas as metadata.

In alternative embodiments, the TAA module 2 is implemented as a combination of automatic analysis and manual (human) quality assurance and correction. That is, the TAA module 2 is designed for human interaction via a user interface in order to improve the quality of the analysis by the TAA module 2. This is advantageous in the case of very heterogeneous layouts (e.g. magazine covers) or poor scan quality (e.g. old microfiches).

In the embodiment shown in the FIG, a subsequent text area preprocessing module 4 (Text Area Pre-Processor (TAPP) module) is used to filter and preprocess areas that were identified as text information in the TAA module 2. This preprocessing can include image enhancement, contrast adjustment or binarization and can improve the quality of the text recognition. However, the TAPP module 4 is optional and may be absent in other embodiments.
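As an example of one of the preprocessing steps mentioned, binarization of a grayscale text area could be sketched as follows. The image is assumed to be a list of rows of gray values from 0 to 255, and the mean gray value is used as a default threshold; both are simplifications chosen for this sketch, not details from the disclosure.

```python
def binarize(gray, threshold=None):
    """Map each gray value to 0 (black) or 255 (white)."""
    if threshold is None:
        pixels = [p for row in gray for p in row]
        threshold = sum(pixels) / len(pixels)  # global mean as a simple default
    return [[255 if p > threshold else 0 for p in row] for row in gray]
```

Production systems typically use adaptive thresholding instead of one global threshold, since scans often have uneven illumination.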

A subsequent text area dispatcher module 6 (Text Area Dispatcher (TAD) module) then uses OCR engines to transcribe the text contained in the text areas. To this end, the TAD module 6 selects, for each text area, an automated optical character recognition system (aTR1, aTR2, ... aTRn) from a collection 8 of optical character recognition engines. The TAD module 6 uses the image areas and metadata of the respective documents to optimize the assignment of the transcription to the downstream automated optical character recognition systems (aTR1, aTR2, ... aTRn). The TAD module uses a self-learning algorithm (e.g. neural networks) to gradually improve the assignment on the basis of the previous results of the individual aTR (automatic text recognition) engines.
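The following sketch illustrates the idea of an assignment that improves with past results. It deliberately replaces the self-learning algorithm mentioned above (e.g. neural networks) with a much simpler stand-in, namely a running average of past confidence per engine and attribute combination. The class name, the attribute keys `language` and `script`, and the prior of 0.5 are assumptions.

```python
from collections import defaultdict


class Dispatcher:
    """Chooses an engine per text area from average past results (simplified stand-in)."""

    def __init__(self, engines, prior=0.5):
        self.engines = list(engines)
        self.prior = prior                # assumed score for untested combinations
        self.totals = defaultdict(float)  # (engine, key) -> sum of confidences
        self.counts = defaultdict(int)    # (engine, key) -> number of results

    @staticmethod
    def _key(attributes):
        return (attributes["language"], attributes["script"])

    def expected(self, engine, attributes):
        """Average past confidence of an engine for this kind of text area."""
        slot = (engine, self._key(attributes))
        n = self.counts[slot]
        return self.totals[slot] / n if n else self.prior

    def choose(self, attributes):
        """Pick the engine with the best expected result for this text area."""
        return max(self.engines, key=lambda e: self.expected(e, attributes))

    def record(self, engine, attributes, confidence):
        """Feed back the validated result so that later choices improve."""
        slot = (engine, self._key(attributes))
        self.totals[slot] += confidence
        self.counts[slot] += 1
```

After `record("aTR1", attrs, 0.3)`, for instance, `choose(attrs)` would switch to an untested engine, because its prior of 0.5 exceeds the observed average of 0.3.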

The aTR engines are various industrial-grade OCR engines. There are several systems on the market that cover a wide range of fonts, languages and layouts (e.g. Tesseract, ABBYY FineReader). Other systems have been developed for very specific input data, e.g. Chinese, Arabic, medieval Latin, etc. Typically, the OCR engines are themselves complex systems with their own strengths and weaknesses. Processing the same input with different systems therefore usually does not produce identical output.

After the text recognition has taken place in one or more OCR engines from the collection 8, an OCR validation engine module 10 (OCR Validation Engine (OVE) module) receives and evaluates the results. The evaluation takes place in two steps, which are described below.

In an evaluation step of the OVE module 10, the quality of the transcription (either from the aTR systems of the collection 8 described above or from the mTR described below) is evaluated using a dictionary lookup in a dictionary module 12 (DICT module) and the confidence information of the various aTR systems. Typically, such aTR systems deliver a confidence indication together with the transcription result. The confidence indications of the individual aTRs are normalized in the OVE module 10 to a common reference system for the resulting confidence level. Transcriptions are additionally verified by a dictionary lookup in the dictionary module 12. The confidence level of transcriptions with a dictionary match is increased.

The OVE module 10 selects the transcription with the best rating for the second step of the validation. Based on their respective confidence levels, transcriptions are accepted or marked for post-processing.
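A possible form of this second step is sketched below. The acceptance threshold `ACCEPT_THRESHOLD` and the status labels are assumptions; the description only states that a decision is made on the basis of the confidence level.

```python
ACCEPT_THRESHOLD = 0.85  # assumed value, not taken from the description


def validate(candidates):
    """Pick the best-rated transcription and decide how to proceed.

    candidates: list of (text, confidence) pairs with confidences
    already normalized to the range 0..1.
    """
    text, confidence = max(candidates, key=lambda c: c[1])
    if confidence >= ACCEPT_THRESHOLD:
        status = "accepted"
    else:
        status = "post-processing"
    return text, confidence, status
```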

In an alternative embodiment, the OVE module 10 uses so-called alternative word voting, which is provided by some aTR implementations. In this case, an aTR provides several possible transcriptions for a word, each with an individual confidence level. The OVE module 10 combines the results of several aTRs and selects the option with the highest combined confidence. All word votes by individual aTRs and alternative word votes (if available) are processed and the confidence levels are calculated. The transcription with the highest combined confidence level is then accepted or considered for post-processing.
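One conceivable way of combining such votes is sketched below. The description leaves the combination rule open; averaging the confidences over all participating engines is an assumption made here, and the function name `combine_votes` is hypothetical.

```python
from collections import defaultdict


def combine_votes(votes_per_engine):
    """Combine alternative word votes from several aTRs.

    votes_per_engine: one dict per engine, mapping each candidate
    word to that engine's (normalized) confidence for it.
    Returns the candidate with the highest combined confidence.
    """
    totals = defaultdict(float)
    for alternatives in votes_per_engine:
        for word, confidence in alternatives.items():
            totals[word] += confidence
    # Assumed rule: average over all engines, so a candidate that
    # an engine did not propose counts as confidence 0 for it.
    engine_count = len(votes_per_engine)
    combined = {word: total / engine_count for word, total in totals.items()}
    best = max(combined, key=combined.get)
    return best, combined[best]
```

With three engines voting `{"cat": 0.6, "cot": 0.4}`, `{"cot": 0.7, "cat": 0.2}` and `{"cat": 0.5}`, the candidate "cat" wins because its summed confidence (1.3) exceeds that of "cot" (1.1).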

In a further alternative embodiment of the system 1, transcriptions with satisfactory confidence are randomly selected by the OVE module 10 for post-processing in order to increase the overall quality and to promote self-learning. The random selection can be weighted by the confidence level so that transcriptions whose confidence level lies just above the threshold are preferred for post-processing.
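The weighted random selection could look roughly as follows. The linear weighting, the parameter `base_rate`, and the function name `pick_for_review` are assumptions; the sketch also assumes a threshold strictly below 1.0.

```python
import random


def pick_for_review(transcriptions, threshold, base_rate=0.5, rng=None):
    """Randomly pick already-accepted transcriptions for post-processing.

    transcriptions: list of (text, confidence) pairs.
    The closer a confidence lies to the threshold, the more likely
    the transcription is picked (assumed linear weighting).
    """
    rng = rng or random.Random()
    picked = []
    for text, confidence in transcriptions:
        if confidence < threshold:
            continue  # below threshold: marked for post-processing anyway
        # margin is 0.0 right at the threshold and 1.0 at confidence 1.0
        margin = (confidence - threshold) / (1.0 - threshold)
        probability = base_rate * (1.0 - margin)
        if rng.random() < probability:
            picked.append(text)
    return picked
```

Passing a seeded `random.Random` instance as `rng` makes the selection reproducible, which can be useful when auditing which transcriptions were re-checked.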

In a further variant of the OVE module 10 and the DICT module 12, the OVE module 10 updates the dictionary module 12 or the dictionaries (DICT) with new words taken from transcriptions with a very high confidence level. In a further variant, the updated dictionaries are made available to the aTR engines in order to improve aTR performance.
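A minimal sketch of such a dictionary update is given below. The cut-off `min_confidence` for a "very high" confidence level and the restriction to purely alphabetic tokens are assumptions, as is the function name `learn_words`.

```python
def learn_words(dictionary, transcription, confidence, min_confidence=0.98):
    """Add unknown words from a very confident transcription.

    dictionary: a set of lower-case words, updated in place.
    Returns the set of words that were newly added.
    """
    if confidence < min_confidence:
        return set()
    # Assumption: only purely alphabetic tokens are treated as words.
    candidates = {w for w in transcription.lower().split() if w.isalpha()}
    new_words = candidates - dictionary
    dictionary |= new_words
    return new_words
```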

The OVE module 10 evaluates the individual performance of the aTRs in relation to the text attributes contained in the metadata (e.g. language, font). The OVE module 10 supplies these evaluations to the TAD module 6 in order to improve the assignment decision. If, for example, five different aTRs are available, voting always takes place between only three of them, and the results of the other two aTRs are regularly discarded, the TAD module 6 will in future assign the task for a given language/font to those three best-suited aTRs on the basis of the previous evaluation.
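The feedback from the OVE module to the TAD module could be kept as simple per-attribute statistics, as in the following sketch. The class `EngineStats`, the use of a "share of results that entered the vote" metric, and the tie-break by engine name are assumptions for illustration.

```python
from collections import defaultdict


class EngineStats:
    """Tracks how often each aTR's result was used, per language/font."""

    def __init__(self):
        self._used = defaultdict(int)
        self._runs = defaultdict(int)

    def record(self, language, font, engine, used_in_vote):
        """Store one observation reported by the validation step."""
        key = (language, font, engine)
        self._runs[key] += 1
        if used_in_vote:
            self._used[key] += 1

    def best_engines(self, language, font, k=3):
        """Return the k engines with the highest usage rate."""
        rates = {}
        for (lang, fnt, engine), runs in self._runs.items():
            if (lang, fnt) == (language, font):
                rates[engine] = self._used[(lang, fnt, engine)] / runs
        # Highest rate first; ties are broken by engine name.
        return sorted(rates, key=lambda e: (-rates[e], e))[:k]
```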

Transcriptions that are marked for manual post-processing are made available to a text information management system module 14 (TIMS module). There, the original image and the transcription results are segmented, typically at sentence level. A corresponding pair, consisting of an image segment and a transcription segment, forms a post-processing task. All tasks created from several transcriptions are processed in random order in order to prevent the recompilation of larger text segments by mTRs (manual text recognizers) in a group 16.
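The creation and shuffling of tasks can be sketched as follows. The sketch assumes that image segments and text segments have already been aligned sentence by sentence; the `Task` fields and the string placeholders for image segments are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    document_id: str    # transcription the pair was taken from
    position: int       # index of the sentence within that transcription
    image_segment: str  # placeholder for the cropped image region
    text_segment: str   # transcription of that region


def build_task_queue(documents, rng=None):
    """Turn aligned (image, text) pairs into a shuffled task queue.

    documents: dict mapping a document id to its list of
    (image_segment, text_segment) pairs in reading order.
    """
    rng = rng or random.Random()
    tasks = [
        Task(doc_id, position, image, text)
        for doc_id, pairs in documents.items()
        for position, (image, text) in enumerate(pairs)
    ]
    # Random order across all documents, so that no reviewer sees
    # consecutive sentences of the same text.
    rng.shuffle(tasks)
    return tasks
```

Keeping `document_id` and `position` on every task means the corrected segments can later be sorted back into their original context.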

The TIMS module 14 assigns the above-mentioned tasks to the mTRs (i.e. people) on the basis of their profiles, which are stored in a user management module (UM module 18). Profiles can contain information about language levels, the ability to read special fonts, preferred working hours, and quality ratings (where permitted by law). In the exemplary embodiment, these mTR users use a web interface 20, which is provided by a web server (WS) module 22, to view the image contained in a task and the transcription result. The transcription results are confirmed or corrected by the respective mTR. The people in the group 16 who take part in the manual correction process (mTR) can be volunteers or can receive a financial reward for their work.
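Profile-based assignment might be reduced to a simple filter and ranking, as sketched here. The `Profile` fields cover only part of the profile information named above, and ranking eligible users by quality rating is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    user: str
    languages: set        # languages the user can read
    fonts: set            # fonts the user can read
    quality: float = 0.5  # quality rating in the range 0..1


def eligible_users(profiles, language, font):
    """Users able to handle a task, best quality rating first."""
    matches = [
        p for p in profiles
        if language in p.languages and font in p.fonts
    ]
    # Ties in quality are broken alphabetically by user name.
    ranked = sorted(matches, key=lambda p: (-p.quality, p.user))
    return [p.user for p in ranked]
```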

In an advantageous embodiment, the mTRs work on mobile devices, e.g. from home, while traveling, or to make use of short waiting times. A batch of sentences is provided per mTR and stored locally in a mobile application while connectivity is available. An mTR can work on the transcription of the entire batch in an offline environment. The results are sent to the TIMS module 14 as soon as connectivity is restored. This variant makes it possible to extend the deployment of people working as mTRs to regions with less developed infrastructure.
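The offline workflow on the mobile terminal could be organized along the following lines. The class `OfflineBatch` and the `send` callback standing in for the transfer to the TIMS module are hypothetical; how connectivity is detected is outside the scope of this sketch.

```python
class OfflineBatch:
    """Local store for a batch of tasks on a mobile terminal."""

    def __init__(self, tasks):
        self.pending = dict(tasks)  # task id -> text still to be reviewed
        self.results = {}           # task id -> corrected text, not yet sent

    def correct(self, task_id, corrected_text):
        """Record a correction locally; works without connectivity."""
        self.results[task_id] = corrected_text
        self.pending.pop(task_id, None)

    def sync(self, send, online):
        """Send stored results once connectivity is available.

        send: callable taking (task_id, corrected_text).
        Returns the number of results that were sent.
        """
        if not online:
            return 0
        sent = 0
        for task_id, text in list(self.results.items()):
            send(task_id, text)
            del self.results[task_id]
            sent += 1
        return sent
```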

The TIMS module 14 collects the results from mTRs via the WS module 22 and applies a dictionary lookup in the DICT module 12 in order to evaluate the corrections provided by mTRs. A new confidence level is derived from the results of the dictionary lookup, the user profile data from the UM module 18, and the previous confidence level from the OVE module 10. The TIMS module 14 schedules tasks with an unsatisfactory confidence level for further post-processing by additional mTRs until a satisfactory confidence level is reached.
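How the three inputs are combined is not specified; the following sketch assumes a weighted sum with hypothetical weights and a hypothetical threshold, and repeats the review until the threshold is reached or no further reviews are available.

```python
def updated_confidence(previous, user_quality, dictionary_match,
                       weights=(0.3, 0.5, 0.2)):
    """Derive a new confidence level from the three inputs.

    previous: confidence level so far (0..1).
    user_quality: quality rating from the reviewer's profile (0..1).
    dictionary_match: whether the correction passed the dictionary lookup.
    The weighted sum and the weights are assumptions.
    """
    w_prev, w_user, w_dict = weights
    dict_score = 1.0 if dictionary_match else 0.0
    return w_prev * previous + w_user * user_quality + w_dict * dict_score


def review_until_confident(previous, reviews, threshold=0.8):
    """Apply reviews one by one until the threshold is reached.

    reviews: iterable of (user_quality, dictionary_match) pairs, one
    per additional mTR. Returns (confidence, number of reviews used).
    """
    confidence = previous
    rounds = 0
    for user_quality, dictionary_match in reviews:
        rounds += 1
        confidence = updated_confidence(confidence, user_quality,
                                        dictionary_match)
        if confidence >= threshold:
            break
    return confidence, rounds
```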

In an alternative embodiment, tasks are assigned to several mTRs from the outset in order to enable a better evaluation by comparing the results of several mTR corrections.
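A comparison of several corrections could, for instance, take the form of a simple majority vote. This is only one possible comparison rule; exact string equality as the agreement criterion is an assumption, and the function name `consensus` is hypothetical.

```python
from collections import Counter


def consensus(corrections):
    """Compare the corrections of several mTRs for one task.

    corrections: dict mapping a user name to the corrected text.
    Returns the most frequent text and the share of users who
    submitted exactly that text.
    """
    counts = Counter(corrections.values())
    text, votes = counts.most_common(1)[0]
    agreement = votes / len(corrections)
    return text, agreement
```

The agreement share can serve as one input to the confidence level of the task, and users who deviate from the majority can be identified by comparing their entry with the returned text.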

Once a satisfactory confidence level has been reached for a task, the TIMS module 14 reports back to the UM module 18 which mTRs delivered correct or incorrect transcriptions. The UM module 18 then updates the quality rating in the respective mTR profile in order to improve future task assignments and task evaluations by the TIMS module 14. The TIMS module 14 then reinserts the tasks into the original context of a transcription.
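Both steps can be sketched briefly. The exponential update of the quality rating with the parameter `rate` is an assumption, as is the representation of segments as (position, text) pairs.

```python
def update_quality(quality, was_correct, rate=0.1):
    """Move a quality rating (0..1) towards 1 or 0 after a task.

    Assumed update rule: exponential moving average with step `rate`.
    """
    target = 1.0 if was_correct else 0.0
    return quality + rate * (target - quality)


def reassemble(segments):
    """Put corrected segments back into their original order.

    segments: list of (position, text) pairs in any order.
    """
    ordered = sorted(segments, key=lambda segment: segment[0])
    return " ".join(text for _, text in ordered)
```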

Alternatively, the UM module 18 uses the feedback from the TIMS module 14 to calculate the rewards for the users (mTR) of the group 16.

List of reference numbers

1: Optical character recognition system
2: Text area analyzer module
4: Text area preprocessing module
6: Text area dispatcher module
8: Collection of optical character recognition engines
10: OCR validation engine module
12: Dictionary module
14: Text information management system module
16: Group of manual translators
18: User management module
20: Web interface
22: Web server module

Claims

System (1) for optical character recognition (OCR) from an image containing text, comprising: - a text area dispatcher module (6) which is connected to at least one OCR engine (aTR1, aTR2, ... aTRn) and is configured to assign the image as input data to the OCR engine (aTR1, aTR2, ... aTRn) and to receive machine-coded texts as output data from the OCR engine (aTR1, aTR2, ... aTRn), and - a text information management system module (14) configured to segment the image and the machine-coded text into pairs of corresponding image segments and text segments, assign the pairs to a plurality of users (mTR) for review, and receive corrected text segments from the users (mTR).

System (1) according to claim 1, wherein the segmentation takes place at sentence level.

System (1) according to claim 1 or 2, further comprising a dictionary module (12), the text information management system module (14) being further configured to check the corrected text segments in the dictionary module (12).

System (1) according to one of the preceding claims, comprising an interface module (22) which provides a human-machine interface (20) on which the users (mTR) can view the pairs assigned to them and enter corrections.

System (1) according to one of the preceding claims, comprising an interface module (22) which is configured such that it transmits a plurality of pairs each assigned to a user (mTR) to a mobile terminal of the user (mTR), and receives corrected text segments generated on the mobile terminal.

System (1) according to one of the preceding claims, further comprising - a text area preprocessing module (4) configured to assign a number of attributes to the image, and - a user management module (18) in which profiles of a plurality of users (mTR) are stored, the text area preparation module (6) being further configured to assign the pairs on the basis of a comparison of the attributes with the profiles.

System (1) according to claim 6, wherein the profiles contain one or more items of information from the following group: language level, ability to read special fonts, preferred working time, and quality rating.

System (1) according to claim 7, wherein the text information management system module (14) is further configured to assign a pair to a plurality of users (mTR) and, based on a comparison of the plurality of corrected text segments from the plurality of users (mTR), to change a quality rating in the profile.

Computer program product which can be loaded into the memory of at least one computer and contains software code sections which cause the computer to perform the function of a system (1) according to one of the preceding claims.

Computer program product which can be loaded into the memory of at least one mobile terminal and contains software code sections which cause the mobile terminal to receive, from an interface module (22), a plurality of pairs of image segments and text segments assigned to the user (mTR) of the terminal, to display them on a display element of the terminal, to receive corrections of the text segments via a user interface of the terminal, and to send corrected text segments to the interface module (22).