DE102005056713A1

DE102005056713A1 - Method for verifying a document, e.g. an image document, in which information accepted as correct and to be used uniformly is searched for in a reference database on the basis of the given information, and the given information is replaced by the information accepted as correct

Info

Publication number: DE102005056713A1
Application number: DE102005056713A
Authority: DE
Inventors: Arthur Pease
Original assignee: Siemens AG
Current assignee: Siemens AG
Priority date: 2005-11-28
Filing date: 2005-11-28
Publication date: 2007-05-31
Also published as: WO2007060073A1

Abstract

The method involves extracting given information from a document using an information extraction tool according to specified rules for recognizing information. For the given information, a reference database is searched, using rules for recognizing comparable information, for information that is accepted as correct and is to be used uniformly, and the given information is replaced by that information. An independent claim is also included for a device for verifying a document.

Description

The invention relates to a method and a device for checking documents, in which an image/text document is automatically checked for correctness, in particular while it is still being created, and errors contained in the document are then, where applicable, automatically marked or eliminated.

Such a method and such a device are well known from modern word processing programs in the form of spelling and grammar checking, i.e. a syntax check.

This does not, of course, check the correctness of the facts or data used in creating the document. To date, the author of a document has often used an information network, e.g. the Internet, to check manually the facts used in the document. This is time-consuming, however, and problems arise, for example, from inconsistent, outdated or insufficiently precise information.

The publication IEEE Computer Society, IT Pro November | December discloses so-called "information extraction tools" (IE tools), which find specific information in a "sea of text". They do so by recognizing certain entities, such as persons, organizations, names, places, times and amounts of money; certain relations between these entities, such as "employed by", "wife of", "owner of" or "born in"; and events, such as "meeting", "conclusion of a contract" or "purchase of a company". Such IE tools use linguistic conventions as well as interpretation and referencing rules, and are often also capable of learning.

The object underlying the invention is to specify a method and a device for automatically checking image/text documents in such a way that the disadvantages stated above are avoided.

According to the invention, this object is achieved with regard to the method by the features of claim 1 and with regard to the device by the features of claim 5.

The further claims relate to advantageous embodiments of the method according to the invention.

The invention essentially consists in that at least one stated fact is extracted from a document using an information extraction tool according to specific rules for recognizing facts; that, for each stated fact, a corresponding fact that is to be used uniformly and is accepted as correct is searched for in a reference database using specific rules for recognizing comparable facts; and that the stated fact is then replaced, automatically or on request, by the fact to be used uniformly and accepted as correct, if such a fact was found.

The invention is explained in more detail below with reference to preferred application examples.

In a text document, at least one stated fact is extracted using an information extraction tool according to specific rules for recognizing facts from a document.

Such rules for extracting a fact are, for example:
Fact = time + company name + location + "employs" | "employed" + number + "employees" | "engineers"
as well as semantic equivalents of this rule, such as Fact = "on the employee list of" + company name + location + "are in" | "were in" + time + number + "persons" | "engineers" + "named" | "listed" | "registered",
and also all syntactically correct equivalents of all these semantically equivalent rules.
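The first rule above can be sketched as a regular expression. This is only an illustration under stated assumptions, not the extraction tool the document has in mind: the `Fact` tuple, the `COMPANIES` and `LOCATIONS` vocabularies and the `extract_fact` function are hypothetical names, and a four-digit year stands in for the time indication.

```python
import re
from typing import NamedTuple, Optional


class Fact(NamedTuple):
    """One extracted fact (hypothetical structure for illustration)."""
    time: str
    company: str
    location: str
    number: int
    designation: str


# Assumed closed vocabularies; a real IE tool would use richer recognizers.
COMPANIES = ("Siemens",)
LOCATIONS = ("USA", "Germany")

# Fact = time + company name + location + "employs"|"employed"
#        + number + "employees"|"engineers"
RULE = re.compile(
    r"(?P<time>\d{4})\s+"
    r"(?P<company>" + "|".join(map(re.escape, COMPANIES)) + r")\s+"
    r"(?P<location>" + "|".join(map(re.escape, LOCATIONS)) + r")\s+"
    r"(?:employs|employed)\s+"
    r"(?P<number>\d+)\s+"
    r"(?P<designation>employees|engineers)"
)


def extract_fact(text: str) -> Optional[Fact]:
    """Return the first fact matching the rule, or None if nothing matches."""
    m = RULE.search(text)
    if m is None:
        return None
    return Fact(m["time"], m["company"], m["location"],
                int(m["number"]), m["designation"])
```

The semantic equivalents mentioned above would each need a pattern of their own that maps onto the same `Fact` fields.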

As soon as, for example, the sentence or clause "2004 Siemens USA employed 63000 employees" is entered, it is recognized as a fact by means of the rule given above, and, for this stated fact, a corresponding fact that is to be used uniformly and is accepted as correct is searched for in a reference database using specific rules for recognizing comparable facts.

Comparable facts here could be, for example, all facts with the following attributes
Company name = Siemens
Location = USA
Time = 2004
Employed persons = any
which are searched for and found in the reference database.
As a result, the following comparable facts appear, for example, from which the user can then select:
Employees = 64000
Engineers = 30000
Commercial staff = 10000
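The lookup of comparable facts can be pictured as a query keyed by company name, location and time, with the headcount category left open. The sketch below assumes a small in-memory dictionary as the reference database; `REFERENCE_DB` and `find_comparable_facts` are hypothetical names, and the values are the example figures from the text.

```python
from typing import Dict, Tuple

# Assumed in-memory stand-in for the reference database,
# keyed by (company name, location, time).
REFERENCE_DB: Dict[Tuple[str, str, str], Dict[str, int]] = {
    ("Siemens", "USA", "2004"): {
        "employees": 64000,
        "engineers": 30000,
        "commercial staff": 10000,
    },
}


def find_comparable_facts(company: str, location: str, time: str) -> Dict[str, int]:
    """Return all reference values sharing company, location and time.

    The headcount category is deliberately not part of the key, which
    corresponds to "Employed persons = any" in the example.
    """
    return dict(REFERENCE_DB.get((company, location, time), {}))
```

Calling `find_comparable_facts("Siemens", "USA", "2004")` yields the three candidate values from which the user could choose.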

Because of the term "employees" in the entered sentence or clause, however, the stated value "63000" could here also be replaced automatically straight away by the value "64000" from the reference database, which is to be used uniformly and is accepted as correct.
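A minimal sketch of such an automatic replacement follows, assuming the number to be corrected is the one directly followed by the designation (here "employees"). The function name `replace_value` is hypothetical, and only the single-value case is covered, not the reordering of several words.

```python
import re


def replace_value(text: str, designation: str, reference_value: int) -> str:
    """Replace the number directly preceding `designation` with the reference value."""
    # Lookahead: match digits only if whitespace and the designation follow.
    pattern = re.compile(r"\d+(?=\s+" + re.escape(designation) + r"\b)")
    return pattern.sub(str(reference_value), text, count=1)


sentence = "2004 Siemens USA employed 63000 employees"
corrected = replace_value(sentence, "employees", 64000)
```

The year "2004" is left untouched because it is not followed by the designation.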

Besides the replacement of individual words, the replacement of several words, up to the entire entered fact, is also possible, for example if the order of the words has to be changed.

In documents, the meaning of a number is usually readily recognizable from a designation that follows it practically immediately, and this can be used to advantage for extracting facts.

A further embodiment of the invention consists in that text information shown in images is determined, for example by OCR (optical character recognition), and is used to check the correctness of corresponding statements in the associated accompanying texts.
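The comparison step of this embodiment might look as follows. The sketch assumes the OCR step has already run and its output is available as a plain string; the OCR itself is not shown. `labelled_numbers` and `find_mismatches` are hypothetical helpers that pair each number with the word that follows it.

```python
import re
from typing import Dict, List, Tuple

# A number followed by a one-word designation, e.g. "64000 employees".
NUMBER_WITH_LABEL = re.compile(r"(\d+)\s+([A-Za-z]+)")


def labelled_numbers(text: str) -> Dict[str, int]:
    """Map each designation to the number that directly precedes it."""
    return {label.lower(): int(num) for num, label in NUMBER_WITH_LABEL.findall(text)}


def find_mismatches(ocr_text: str, caption: str) -> List[Tuple[str, int, int]]:
    """List (designation, caption value, image value) where the two disagree."""
    from_image = labelled_numbers(ocr_text)
    from_caption = labelled_numbers(caption)
    return [
        (label, value, from_image[label])
        for label, value in from_caption.items()
        if label in from_image and from_image[label] != value
    ]
```

Each reported mismatch could then be marked in the accompanying text or offered to the user for correction.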

A final embodiment of the invention consists in that persons and/or objects shown in image documents are determined using image recognition/comparison methods, or directly using the structured data of modern image description files, and are compared with data from a reference database, in order then to check the correctness of corresponding statements in the associated accompanying texts and, where appropriate, to exchange images or facts in texts accordingly.

The method according to the invention is advantageously carried out largely in step with the creation of a text and with a syntactic check that precedes it in each case.

Claims

Method for checking documents, - in which at least one stated fact is extracted from a document using an information extraction tool according to specific rules for recognizing facts, - in which, for each stated fact, a corresponding fact that is to be used uniformly and is accepted as correct is searched for in a reference database using specific rules for recognizing comparable facts, and - in which the stated fact is then replaced, automatically or on request, by the fact to be used uniformly and accepted as correct.

Method according to claim 1, - in which a fact consists of at least a first entity/event indication, a second entity/event indication and a relation between the two, - in which facts are recognized in that specific entity/event indications from a predetermined list of entity/event indications and relations from a list of predetermined relations occur in the document in a predetermined manner, - in which comparable facts are recognized by identical pairs of first entity/event indications and relations, and - in which the second entity/event indications of the comparable facts of the document and of the reference database are examined using tolerance rules to determine whether or not the respective fact is to be replaced.

Method according to claim 2, in which one entity/event indication is either a name or a description parameter of an image file, the other entity/event indication represents the name of the object shown in the image, and the relation expresses this circumstance.

Method according to one of the preceding claims, in which the document is checked repeatedly even while it is being created.

Device for checking documents, - in which an information extraction tool is present such that at least one stated fact is extracted from a document according to specific rules for recognizing facts, - in which a reference database is present such that, using specific rules for recognizing comparable facts, a corresponding fact that is to be used uniformly and is accepted as correct is searched for a stated fact, and - in which a program unit for text replacement is present such that the stated fact is replaced, automatically or on request, by the fact to be used uniformly and accepted as correct.