DE19526263C1

DE19526263C1 - Automatic process for classification of text

Info

Publication number: DE19526263C1
Application number: DE19526263A
Authority: DE
Inventors: Thomas Dr Bayer
Original assignee: Daimler Benz AG
Current assignee: Mercedes Benz Group AG
Priority date: 1995-07-19
Filing date: 1995-07-19
Publication date: 1996-11-07
Anticipated expiration: 2015-07-20

Abstract

The process relates to the automatic processing of digitised natural-language text. The process uses neural network technology and employs a statistical process for classification purposes. The classification is based upon a large number of descriptors held in a generated list; checking the text against this list yields a characteristic (feature) vector. The processing effort is reduced by lowering the vector dimension with the aid of a transformation matrix. The descriptors and the transformation matrix are obtained with the use of a number of training texts.

Description

The invention relates to a method for classifying a natural-language text according to the preamble of claim 1.

The classification of texts is an essential step in the automatic processing of digitized texts and is of particular importance for the automated understanding of natural-language texts. By assigning a text to a thematically more narrowly restricted text class, the knowledge base required for the further automatic processing of the text, in the form of lexicon memory, syntactic and semantic rules, etc., can be greatly restricted. In many cases it is only this restriction that makes processing with reasonable effort and an acceptable success rate possible at all.

Usually, a plurality of descriptors is predefined for this purpose, and the occurrence of such descriptors in a text to be classified is checked. The way in which the descriptors are specified also influences the classification procedure.

For descriptors with conceptual content, such as word forms or expressions comprising several words, classification with rule-based classifiers such as decision trees is appropriate; see, for example, "Toward Language Independent Automated Learning of Text Categorization Models" by Apte/Damerau/Weiss in Proceedings of the 17th Int. Conf. on Research and Development in Information Retrieval, pp. 23-30, Ireland 1994.

For descriptors which, like n-grams, consist more or less of elements without semantic content, statistical classification techniques are more suitable. These include, for example, neural networks or vector distance testing with assignment to the class of the nearest neighbor ("N-Gram-Based Text Categorization" by Cavnar/Trenkle in Proceedings of the 3rd Annual Symposium on Document Analysis and Information Retrieval, pp. 161-175, Las Vegas, 1994). For this purpose, the predefined descriptors span an m-dimensional vector space, and for a text to be classified a vector is generated whose components are obtained by comparing the text with the descriptor list. The text-specific vector produced in this way is compared with a plurality of training vectors for training texts of known class membership. The text to be classified is assigned to the class to which the training vector with the smallest vector distance belongs.
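The nearest-neighbor assignment described above can be sketched as follows. This is a minimal illustration, not the procedure of the cited paper: the Euclidean distance, the function names, the toy vectors and the class labels are all assumptions chosen for the example, since the text only speaks of "the smallest vector distance".

```python
import math

def euclidean(u, v):
    """Euclidean distance between two vectors of equal length (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_neighbor_class(text_vector, training_vectors, training_labels):
    """Return the class label of the training vector closest to text_vector."""
    best = min(range(len(training_vectors)),
               key=lambda i: euclidean(text_vector, training_vectors[i]))
    return training_labels[best]

# Hypothetical training vectors (descriptor frequencies) with known classes.
training_vectors = [[3, 0, 1, 0], [2, 1, 0, 0], [0, 2, 0, 3]]
training_labels = ["invoice", "invoice", "complaint"]

print(nearest_neighbor_class([0, 1, 0, 2], training_vectors, training_labels))
```

Every new text is compared with every training vector, so the cost of this approach grows with both the number of training texts and the vector dimension m.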

The object of the present invention is to specify a method for classifying a natural-language text on the basis of a text-specific feature vector.

The invention is described in claim 1. The dependent claims contain advantageous embodiments and developments of the invention.

The invention exploits the insight that a large number of the components of the text-specific vectors are zero, i.e. that of the high number of descriptors typical of such classification approaches, only a small proportion is represented within any one text. Moreover, not all descriptors are equally important for the classification. With the transformation according to the invention, the vector dimension can be reduced considerably, so that a classifier can be designed and operated more simply. The components of the transformation matrix can be chosen such that the components of the higher-dimensional feature vector enter the reduced vector with different weightings according to their importance for the classification. This is particularly advantageous in conjunction with generating the descriptors by essentially statistical methods with little or no morphological and linguistic knowledge base.

A preferred embodiment provides for determining the transformation matrix from training vectors by means of the principal axis transformation, which is known per se and is used, for example, in character recognition (OCR) methods, followed by a restriction to the eigenvectors with the highest eigenvalues.

A polynomial classifier, which in the simplest case can also be a linear classifier, is advantageously used as the classifier.
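As a rough illustration of how a linear classifier is the degree-1 special case of a polynomial classifier, the sketch below expands a reduced vector into its monomial terms and evaluates one discriminant function as a weighted sum of those terms. The function names, the weights and the ordering of the terms are assumptions made for this example; the text does not specify them.

```python
from itertools import combinations_with_replacement

def polynomial_terms(y, degree):
    """Return the constant 1 followed by all monomials of y up to `degree`."""
    terms = [1.0]
    for d in range(1, degree + 1):
        for indices in combinations_with_replacement(range(len(y)), d):
            product = 1.0
            for i in indices:
                product *= y[i]
            terms.append(product)
    return terms

def discriminant(weights, y, degree):
    """Weighted sum of the polynomial terms (one discriminant per class)."""
    return sum(w * t for w, t in zip(weights, polynomial_terms(y, degree)))

y = [2.0, 3.0]                      # hypothetical reduced vector, n = 2
print(polynomial_terms(y, 1))       # linear case: 1, y1, y2
print(polynomial_terms(y, 2))       # adds y1*y1, y1*y2, y2*y2
```

The number of terms grows quickly with the vector dimension, which is one reason a reduced vector makes such a classifier easier to design and operate.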

The components of the feature vector can be binary and indicate only the presence or absence of a descriptor in a checked text. Preferably, however, the values of the vector components for a text also represent the frequency with which the descriptor occurs in the checked text.
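The two variants of the feature vector can be sketched as follows. The descriptor list, the sample text and the choice to count overlapping occurrences are assumptions for illustration only.

```python
def count_occurrences(text, descriptor):
    """Count occurrences of descriptor in text, overlapping ones included."""
    count = 0
    start = text.find(descriptor)
    while start != -1:
        count += 1
        start = text.find(descriptor, start + 1)
    return count

def feature_vector(text, descriptors, binary=False):
    """One component per descriptor: its frequency, or 0/1 if binary is set."""
    counts = [count_occurrences(text, d) for d in descriptors]
    if binary:
        return [1 if c > 0 else 0 for c in counts]
    return counts

descriptors = ["the", "ing", "tex", "xyz"]   # hypothetical descriptor list, m = 4
text = "the thing in the text"

print(feature_vector(text, descriptors))               # frequencies
print(feature_vector(text, descriptors, binary=True))  # presence or absence
```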

The invention is illustrated further below by means of an example.

In a preprocessing step, suitable descriptors are determined from a collection of 600 training texts and entered in a descriptor list. Methods for this are known from the prior art. Advantageous descriptors are, for example, trigrams or, preferably, the descriptors obtained by the method described in the simultaneously filed German patent application "Verfahren zur Erzeugung von Deskriptoren" (Method for generating descriptors). The training texts are compared individually with the descriptor list, and for each descriptor the frequency of its occurrence in the text is determined. The result of the comparison can be represented as an m-dimensional feature vector, with m being the number of descriptors, in which the determined frequencies are entered at the positions of the feature vector assigned to the respective descriptors. Let the number of descriptors be m = 2500. In this way, 600 feature vectors of dimension 2500 are obtained from the collection of training texts. These training vectors are subjected to a principal axis transformation which, in a manner known per se and with the aim of minimizing reconstruction errors, yields 2500 eigenvectors b_i (i = 1 to m), each with an associated eigenvalue l_i. The pairs (b_i, l_i) of eigenvectors and associated eigenvalues are ordered by the size of the eigenvalues. Only a number n of eigenvectors, those belonging to the n largest eigenvalues, e.g. n = 200, is used further. These vectors are combined to form an m × n transformation matrix.
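The construction of the transformation matrix can be sketched on a toy scale as follows. This is only an illustration under stated assumptions: the training vectors are invented (m = 3 instead of 2500, n = 2 instead of 200), the vectors are centered before the covariance is formed, and the eigenpairs are found by power iteration with deflation so that the sketch needs only the standard library. None of these choices is taken from the patent text, and a numerical library would normally be used for realistic dimensions.

```python
import math

def covariance_matrix(vectors):
    """Covariance of the training vectors (centering is an assumption)."""
    count, m = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / count for j in range(m)]
    centered = [[v[j] - mean[j] for j in range(m)] for v in vectors]
    return [[sum(c[j] * c[k] for c in centered) / (count - 1)
             for k in range(m)] for j in range(m)]

def dominant_eigenpair(matrix, iterations=500):
    """Power iteration: largest eigenvalue and unit eigenvector of a
    symmetric positive semi-definite matrix."""
    m = len(matrix)
    b = [float(j + 1) for j in range(m)]          # arbitrary start vector
    norm = math.sqrt(sum(x * x for x in b))
    b = [x / norm for x in b]
    for _ in range(iterations):
        w = [sum(matrix[j][k] * b[k] for k in range(m)) for j in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:                          # no variance left to extract
            break
        b = [x / norm for x in w]
    mb = [sum(matrix[j][k] * b[k] for k in range(m)) for j in range(m)]
    return sum(x * y for x, y in zip(b, mb)), b

def transformation_matrix(vectors, n):
    """Return the m x n matrix whose columns are the eigenvectors belonging
    to the n largest eigenvalues, together with those eigenvalues."""
    cov = covariance_matrix(vectors)
    m = len(cov)
    pairs = []
    for _ in range(n):
        value, vec = dominant_eigenpair(cov)
        pairs.append((value, vec))
        # Deflation: remove the component just found from the matrix.
        cov = [[cov[j][k] - value * vec[j] * vec[k] for k in range(m)]
               for j in range(m)]
    columns = [vec for _, vec in pairs]
    matrix = [[col[j] for col in columns] for j in range(m)]
    return matrix, [value for value, _ in pairs]

# Hypothetical training vectors of dimension m = 3.
training_vectors = [[0, 0, 1], [4, 0, 1], [0, 2, 1],
                    [4, 2, 1], [2, 1, 2], [2, 1, 0]]
T, eigenvalues = transformation_matrix(training_vectors, 2)
print(eigenvalues)   # ordered from largest to smallest
```

Because the eigenvalues are extracted largest first, the loop already produces the ordering by eigenvalue size that the example describes.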

The feature vectors of the training texts are transformed by means of the transformation matrix into reduced training vectors of dimension n, and a linear classifier is adjusted on the basis of these reduced training vectors and the known class membership of the corresponding training texts. Classifiers as such, and their adjustment on the basis of training samples, are known from the prior art.
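One common way to adjust a linear classifier from labelled samples is a least-squares fit against one-hot class targets. The sketch below assumes that approach; the text leaves the adjustment method open, and the function names, the small ridge term, the reduced vectors and the class names are all hypothetical.

```python
def solve(a, b):
    """Solve a * x = b by Gauss-Jordan elimination with partial pivoting.
    b may have several columns; a is assumed to be non-singular."""
    size = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(size)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(size):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[size:] for row in aug]

def train_linear_classifier(reduced_vectors, labels, classes, ridge=1e-6):
    """Least-squares weights, one column per class, first row is the bias."""
    rows = [[1.0] + list(v) for v in reduced_vectors]
    targets = [[1.0 if lab == c else 0.0 for c in classes] for lab in labels]
    dim = len(rows[0])
    gram = [[sum(r[j] * r[k] for r in rows) + (ridge if j == k else 0.0)
             for k in range(dim)] for j in range(dim)]
    rhs = [[sum(r[j] * t[c] for r, t in zip(rows, targets))
            for c in range(len(classes))] for j in range(dim)]
    return solve(gram, rhs)

def classify(weights, reduced_vector, classes):
    """Pick the class whose linear discriminant gives the highest score."""
    x = [1.0] + list(reduced_vector)
    scores = [sum(x[j] * weights[j][c] for j in range(len(x)))
              for c in range(len(classes))]
    return classes[scores.index(max(scores))]

# Hypothetical reduced training vectors (n = 2) with known classes.
classes = ["A", "B"]
reduced_vectors = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]]
labels = ["A", "A", "A", "B", "B", "B"]
weights = train_linear_classifier(reduced_vectors, labels, classes)
print(classify(weights, [0.5, 0.5], classes))
print(classify(weights, [5.5, 5.5], classes))
```

With n = 200 instead of m = 2500 inputs, the system of equations solved during training has 201 unknowns per class rather than 2501, which illustrates why the reduction simplifies the classifier design.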

The classifier setting made in the training phase is retained for the classification phase. From a text to be classified, an m-dimensional feature vector is determined by comparison with the descriptor list and is converted by means of the m × n transformation matrix into a reduced vector of dimension n. The classifier is fed with the reduced vector and outputs an assignment of the underlying text to one of, for example, 6 provided text classes.
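The classification phase can be sketched end to end as follows. The descriptor list, the transformation matrix, the classifier weights and the class names are hand-picked toy values standing in for the results of a training phase (m = 4, n = 2, two classes instead of six); counting with `str.count` and applying the matrix without centering are likewise assumptions of this illustration.

```python
def reduce_vector(feature_vector, transformation):
    """Compute y = T^T x for an m x n matrix T given as a list of m rows."""
    n = len(transformation[0])
    return [sum(feature_vector[j] * transformation[j][i]
                for j in range(len(feature_vector)))
            for i in range(n)]

def classify_text(text, descriptors, transformation, weights, classes):
    """Feature vector -> reduced vector -> linear discriminants -> class."""
    x = [text.count(d) for d in descriptors]      # non-overlapping counts
    y = reduce_vector(x, transformation)
    z = [1.0] + y                                 # leading 1 for the bias row
    scores = [sum(z[j] * weights[j][c] for j in range(len(z)))
              for c in range(len(classes))]
    return classes[scores.index(max(scores))]

# Hypothetical results of a training phase.
descriptors = ["inv", "pay", "del", "lat"]
transformation = [[0.7, 0.0], [0.7, 0.0], [0.0, 0.7], [0.0, 0.7]]  # m x n
weights = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]                     # (n + 1) x classes
classes = ["billing", "delivery"]

print(classify_text("invoice payment overdue", descriptors,
                    transformation, weights, classes))
print(classify_text("delivery was late", descriptors,
                    transformation, weights, classes))
```

Only the descriptor list, the transformation matrix and the classifier weights need to be stored after training; the training texts themselves are not consulted during classification.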

Claims

1. A method for classifying a natural-language text, in which a text is checked for the presence of defined features and values for the components of an m-dimensional feature vector are derived therefrom, characterized in that the m-dimensional feature vector is converted by means of a predetermined transformation matrix of dimension m × n into a reduced vector of lower dimension n, and the classification is carried out using the reduced vector.

2. The method according to claim 1, characterized in that the vector dimensions satisfy m > 3 × n.

3. The method according to claim 1 or claim 2, characterized in that m > 1000 is specified.

4. The method according to any one of claims 1 to 3, characterized in that a polynomial classifier is used.

5. The method according to any one of claims 1 to 4, characterized in that the transformation matrix is determined from feature vectors for training texts of a training text collection by applying a principal axis transformation to the feature vectors of the training texts and using only the eigenvectors belonging to the n largest eigenvalues to form the transformation matrix.

6. The method according to any one of claims 1 to 5, characterized in that the predefined features are obtained from the training text collection.

7. The method according to any one of claims 1 to 6, characterized in that the components of the m-dimensional feature vector are binary and represent only the detection or non-detection of a feature.

8. The method according to any one of claims 1 to 6, characterized in that the components of the feature vector each represent the frequency of occurrence of a feature in the checked text.