EP3008635A1

EP3008635A1 - Method for automatic thematic classification of a digital text file

Info

Publication number: EP3008635A1
Application number: EP14728537.3A
Authority: EP
Inventors: François-Régis CHAUMARTIN
Original assignee: Proxem
Current assignee: Proxem
Priority date: 2013-06-14
Filing date: 2014-06-04
Publication date: 2016-04-20
Also published as: WO2014198595A1; FR3007164B1; US20160140220A1; FR3007164A1

Abstract

The invention primarily relates to a method for the thematic classification of a digital text file (1) from an encyclopaedic database (5) comprising a category graph (G), said method comprising, during a learning phase (PA) making it possible to develop a thematic classification model (3), the steps of: grouping together, for each category node, all of the articles directly attached to that category node so as to obtain a "bag of words" for each category node; determining a so-called term-frequency vector characteristic of the category node; and combining, on each category node, the term-frequency vector directly attached to it with the term-frequency vectors of more specific nodes; said method further comprising, during a production phase (PP), a step of calculating the term-frequency vector (V) of said digital text file (1) and selecting, in said thematic classification model (3), the N category nodes having the term-frequency vectors (V') closest to the term-frequency vector (V) of the digital text file.

Description

METHOD FOR AUTOMATIC THEMATIC CLASSIFICATION OF A DIGITAL TEXT FILE

[01] TECHNICAL FIELD OF THE INVENTION

[02] The invention relates to a method for the automatic thematic classification of a digital text file. The invention thus belongs to the field of computer science applied to language.

[03] BACKGROUND TECHNOLOGY

[04] Categorization is the process of associating a given document with one or more predefined categories (or labels). The goal of automatic text categorization is to infer a classification of a document automatically by analyzing its content. The very nature of the predefined categories varies according to the objectives; the aim may be to identify the language of the text or the themes addressed, but also, for example, the desired priority for processing the document, or the feelings expressed. The difficulty of the task varies according to the type and length of the document: a tweet, an email, a press article, a scientific document or a consumer opinion are generally not analyzed in the same way.

[05] In addition, the categorization of a digital text file usually requires a significant upfront investment, with an adaptation that depends on the domain concerned. Indeed, the operational stages prior to the learning of a classification are most often: i) the constitution of the classification plan, ii) the manual annotation of the learning corpus, iii) the definition of the linguistic characteristics used by the learning algorithm. These operations are time-consuming and their result is generally applicable only to the particular domain concerned by the predefined categories, and to the types of documents representative of the learning corpus.

[06] Machine learning methods for categorization are known. Thus, the document Sebastiani, 2002, "Machine Learning in Automated Text Categorization", ACM Computing Surveys, Vol. 34, No. 1, pages 1-47, provides a comparative table of possible methods and applications. Dasari, 2012, "Text Categorization and Machine Learning Methods: Current State of the Art", GJCST, Vol. 12, No. 11, completes this state of the art with more recent approaches and measures the progress made in 10 years.

[07] A question arises about classification plans, which are usually defined for a particular area. Indeed, it is necessary to know which set of predefined categories would have sufficient coverage to categorize, in a reasonably generic way, a text coming from anywhere.

[08] The categories of the "Wikipedia" online database have recently emerged as a possible universal classification scheme of this kind. Schoenhofer, P, 2009, "Identifying document topics using the Wikipedia category network", Web Intelligence and Agent Systems, Vol. 7, No. 2, pages 195-207, proposes to use them to carry out a thematic categorization with a simple algorithm that merely exploits the titles and categories of the articles. A related idea is presented in Yun et al., 2011, "Topic Extraction Based on Wikipedia Category", Proceedings of Computational Sciences and Optimization (CSO). The Wikipedia categories also serve as a reference in the YAGO ontology disclosed by SUCHANEK F., et al., "YAGO: a core of semantic knowledge", WWW 2007, pp. 697-706.

[09] However, the known methods propose a thematic classification subject to categorization errors due to the raw processing of category data from the Wikipedia database. There is therefore a need for a more robust and accurate method than existing methods.

[010] OBJECT OF THE INVENTION

[011] The invention aims to meet this need by proposing a method of thematic classification of a digital text file from an encyclopedic database comprising a category graph defined by a set of category nodes to each of which at least one article is attached, a generic category node being connected to zero, one, or more nodes of more specific category, characterized in that said method comprises, during a learning phase for developing a thematic classification model, the steps of grouping, for each category node, all articles directly attached to said category node so as to obtain for each category node a set of words called a "bag of words", of determining a vector, called the term-frequency vector, characteristic of the category node and corresponding to the number of occurrences of each word in the bag of words, and of combining on each category node the term-frequency vector directly attached to it with the term-frequency vectors of more specific nodes, and in that it comprises, during a production phase, the step of calculating the term-frequency vector of said digital text file and of retaining in said thematic classification model the N category nodes having the term-frequency vectors closest to the term-frequency vector of the digital text file.

[012] The invention thus makes it possible to process any digital text file in a generic and automatic manner, that is to say without first imposing a learning phase specific to the domain or to the language of the document. The invention makes it possible to finely associate, with any text written in a given language, categories in that language, preferably represented in the form of a graph.

[013] The use of an inter-language index of the database may, in some implementations, make it possible to obtain a subset of these categories in languages other than that of the original text. It is thus possible to perform a cross-language search of the documents associated with a given theme.

[014] According to one implementation, the method further comprises the step of reconstituting a computer representation in the form of a graph of the selected category nodes.

[015] According to one implementation, the method comprises the step of removing any cycles of the category graph to obtain a directed acyclic graph.

[016] According to one implementation, during the learning phase, a category node which is associated with a number of articles below a threshold is merged with a more generic category node, and the articles that were directly attached to it are attached to said more generic category node.

[017] According to one implementation, the combination consists in summing the term-frequency vector of each category node, called the target node, with the term-frequency vectors of the more specific category nodes directly connected to said target node, called subcategory nodes, said subcategory nodes being weighted.

[018] According to one implementation, for a target node having M sub-category nodes, each term-frequency vector of a subcategory node is weighted with a factor 1 / (M + 1).

[019] According to one implementation, the term-frequency vector or vectors of the N category nodes closest to the term-frequency vector of the digital text file are those maximizing the dot product with the term-frequency vector of the digital text file.

[020] According to one implementation, said dot product is weighted with TF.IDF and/or Okapi BM25 type techniques.

[021] According to one implementation, the method comprises the step of establishing a classification of the digital text file into categories in a language other than that of the digital text file by means of an inter-language index associating with a category node its translations into other languages.

[022] According to one implementation, the method comprises the step of deleting the low-relevance category nodes having a degree less than or equal to a threshold.

[023] According to one implementation, the encyclopedic database is the "Wikipedia" (registered trademark) database.

[024] According to one implementation, the encyclopedic database consists of consumer opinions grouped by categories.

[025] BRIEF DESCRIPTION OF THE FIGURES

[026] The invention will be better understood on reading the description which follows and on examining the figures that accompany it. These figures are given by way of illustration and in no way limit the invention.

[027] Figure 1 is a schematic representation of the various elements involved in the implementation of the automatic classification method according to the invention;

[028] Figure 2 shows a diagram of the various steps of the automatic classification method according to the invention;

[029] Figures 3a to 3f show the different processing operations performed on a category graph during a learning phase of the automatic classification method according to the invention;

[030] Figures 4a to 4d show the different operations performed on a category graph during a production phase of the automatic classification method according to the invention.

[031] Identical, similar or analogous elements retain the same reference from one figure to another.

[032] DESCRIPTION OF EXAMPLES OF EMBODIMENT OF THE INVENTION

[033] As shown in FIG. 1, the thematic classification method according to the invention makes it possible to automatically provide a list of relevant categories corresponding to a digital text file 1. The list of relevant categories is preferably displayed as a computer representation of a graph G1 in the language L1 corresponding to the language of the digital text file 1. This graph G1 may, if necessary, be transposed into several languages L2, L3, etc. to obtain the corresponding representations G2, G3, etc.

[034] For this purpose, a classifier 2, preferably taking the form of a search engine, uses a thematic classification model 3 providing the list of categories relevant to the analyzed file 1.

[035] More specifically, the thematic classification model 3 is developed by learning on an encyclopedic database 5 organized into categories to which articles are attached. This database is in this case the "Wikipedia" (trademark) database, processed as a "dump.xml" file by the module 8, but could alternatively be any other equivalent database. Alternatively, the encyclopedic database consists of consumer reviews grouped by categories.

[036] As shown in FIG. 3a, the encyclopedic database 5 comprises a category graph G, represented in a simplified manner, defined by a set of category nodes Ci to each of which at least one article Ai.j is attached, a so-called generic category node being connected by an arc to zero, one, or more category nodes more specific than the generic category. In one example, the category "shoes" is more generic than the categories "boots" or "sports shoes", which are specific categories.

[037] In this case, the generic category node C1 is linked to the specific category nodes C2, C3 and C4, which are themselves generic category nodes with respect to the specific category nodes C5, C6, C7 and C8. For a given category node Ci, a so-called "incoming" arc comes from a more generic category node, while a so-called "outgoing" arc leads to a more specific category node. In the example shown, one therefore goes from the most generic category node to the most specific category node when moving from top to bottom. However, this representation is purely arbitrary and could have been reversed.

[038] During a learning phase PA making it possible to develop the thematic classification model 3, the cycles of the category graph are removed in a step 101 to obtain a directed acyclic graph G (DAG) and thus avoid infinite loops.

[039] For this purpose, use is made of the algorithm described in Tarjan (1972), "Depth-first search and linear graph algorithms", SIAM Journal on Computing, Vol. 1, No. 2, pp. 146-160, which detects the strongly connected components of a directed graph by a depth-first search starting from the roots, that is to say the category nodes Ci devoid of an incoming arc. An arc is then removed locally until all cycles are eliminated. The choice of the arc to be deleted is arbitrary and in this case consists in selecting those which connect the lowest category nodes Ci in the hierarchy. Thus, in the example shown, the cycle between the category nodes C7 and C1 is eliminated to obtain the graph G of FIG. 3b.
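By way of non-limiting illustration, the cycle removal of step 101 can be sketched in Python as follows. The sketch does not implement the Tarjan algorithm cited above: it uses a simpler depth-first traversal that drops every arc pointing back to a node still being explored, which is only one possible rule for choosing the arc to delete. The function name remove_cycles and the dictionary representation of the graph are assumptions made for the example. The recursion is suitable only for small graphs.

```python
def remove_cycles(graph):
    """Return an acyclic copy of `graph`.

    `graph` maps each category node to the list of its more specific
    nodes (outgoing arcs). Hypothetical representation, not from the source.
    """
    WHITE, GREY, BLACK = 0, 1, 2
    targets = [child for children in graph.values() for child in children]
    ordered = list(dict.fromkeys(list(graph) + targets))  # stable, no duplicates
    has_incoming = set(targets)
    color = {node: WHITE for node in ordered}
    result = {node: [] for node in ordered}

    def visit(node):
        color[node] = GREY  # node is on the current exploration path
        for child in graph.get(node, []):
            if color[child] == GREY:
                continue  # arc closes a cycle: it is dropped
            result[node].append(child)
            if color[child] == WHITE:
                visit(child)
        color[node] = BLACK  # node fully explored

    # Start from the roots (no incoming arc), then from any node left over,
    # which happens when a cycle passes through the most generic node.
    roots = [node for node in ordered if node not in has_incoming]
    for start in roots + ordered:
        if color[start] == WHITE:
            visit(start)
    return result


# Simplified graph inspired by FIG. 3a, with a cycle between C7 and C1.
example = {"C1": ["C2", "C3", "C4"], "C4": ["C7", "C8"], "C7": ["C1"]}
acyclic = remove_cycles(example)  # the arc from C7 to C1 is removed
```

In this sketch the arc that is dropped is always the one leaving the deepest node of the cycle on the current exploration path, which loosely matches the choice of arcs connecting the lowest nodes in the hierarchy.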

[040] Furthermore, during a step 102, a category node Ci associated with a number of articles below a threshold is merged with its more generic category node Ci, and the articles Ai.j that were directly attached to it are attached to said more generic category node Ci. In the example shown in FIGS. 3c and 3d, the threshold on the number of articles Ai.j attached to a category node Ci being equal to three, the category node C8, which contains too few articles A8.1, A8.2, A8.3, is merged with its more generic category node C4, and the attached articles A8.1 to A8.3 are moved up to the more generic category node C4.

[041] For each category node, all the texts of the articles directly attached to the category node are grouped in a step 103 so as to obtain for each category a set of words called a "bag of words".
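The merging of step 102 can be sketched as follows, purely as an illustration. The sketch assumes that each category node has a single more generic node (a tree), that the sub-categories of a merged node are re-attached to the more generic node, and that "below a threshold" means strictly fewer articles than min_articles. None of these points is specified in the description. The function and parameter names are hypothetical.

```python
def merge_small_categories(children, articles, root, min_articles):
    """Merge under-populated category nodes into their more generic node.

    children: dict mapping a category to its more specific categories
    articles: dict mapping a category to the articles directly attached
    Returns new (children, articles) dicts; the inputs are not modified.
    """
    children = {cat: list(subs) for cat, subs in children.items()}
    articles = {cat: list(arts) for cat, arts in articles.items()}

    def visit(node):
        # Most specific categories are handled first (post-order), so a
        # node that absorbs articles from below is re-evaluated afterwards.
        for child in list(children.get(node, [])):
            visit(child)
            if len(articles.get(child, [])) < min_articles:
                # Move the child's articles up to the more generic node.
                articles.setdefault(node, []).extend(articles.pop(child, []))
                # Remove the child and re-attach its own sub-categories.
                children[node].remove(child)
                children[node].extend(children.pop(child, []))

    visit(root)
    return children, articles


# Hypothetical data: C8 holds two articles and the minimum is three.
tree = {"C4": ["C7", "C8"]}
attached = {"C4": ["a1", "a2", "a3"], "C7": ["b1", "b2", "b3"], "C8": ["c1", "c2"]}
new_tree, new_attached = merge_small_categories(tree, attached, "C4", 3)
```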

[042] A vector Vi, called the term-frequency vector, characteristic of the category node Ci and corresponding to the number of occurrences of each word in the "bag of words", is determined in a step 104. Thus, as shown in FIG. 3e for example, the vector V4 associated with the category node C4 is defined by a term t1 having a number of occurrences f4.1, a term t2 having a number of occurrences f4.2, a term t3 having a number of occurrences f4.3, and so on, up to a term tk having a number of occurrences f4.k.

[043] Beforehand, the texts of the articles may have been processed, for example via the search engine called "Lucene", according to a sequence of operations that is conventional in information retrieval, such as the segmentation of the text into words, the normalization of their case, the removal of diacritics, the elimination of grammatical words ("stop words", such as articles in particular), stemming and the counting of terms. One advantage of the "Lucene" engine is that it provides these operations as standard for about thirty languages.
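As an illustration of steps 103 and 104, the following sketch builds the bag of words of a category node and counts the occurrences of each term using only the Python standard library. It is not based on Lucene: the tokenization rule, the very short stop-word list and the function names are assumptions, and stemming is omitted.

```python
import re
import unicodedata
from collections import Counter

# Tiny illustrative stop-word list (assumption; a real list is much longer).
STOP_WORDS = frozenset({"a", "an", "and", "are", "in", "is", "of", "the", "to"})


def tokenize(text):
    """Segment into words, lower-case, strip diacritics, drop stop words."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    words = re.findall(r"[a-z0-9]+", stripped)
    return [word for word in words if word not in STOP_WORDS]


def term_frequency_vector(article_texts):
    """Group the articles of one category node and count each term."""
    bag = Counter()
    for text in article_texts:
        bag.update(tokenize(text))
    return bag


vector = term_frequency_vector(["The café sells boots.", "Boots and sports shoes"])
# vector["boots"] is 2 and vector["cafe"] is 1; "the" and "and" are dropped.
```

The same function can be applied to the text of the file to be classified, so that the two vectors are built in the same way.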

[044] Next, the graph G is explored from the most generic roots down to the most specific leaves, which are devoid of an outgoing arc, and, during the recursive ascent, in a step 105, the term-frequency vector Vi directly attached to each category node Ci is combined with the term-frequency vectors of the more specific category nodes Ci. The objective is to associate with each category node Ci a representative term-frequency vector. The combination is made so that the texts directly attached to the category node make a majority contribution, whereas the texts attached to the more specific categories make a minority contribution. In this case, the term-frequency vector Vi of each category node, called the target node, is summed with the term-frequency vectors Vi of the more specific category nodes Ci directly connected to said target node, called subcategory nodes Ci, said subcategory nodes Ci being weighted. This yields term-frequency vectors called optimized vectors Vi'.

[045] Preferably, for a target node having M subcategory nodes, each term-frequency vector of a subcategory node is weighted with a damping factor (for example 1 / (M + 1)). Thus, as illustrated in FIG. 3f, during a first step of the recursive ascent the term-frequency vector V4 is replaced by the term-frequency vector V4' = V4 + 0.5 * V7. The vector V7 is weighted by 0.5 because there is M = 1 subcategory node below C4, so that the linear combination factor is 1 / (M + 1) = 1 / (1 + 1) = 0.5. These operations are repeated until the whole graph has been traversed.
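The combination of step 105 with the damping factor 1 / (M + 1) can be sketched as follows. The sketch assumes that the vector taken from each subcategory node is its already optimized vector, which is one reading of the recursive ascent. The description's own example only involves a leaf node, for which both readings coincide. The function name and the dictionary representation are hypothetical.

```python
def combine_vectors(children, vectors, root):
    """Compute the optimized vectors Vi' by a recursive ascent.

    children: dict mapping a category to its subcategory nodes (a DAG)
    vectors:  dict mapping a category to its own term-frequency vector
    Returns a dict mapping each reachable category to its optimized vector.
    """
    optimized = {}

    def visit(node):
        if node in optimized:  # subcategory shared by several nodes
            return optimized[node]
        subs = children.get(node, [])
        combined = {term: float(freq) for term, freq in vectors.get(node, {}).items()}
        weight = 1.0 / (len(subs) + 1)  # damping factor 1 / (M + 1)
        for sub in subs:
            for term, freq in visit(sub).items():
                combined[term] = combined.get(term, 0.0) + weight * freq
        optimized[node] = combined
        return combined

    visit(root)
    return optimized


# Hypothetical counts: C4 has a single subcategory node C7, so M = 1.
result = combine_vectors(
    {"C4": ["C7"]},
    {"C4": {"boot": 4, "shoe": 2}, "C7": {"boot": 2, "lace": 6}},
    "C4",
)
# result["C4"] is V4 + 0.5 * V7: boot 5.0, shoe 2.0, lace 3.0
```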

[046] The categories Ci and their optimized term-frequency vectors Vi' are indexed in a search index 10 stored in the classification model 3.

[047] During a production phase PP, in a step 201, the term-frequency vector V of the digital text file 1 is computed in the same way as the term-frequency vector Vi of the articles Ai.j directly attached to a category Ci.

[048] The actual classification is carried out, in a step 202, by running a search in the previously built search index 10 by means of the search engine 2, which then returns the "flat" list of the N most relevant categories, that is to say those whose optimized term-frequency vector Vi' is closest to the term-frequency vector V of the text. N, which can be set by the user, is typically between 5 and 30. A "flat" category list means that the categories are not organized hierarchically as a graph, since the categories are not stored as a graph in the search index 10.

[049] Preferably, it is considered that the optimized term-frequency vectors Vi' of the categories closest to the term-frequency vector V of the digital text file 1 are those which maximize the dot product between the term-frequency vector V and the optimized term-frequency vector Vi' of a category Ci. Preferably, the dot product is weighted with TF.IDF and/or Okapi BM25 type techniques.

[050] Thus, as shown in FIG. 4a, the digital text file 1 will be associated with the category nodes C3, C4, C7, C1, C2, the parameter "p" corresponding to the level of relevance of each category node.
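The selection of step 202 can be sketched as a plain in-memory ranking by dot product, without a search engine. The optional term_weights argument stands in for a TF.IDF-style weighting. The inverse document frequency formula used here, log(1 + N / df), is one smoothed variant chosen as an assumption, and Okapi BM25 is not implemented. All function names and numerical values are hypothetical.

```python
import heapq
import math


def inverse_document_frequencies(category_vectors):
    """Weight each term by its rarity across category nodes (assumed formula)."""
    total = len(category_vectors)
    doc_freq = {}
    for vector in category_vectors.values():
        for term in vector:
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(1 + total / df) for term, df in doc_freq.items()}


def top_categories(text_vector, category_vectors, n, term_weights=None):
    """Return the n (category, score) pairs with the largest dot product."""

    def score(vector):
        total = 0.0
        for term, freq in text_vector.items():
            weight = 1.0 if term_weights is None else term_weights.get(term, 0.0)
            total += weight * freq * vector.get(term, 0.0)
        return total

    scored = [(cat, score(vec)) for cat, vec in category_vectors.items()]
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


categories = {
    "C4": {"boot": 5.0, "shoe": 2.0, "lace": 3.0},
    "C7": {"boot": 2.0, "lace": 6.0},
    "C3": {"hat": 4.0},
}
text = {"boot": 2, "lace": 1}
best = top_categories(text, categories, 2)  # [("C4", 13.0), ("C7", 10.0)]
weighted = top_categories(text, categories, 2, inverse_document_frequencies(categories))
```

The score returned for each category plays the role of the relevance parameter "p" in this sketch.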

[051] In a step 203, the local graph shown in FIG. 4b is reconstructed from the categories retained by the search engine 2 using the shape of the category graph. This shape of the category graph corresponds to the information 11 on the arcs connecting the category nodes, stored previously in the classification model at the time of the creation of the search index 10.

[052] In the graph G1, it is possible to adapt the display color of the category nodes Ci to their relevance "p", the more relevant category nodes having a dark display while the less relevant ones have a lighter display.

[053] In a step 204, the topology of the graph is used to delete the category nodes Ci of low relevance which are not closely related to the others, for example the nodes of degree at most 1, that is to say those which have a single arc or none. In the example of FIG. 4c, the category node C3 has been deleted.
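Steps 203 and 204 can be sketched as follows. The local graph keeps only the stored arcs whose two ends were retained, and a node is deleted when its degree in the local graph is at most max_degree and its relevance is below min_relevance. The relevance cut-off, the function names and the relevance values of the example are assumptions made for the illustration; the description does not give numerical values.

```python
def local_graph(arcs, retained):
    """Keep only the arcs whose two ends belong to the retained categories."""
    keep = set(retained)
    return [(a, b) for a, b in arcs if a in keep and b in keep]


def prune_low_relevance(retained, arcs, max_degree=1, min_relevance=0.5):
    """Delete weakly connected, low-relevance nodes from the local graph.

    retained: dict mapping each retained category to its relevance p
    arcs:     list of (generic, specific) arcs stored with the model
    Returns the kept categories and the arcs between them.
    """
    local = local_graph(arcs, retained)
    degree = {category: 0 for category in retained}
    for a, b in local:
        degree[a] += 1
        degree[b] += 1
    kept = {
        category: p
        for category, p in retained.items()
        if not (degree[category] <= max_degree and p < min_relevance)
    }
    return kept, local_graph(local, kept)


# Hypothetical arcs and relevance values for the five retained nodes.
stored_arcs = [("C1", "C2"), ("C1", "C3"), ("C1", "C4"), ("C4", "C7"), ("C2", "C5")]
relevance = {"C3": 0.2, "C4": 0.9, "C7": 0.8, "C1": 0.7, "C2": 0.6}
kept, kept_arcs = prune_low_relevance(relevance, stored_arcs)
# With these values only C3 is deleted.
```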

[054] If the encyclopedic database 5 contains an inter-language index 12 which associates with a category node Ci its translations Ci', Ci", etc. in other languages, the use of this index 12 by the classification model 3 makes it possible to establish directly, in a single step, a classification of the text file 1 into categories of another language L2, L3.

[055] Thus, as shown in FIG. 4d, the category nodes C1', C2', C4' of the graph G2 correspond to the translation into another language L2 of the initial category nodes C1, C2, C4 in the language L1, whereas there is no match for the category node C7 in the other language L2. Indeed, it should be noted that the completeness of the inter-language index 12 is uneven and varies according to the category nodes Ci. For those deemed important by the users, links are provided to a large number of languages; on the other hand, there will sometimes be no link for a category that is too fine-grained or of secondary interest.
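The use of the inter-language index 12 can be illustrated by a simple lookup in which categories without a translation in the target language are left out, as for C7 in FIG. 4d. The nested dictionary layout and the function name are assumptions made for the example.

```python
def translate_categories(categories, interlanguage_index, target_language):
    """Map each category to its label in the target language, when one exists."""
    translated = {}
    for category in categories:
        label = interlanguage_index.get(category, {}).get(target_language)
        if label is not None:
            translated[category] = label
    return translated


# Hypothetical index: C7 has no link towards language L2.
index = {
    "C1": {"L2": "C1'"},
    "C2": {"L2": "C2'"},
    "C4": {"L2": "C4'", "L3": 'C4"'},
    "C7": {},
}
in_l2 = translate_categories(["C1", "C2", "C4", "C7"], index, "L2")
# in_l2 contains C1', C2' and C4' only.
```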

[056] One skilled in the art will of course be able to make modifications to the above method. Thus, alternatively, the classifier may be based on techniques such as HMM (Hidden Markov Model), SVM (Support Vector Machine), maximum entropy or neural networks.

Claims

1. A method for automatic thematic classification of a digital text file (1) from an encyclopedic database (5) comprising a category graph (G) defined by a set of category nodes (Ci) to each of which at least one article (Ai.j) is attached, a generic category node being connected to zero, one, or more nodes of more specific category, characterized in that said method comprises, during a learning phase (PA) for developing a thematic classification model (3), the steps of grouping, for each category node (Ci), all articles (Ai.j) directly attached to said category node (Ci) so as to obtain for each category node (Ci) a set of words called a "bag of words", determining a vector (Vi), called the term-frequency vector, characteristic of the category node (Ci) and corresponding to the number of occurrences of each word in the bag of words, and combining on each category node the term-frequency vector (Vi) directly attached to it with the term-frequency vectors of more specific nodes, and in that it comprises, during a production phase, the step of calculating the term-frequency vector (V) of said digital text file (1) and retaining in said thematic classification model (3) the N category nodes having the term-frequency vectors (Vi') closest to the term-frequency vector (V) of the digital text file (1).

2. Method according to claim 1, characterized in that it further comprises the step of reconstructing a computer representation, in the form of a graph (G1), of the retained category nodes.

3. Method according to claim 1 or 2, characterized in that it comprises the step of removing any cycles of the category graph (G) to obtain a directed acyclic graph.

4. Method according to one of claims 1 to 3, characterized in that during the learning phase, a category node (Ci) which is associated with a number of articles less than a threshold is merged with a more generic category node, and the articles (Ai.j) which were directly attached to it are attached to said more generic category node.

5. Method according to one of claims 1 to 4, characterized in that the combination consists in summing the term-frequency vector (Vi) of each category node (Ci), called the target node, with the term-frequency vectors of the more specific category nodes (Ci) directly connected to said target node, called subcategory nodes, said subcategory nodes being weighted.

6. Method according to claim 5, characterized in that for a target node having M sub-category nodes, each term-frequency vector (Vi) of a subcategory node is weighted with a factor 1 / (M + 1).

7. Method according to one of claims 1 to 6, characterized in that the one or more term-frequency vectors (Vi ') of the N category nodes closest to the term-frequency vector of the digital text file (1) are those maximizing the scalar product with the term-frequency vector (V) of the digital text file (1).

8. Method according to claim 7, characterized in that said scalar product is weighted with TF.IDF and/or Okapi BM25 type techniques.

9. Method according to one of claims 1 to 8, characterized in that it comprises the step of establishing a classification of the digital text file (1) in categories in a language other than that of the digital text file (1) by means of an inter-language index (12) associating with a category node (Ci) its translations into other languages.

10. The method as claimed in claim 2, characterized in that it includes the step of deleting category nodes (Ci) of low relevance having a degree less than or equal to a threshold.

11. Method according to one of the preceding claims, characterized in that the encyclopedic database (5) is the "Wikipedia" (registered trademark) database.

12. Method according to one of claims 1 to 10, characterized in that the encyclopedic database (5) consists of consumer reviews grouped by categories.