FR3059797A1

FR3059797A1 - AUTOMATED METHOD OF ESTABLISHING THESAURUS OF NAMED ENTITIES THAT CAN INCLUDE A PLURALITY OF HIERARCHICAL LEVELS, AND THE USE OF SUCH THESAURUS

Info

Publication number: FR3059797A1
Application number: FR1662034A
Authority: FR
Inventors: Christophe Lecante; Florian Carichon; Romain Billet
Original assignee: Tecknowmetrix Sas
Current assignee: Tecknowmetrix Sas
Priority date: 2016-12-07
Filing date: 2016-12-07
Publication date: 2018-06-08
Anticipated expiration: 2036-12-07
Also published as: FR3059797B1

Abstract

The invention relates to a method for automatically building, from textual data, a thesaurus of named entities, where a named entity may comprise a plurality of entities of different hierarchical levels. The method comprises: a step of extracting, from the textual data, a character string designating a named entity; a step of processing the character string to establish a list comprising at least one entity segment; a step of eliminating from the list the entity segments already present in the thesaurus, to form a list of new segments; and a step of updating the thesaurus for each new segment in the list of new segments. The invention also relates to a method for automatically indexing and identifying textual data using the thesaurus.

Description

AUTOMATED METHOD FOR BUILDING A THESAURUS OF NAMED ENTITIES THAT MAY INCLUDE A PLURALITY OF HIERARCHICAL LEVELS, AND USE OF SUCH A THESAURUS.


FIELD OF THE INVENTION

The present invention relates to the automated processing of textual data. More specifically, it concerns the exploitation of textual data for fine-grained analysis of the information they contain, in particular when the data relate to named entities that may have several hierarchical levels.

TECHNOLOGICAL BACKGROUND OF THE INVENTION

It is known to use automated means for extracting information from collections of textual data in order to build databases from which more in-depth analyses can be conducted. To build these databases, it may be necessary to identify categories of named entities (address, organization, product range, etc.) in the original information and to recognize all occurrences of the same named entity.

The disparity of forms, whether graphic or syntagmatic, that refer to the same named entity poses a problem. The origin of this disparity is well known: it may stem from typographical or spelling errors in the original information, from problems related to the transcription and/or transliteration of certain terms, or from the disparity of the information sources and in particular their lexical heterogeneity. To overcome this variability in form, a normalization step is generally provided when a database is built, in order to reduce all the names relating to the same entity to a standard form, called the normalized form.

In this regard, reference may be made to the document « Normalisation des entités nommées : allier règles déclaratives, ressources endogènes et processus centré sur l'utilisateur » by V. Andréani et al., Revue Canadienne des Sciences de l'Information et de Bibliothéconomie, Volume 35 (3) (pp. 229-263).

Certain named entities may have several hierarchical levels. This is the case, for example, of organizations such as companies (head office, subsidiary, division, departments); teaching and research institutions (university, laboratory, department); or even addresses of places (country, city, street).

It is generally not known in advance how the textual data will be used once they have been organized into a database. In particular, for named entities that may have several hierarchical levels, it cannot be predicted at which hierarchical level one will wish to select data from the database.

Thus, the aforementioned prior art document proposes taking the highest hierarchical level as the normalized form of the named entity. For example, according to that document, the named entity "Laboratory Y of University X" is normalized to "University X". This choice does not allow a fine-grained analysis to be carried out on the resulting database, for example searching for all the information relating to Laboratory Y of University X.

Alternatively, one could choose the lowest hierarchical level as the normalized form. The named entity of the previous example could, for instance, be normalized to "Laboratory Y, University X". But this approach does not make it easy to conduct an analysis at a higher level (searching for all the information relating to University X), except by manually grouping the information extracted for each of the named entities attached to the same higher-level entity. This work can be laborious and must be repeated for each new query.

OBJECT OF THE INVENTION

The present invention aims to overcome all or some of the aforementioned drawbacks. In particular, it aims to allow records of a database to be selected according to different hierarchical levels of a named entity, without requiring the database to be reprocessed or the normalized forms of entities hierarchically attached to the same entity to be grouped manually. In other words, the invention aims to provide a tool that lets a user choose, at will, the hierarchical level at which he or she wishes to work.

BRIEF DESCRIPTION OF THE INVENTION

With a view to achieving one of these aims, the invention proposes a method for automatically building, from textual data, a thesaurus of named entities, where a named entity may comprise a plurality of entities of different hierarchical levels. The thesaurus forms a database made up of a plurality of entity records, each record associating, for a given entity:

- a normalized form of the given entity;

- at least one character string extracted from the textual data and designating the given entity;

- and, where applicable, at least one link to another record of the thesaurus corresponding to an entity to which the given entity is hierarchically linked.

According to the invention, the method for automatically building the thesaurus comprises:

- a step of extracting, from the textual data, a character string designating a named entity;

- a step of processing the character string to establish a list comprising at least one entity segment;

- a step of eliminating from the list the entity segments already present in the thesaurus, to form a list of new segments;

- a step of updating the thesaurus for each new segment in the list of new segments.
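The processing, elimination and update steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the thesaurus is reduced to a dictionary mapping a normalized form to the set of strings designating it, the processing step is reduced to splitting on punctuation, each new segment is used as its own normalized form, and the names `split_segments` and `update_thesaurus` are hypothetical. The extraction step is assumed to have already produced the input string.

```python
import re

def split_segments(raw: str) -> list[str]:
    """Processing step (simplified): split on commas, semicolons and slashes."""
    return [s.strip() for s in re.split(r"[,;/]", raw) if s.strip()]

def update_thesaurus(thesaurus: dict[str, set[str]], raw: str) -> list[str]:
    """Process one extracted string, drop known segments, add the new ones.

    `thesaurus` maps a normalized form to the set of strings designating it.
    Returns the list of new segments that were added.
    """
    segments = split_segments(raw)
    # Elimination step: a segment is "known" if any record already lists it.
    known = {form.lower() for forms in thesaurus.values() for form in forms}
    new_segments = [s for s in segments if s.lower() not in known]
    # Update step (assumption): each new segment becomes its own normalized form.
    for seg in new_segments:
        thesaurus.setdefault(seg, set()).add(seg)
    return new_segments

thesaurus = {"CNRS": {"CNRS", "C.N.R.S"}}
added = update_thesaurus(thesaurus, "Laboratoire Y, Université X, CNRS")
print(added)
```

Running the same string a second time returns an empty list, since all of its segments are by then present in the thesaurus.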

According to other advantageous and non-limiting features of the invention, taken alone or in any technically feasible combination:

• the processing step comprises a step of delimiting at least one entity segment by identifying punctuation marks in the character string;

• the processing step comprises a step of determining the hierarchical level associated with each entity segment;

• the processing step comprises a step of rewriting each entity segment;

• the processing step comprises identifying standard trigger terms in each entity segment, the standard trigger terms being contained in a processing lexicon;

• the processing lexicon associates a standard trigger term with a hierarchical level and with a rule for rewriting the standard trigger term in an entity segment;

• the standard trigger term is a stop word and the rewriting rule consists in deleting the stop word from the entity segment;

• the method comprises a step of correcting errors in the entity segments;

• the step of updating the thesaurus comprises a step of creating a new record;

• the step of updating the thesaurus comprises a step of associating a new segment with a pre-existing record;

• the updating step is subject to prior validation and/or modification by a user.
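As an illustration of the optional features above, the following sketch delimits segments by punctuation, looks up trigger terms in a small processing lexicon to assign a hierarchical level, and applies the stop-word deletion rule. The lexicon entries, the level numbers (1 taken as the highest level) and the names `TRIGGER_LEVELS`, `STOP_WORDS`, `Segment` and `process` are assumptions made for the example; the text does not specify them.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical processing lexicon: trigger term -> hierarchical level
# (1 = highest). Entries and level numbers are illustrative only.
TRIGGER_LEVELS = {"université": 1, "university": 1,
                  "laboratoire": 2, "laboratory": 2}
# Stop words: trigger terms whose rewriting rule is deletion.
STOP_WORDS = {"de", "du", "des", "la", "le", "et", "of", "the", "and"}

@dataclass
class Segment:
    text: str             # rewritten segment
    level: Optional[int]  # hierarchical level, None when no trigger matched

def process(raw: str) -> list[Segment]:
    segments = []
    for part in re.split(r"[,;/]", raw):      # delimitation by punctuation
        # Rewriting: drop stop words from the segment.
        kept = [w for w in part.split() if w.lower() not in STOP_WORDS]
        if not kept:
            continue
        # Level determination: first word found in the lexicon, if any.
        level = next((TRIGGER_LEVELS[w.lower()] for w in kept
                      if w.lower() in TRIGGER_LEVELS), None)
        segments.append(Segment(" ".join(kept), level))
    return segments

result = process("Surface du Verre et Interfaces, Laboratoire Y, Université X")
for seg in result:
    print(seg)
```

A real lexicon would need many more trigger terms, and the error-correction and user-validation features are left out of this sketch.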

According to another aspect, the invention relates to a method for automatically indexing textual data, comprising:

- a step of extracting, from the textual data, a character string designating a named entity that may comprise a plurality of entities of different hierarchical levels;

- a step of processing the character string to establish a list comprising at least one entity segment;

- a step of searching, in a thesaurus of named entities built using the building method of the invention, for a matching record for each entity segment in the list;

- a step of linking, in an indexing table, the matching record or a reference to the matching record with the textual data or a reference to the textual data.

The invention also relates to a method for automatically identifying textual data using an indexing table obtained by the preceding method, in which the identification method comprises:

- a step of obtaining a designated entity to which the identification is to relate;

- a step of searching, in the thesaurus of entities, for the record corresponding to the designated entity;

- a step of identifying, from the indexing table, the textual data associated with the record corresponding to the designated entity.
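The three identification steps can be sketched as a two-stage lookup. In this minimal example the thesaurus is assumed to be a dictionary keyed by record index, the indexing table a dictionary mapping a record index to a set of document identifiers, and matching is a case-insensitive exact comparison. The names `find_record` and `identify` and the document identifiers are hypothetical.

```python
def find_record(thesaurus, designated):
    """Return the index of the record whose normalized form or text matches."""
    wanted = designated.casefold()
    for rec_id, rec in thesaurus.items():
        forms = [rec["designation"], *rec["texts"]]
        if any(wanted == f.casefold() for f in forms):
            return rec_id
    return None

def identify(designated, thesaurus, index_table):
    """Return the documents associated with the designated entity."""
    rec_id = find_record(thesaurus, designated)
    if rec_id is None:
        return set()
    return set(index_table.get(rec_id, ()))

# Illustrative data only; document identifiers are invented for the example.
thesaurus = {
    1: {"designation": "Surface Verre Interface",
        "texts": ["Surface Verre Interfaces, Unité Mixte de Recherche"],
        "links": [2]},
    2: {"designation": "CNRS",
        "texts": ["CNRS", "C.N.R.S", "Centre National Recherche Scientifique"],
        "links": []},
}
index_table = {1: {"article-A"}, 2: {"article-A", "article-B"}}

print(sorted(identify("C.N.R.S", thesaurus, index_table)))
```

Because a document is indexed under every entity segment found in it, querying the higher-level entity returns both documents while querying the lower-level one returns only the first.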

Finally, the invention relates to a computer program containing instructions suitable for implementing each of the steps of one of the methods presented above, when the program is executed on a computer.

BRIEF DESCRIPTION OF THE DRAWINGS

Other features and advantages of the invention will emerge from the detailed description that follows, with reference to the appended figures, in which:

Figure 1 shows an example of a thesaurus built by a method according to the invention;

Figure 2 shows a sequence of steps of a method for automatically indexing textual data according to the invention;

Figure 3 shows a sequence of steps of a method for automatically identifying textual data according to the invention;

Figure 4 shows a sequence of steps of a method for automatically building a thesaurus according to the invention.

DETAILED DESCRIPTION OF THE INVENTION

Thesaurus of entities.

The present invention provides for building a thesaurus of entities, where each named entity may comprise a plurality of entities of different hierarchical levels. As will be described in detail below, this thesaurus is built automatically from a collection of source textual data.

"Textual data" means any information represented in the form of text. It may thus be scientific and technical information, such as that contained in patents or scientific articles. This information may be unstructured or structured (for example organized as a database in which predetermined fields at least partially organize the information).

"Entity" means one hierarchical level of a named entity having several hierarchical levels. Thus the named entity "Laboratory X of University Y" comprises the two entities "Laboratory X" and "University Y". A named entity that has only one hierarchical level, for example "Company Z", comprises only one entity, "Company Z". To keep the present description fully general, the named entities will be considered to have a plurality of hierarchical levels. The invention is, however, in no way limited to this particular type of named entity, and the proposed processing applies to any type of named entity, whether or not it comprises a plurality of hierarchical levels.

A thesaurus according to the invention takes the form of a database made up of a plurality of records, each record describing a given entity. Such a record comprises at least the following fields:

- a designation field to hold a normalized form of the given entity;

- a text field to hold at least one character string extracted from the textual data and designating the given entity;

- and, where applicable, a link field to hold a link to another record of the thesaurus, this other record corresponding to an entity to which the given entity is hierarchically linked.

For each record of the thesaurus, the designation and text fields are necessarily filled in. The link field is filled in insofar as the corresponding entity is hierarchically linked to at least one other entity that is also recorded in the database.

Each entity may be linked to one or more other entities. Consequently, the link field may comprise one or more links, each link pointing to another record of the thesaurus. The link may be interpreted as the relation "belongs to". More generally, the presence of a link between two records makes it possible to determine the nature of the hierarchical relationship between the two entities, that is to say which of the two is hierarchically superior to the other.

The designation field may be defined as a character string. The text field may be defined as at least one character string. This string, extracted from the textual data, may have undergone processing; it is therefore not necessarily the raw text directly extracted from the source textual data.


The records forming the database may also include additional fields. They may thus comprise an index field that makes it easy to designate the records of the database and to manipulate them by computer.

As mentioned in the introduction to this application, the same entity may indeed appear in different forms in the source textual data. By attaching these different forms to the text field of a single record of the database constituting the thesaurus, an exhaustive and compact database is formed.

A titre d'illustration, les données textuelles d'origine sont une collection d'articles scientifiques. Les entités nommées correspondent au nom de l'organisation à laquelle sont affiliés les auteurs de ces articles. Un des documents de la collection fait apparaître le nom d'organisation « Surface du Verre et Interfaces, Unité Mixte de Recherche (UMR) 125, Centre National de la Recherche Scientifique/Saint-Gobain, Aubervilliers, France ». Après le traitement automatique qui sera détaillé par la suite de cette description, on dispose d'un thésaurus intégrant ces données et dont un extrait est représenté sur la figure 1.By way of illustration, the original textual data is a collection of scientific articles. The named entities correspond to the name of the organization with which the authors of these articles are affiliated. One of the documents in the collection shows the organizational name "Glass Surface and Interfaces, Joint Research Unit (UMR) 125, National Center for Scientific Research / Saint-Gobain, Aubervilliers, France". After the automatic processing which will be detailed later in this description, there is a thesaurus integrating this data and an extract of which is represented in FIG. 1.

A first record, of index 1, contains the designation "Surface Verre Interface" as the normalized form of the entity and is linked to the records of index 2 and 3 described below. In the example of FIG. 1, the thesaurus also has a text field, which for the record of index 1 holds the processed form (stop words removed) of the initial character string, "Surface Verre Interfaces, Unité Mixte de Recherche". The next record, of index 2, contains the designation "CNRS" as the normalized form of the entity. The existence of a link pointing to this record establishes that the entity "Surface Verre Interface" is of a lower level than the entity "CNRS". The raw text field of the record of index 2 contains several expressions (which may have been extracted from the textual data while indexing several scientific articles): "CNRS", "C.N.R.S" and "Centre National Recherche Scientifique".

The record of index 3 is similar to that of index 2 and, for the sake of brevity, its content is not detailed.

Automatic indexing of the source textual data

The invention takes advantage of this data structure to provide a method for automatically indexing textual data.

The method aims to build an index of the processed textual data, also called an "indexing table", that is to say a data structure associating a portion of the textual data (a source document, an entry of a source database, a tag in unstructured textual data) with the entities that this portion contains. The indexing table makes it possible to quickly identify the portion or portions of the textual data that contain a given named entity. The indexing table may of course have entries other than those corresponding to the portions of textual data and to the entities, but the present description is limited to these two fields for the sake of simplicity.

With reference to FIG. 2, a method for automatically indexing textual data according to the invention comprises a first step of extracting, from a portion of the source textual data, a character string designating a named entity that may include several hierarchical levels. In FIG. 2, this character string is "XYZ, ABC" and represents a named entity made up of the two entities "XYZ" and "ABC".

When the source textual data is structured, for example in the form of a database or of markers in the text, the extraction step consists in querying the database or locating the marker in the textual data to obtain the character string in return. When the source textual data is not structured, known methods of automatic named entity recognition can be used, for example those combining a formal grammar, statistical models and/or databases of entity categories.

The automatic indexing method continues with a step of processing the character string to identify at least one entity, that is to say at least one hierarchical level of the named entity. This processing includes segmenting the character string into one or more segments. Each segment, called an entity segment in the remainder of this description, corresponds to an entity, that is to say to one hierarchical level of the named entity. The processing step produces a list comprising at least one entity segment. In FIG. 2, this list is ["XYZ", "ABC"], made up of the two segments "XYZ" and "ABC".

This processing may include a step of delimiting at least one entity segment by identifying punctuation marks in the character string. In other words, the character string extracted in the previous step is cut into one or more segments, on the assumption that the different hierarchical levels of the named entity are separated by such marks (commas, full stops, etc.). The delimitation may rely on another criterion, such as the detection of a particular pattern or trigger in the character string.
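The delimitation by punctuation can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_segments` and the chosen set of delimiters (comma, semicolon, slash, and a full stop followed by whitespace) are hypothetical and are not specified by the description.

```python
import re

# Hypothetical delimiter set (an assumption): comma, semicolon, slash,
# or a full stop followed by whitespace. A full stop inside an acronym
# such as "C.N.R.S" is therefore not treated as a delimiter.
_DELIMITERS = re.compile(r"\s*[,;/]\s*|\.\s+")


def split_segments(entity_string: str) -> list[str]:
    """Cut an extracted character string into entity segments."""
    parts = _DELIMITERS.split(entity_string)
    # Drop empty fragments left by leading or trailing delimiters.
    return [p.strip() for p in parts if p and p.strip()]


segments = split_segments("XYZ, ABC")
print(segments)  # ['XYZ', 'ABC']
```

Applied to the string of FIG. 2, the sketch returns the two segments "XYZ" and "ABC". A real implementation would likely combine this with the pattern-based criteria mentioned above.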

The processing may also include a step of rewriting the character string and/or the segments, either to replace certain terms with a normalized form or simply to eliminate them, in particular when they are stop words such as prepositions or articles.

For example, when the named entities belong to the category of addresses, the character string "4 rue Béridot, 38500 Voiron, France" could be processed during the processing step to provide the following list of segments: [4; R. Béridot; 38500; VOIRON; FR]. In addition to the reorganization of the segments, note that the terms "France" and "Rue" have been rewritten in the normalized forms "FR" and "R.". Note also that the processing relied on punctuation marks and on the detection of predetermined triggers (of the type "C+ Rue L+" or "CCCCC", where C represents a digit, C+ a sequence of digits and L+ a sequence of letters) to correctly delimit the street number and the postal code.

In general, the automatic processing step can rely on a processing lexicon that associates typical triggers with delimitation and/or rewriting rules. To return to the previous example, the processing lexicon makes it possible to identify the typical triggers "France" and "Rue" in the character string and to rewrite them in their normalized forms "FR" and "R.". The processing lexicon also makes it possible to identify a sequence of 5 digits in an address as a postal code, this sequence forming one entity segment and the characters following it forming another entity segment designating the city. The processing lexicon may also include a list of stop words. The rewriting rule associated with these stop words then consists in removing them from the entity segment. For completeness, and although this feature is not necessarily used when executing the indexing method, the processing lexicon may associate each typical trigger with a hierarchical level (for example, associating the term "France" with its rewritten form "FR" and with hierarchical level "1"), so that the hierarchical relationships between the identified segments can be established. Using the processing lexicon therefore also makes it possible to identify the hierarchical relationship that may exist between entity segments.
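A minimal sketch of such a processing lexicon, applied to the address example, might look like this. All names (`REWRITES`, `STOP_WORDS`, `process_address`), the stop-word list, the two regular expressions and the upper-casing of the city are assumptions made for illustration; the description does not fix any of them.

```python
import re

# Hypothetical lexicon: trigger (lower-cased) -> (normalized form, level).
# The hierarchical level is stored but not used in this sketch.
REWRITES = {"france": ("FR", 1), "rue": ("R.", None)}
STOP_WORDS = {"de", "du", "des", "la", "le", "et"}  # illustrative only

# Triggers of the type "C+ Rue L+" and "CCCCC" followed by a city name.
STREET = re.compile(r"^(\d+)\s+(rue\s+.+)$", re.IGNORECASE)
POSTCODE = re.compile(r"^(\d{5})\s+(.+)$")


def rewrite(segment: str) -> str:
    """Remove stop words and replace known terms by their normalized form."""
    words = [w for w in segment.split() if w.lower() not in STOP_WORDS]
    return " ".join(REWRITES.get(w.lower(), (w, None))[0] for w in words)


def process_address(raw: str) -> list[str]:
    """Delimit and rewrite an address string into entity segments."""
    segments: list[str] = []
    for part in (p.strip() for p in raw.split(",")):
        street = STREET.match(part)
        if street:
            # Street number and street name become two segments.
            segments += [street.group(1), rewrite(street.group(2))]
            continue
        postcode = POSTCODE.match(part)
        if postcode:
            # Assumption: the city is upper-cased, as in the example list.
            segments += [postcode.group(1), postcode.group(2).upper()]
            continue
        segments.append(rewrite(part))
    return segments


print(process_address("4 rue Béridot, 38500 Voiron, France"))
# ['4', 'R. Béridot', '38500', 'VOIRON', 'FR']
```

Keeping the rules in data (dictionaries and patterns) rather than in code is one way to let the lexicon grow without touching the processing logic.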

Whatever processing is actually performed on the character string designating the named entity, a list comprising at least one entity segment is available at the end of the processing step.

With reference to FIG. 2, the automatic indexing method continues with a step of searching the text field of the entity thesaurus for each entity segment of the list, so as to identify a corresponding record for each segment. In FIG. 2, this step returns the record indexes 1 and N, whose text fields do indeed contain the respective segments "XYZ" and "ABC".
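The search step can be illustrated with a small in-memory thesaurus. This is a sketch under assumptions: the dictionary layout, the function names `find_record` and `search_segments`, and the use of index 2 to stand in for the record "N" of FIG. 2 are all hypothetical.

```python
# Hypothetical in-memory thesaurus: record index -> record fields.
# Index 2 stands in for the record "N" of FIG. 2.
thesaurus = {
    1: {"designation": "XYZ", "texts": ["XYZ"], "links": [2]},
    2: {"designation": "ABC", "texts": ["ABC", "A.B.C"], "links": []},
}


def find_record(segment, thesaurus):
    """Return the index of the record whose text field contains the segment."""
    for index, record in thesaurus.items():
        if segment in record["texts"]:
            return index
    return None  # unfavorable case: no matching record


def search_segments(segments, thesaurus):
    """Map each entity segment to its record index (or None)."""
    return {segment: find_record(segment, thesaurus) for segment in segments}


print(search_segments(["XYZ", "ABC"], thesaurus))  # {'XYZ': 1, 'ABC': 2}
```

A linear scan is used here for readability; a production system would more plausibly query a database index on the text field.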

In favorable situations, which should be the most frequent once the thesaurus is sufficiently populated, this step provides, for each segment of the list, the corresponding thesaurus record or a reference to it (a pointer to this record, an index number of this record).

Unfavorable situations occur when the search in the thesaurus fails to identify a record corresponding to one of the entity segments of the list. This situation can have several causes.

For instance, the segment may contain an error, for example a typographical or spelling error. To remedy this situation, the method may include an error correction step. Error correction can be applied to the raw character string, before the processing step, but it is preferably applied to the entity segments of the list of segments.

Many error correction methods are possible, but the error correction step advantageously includes a measure of similarity between an entity segment and the character strings of the text field of the thesaurus records. If, for a particular thesaurus record, this similarity measure is below a determined threshold, the entity segment is considered to contain an error and is replaced by the character string of the text field of that particular record.
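One possible reading of this correction step is sketched below. The description does not name a specific measure; this sketch assumes the measure is an edit (Levenshtein) distance, so that a value below the threshold means the two strings are close. The function names and the default threshold of 3 are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def correct_segment(segment: str, known_strings: list[str], threshold: int = 3) -> str:
    """Replace the segment by the closest known string if it is close enough.

    Assumption: the measure is a distance, so "below the threshold"
    means "sufficiently similar".
    """
    best = min(known_strings, key=lambda s: edit_distance(segment, s), default=None)
    if best is not None and edit_distance(segment, best) < threshold:
        return best
    return segment  # no close match: the segment is left unchanged


known = ["Saint Gobain", "CNRS"]
print(correct_segment("Saint Gobian", known))  # Saint Gobain
print(correct_segment("MIT", known))           # MIT
```

The threshold would in practice need tuning, since a fixed absolute distance treats short and long strings very differently.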

The correction step can be executed before the search step, in which case it is executed systematically, for example as a sub-step of the processing step. Alternatively, the correction can be executed after the search step, only when the search step has failed for an entity segment, that is to say when the search for the entity segment in the thesaurus has produced no matching record.

In some cases, the thesaurus may also not contain an entity of the list, even after an error correction step has been executed on the entity segment or segments. In this case, one may choose to update the thesaurus, for example by adding a new record corresponding to this entity or by attaching the segment to an already existing thesaurus record. This thesaurus update, shown in dotted lines in FIG. 2, is described in detail in the part of the description dealing with the construction of the thesaurus. In a variant, also detailed below, the thesaurus update is validated, edited or carried out by a user.

Whatever the particular sequence of steps, references to the thesaurus records corresponding to the entity segments of the list, that is to say to each entity identified in the character string, are available at the end of the search step.

In a following step, an indexing table is formed, that is to say the portion of the source data being indexed (or a reference to this portion) is linked to the records identified in the thesaurus at the end of the previous step. Thus, as shown in FIG. 2, at the end of this linking step an indexing table is available which may have two columns: a first column designating the portion of the source textual data that has just been indexed (document, database entry, etc.), and a second column designating the thesaurus records identified during the steps of the method just presented. Each row of the indexing table therefore associates a portion of the textual data with at least one reference to at least one entry of the entity thesaurus.
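The two-column indexing table can be sketched as a mapping from a portion identifier to a list of record indexes. The names `index_portion` and `indexing_table`, the identifier "D1" used as a key and the record indexes are illustrative assumptions.

```python
def index_portion(portion_id, record_indexes, indexing_table):
    """Link a portion of the source data to the thesaurus records found in it."""
    row = indexing_table.setdefault(portion_id, [])
    for index in record_indexes:
        if index not in row:  # avoid duplicate references in a row
            row.append(index)
    return indexing_table


indexing_table = {}
# Portion "D1" contains the entities of records 1 and 2 (2 standing in for "N").
index_portion("D1", [1, 2], indexing_table)
print(indexing_table)  # {'D1': [1, 2]}
```

Each key plays the role of the first column and each list the role of the second; repeating the call for every portion fills the table row by row.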

When, as is generally the case, a plurality of portions of textual data are to be indexed, the automatic indexing method just presented can be repeated for each of these portions, so as to form an indexing table linking each portion of textual data to the different hierarchical levels of entities that this portion contains.

Automatic identification

According to the invention, the indexing table can be used in a method for automatically identifying a portion of textual data among the textual data that served as the basis for building the index.

This identification method, shown in FIG. 3, comprises a first step of obtaining a designated entity on which the identification is to be based. This may, for example, involve obtaining from a user the designation of the entity serving as the basis for the data selection the user wishes to perform. In FIG. 3, the entity "ABC" was provided during this step.

The designated entity may be of any hierarchical level. To continue one of the previous examples, when the source textual data consists of scientific articles, the user may search for the articles in which at least one of the authors is affiliated with the CNRS (higher hierarchical level), or for the articles in which at least one of the authors belongs to a particular CNRS laboratory (lower hierarchical level).

The identification method continues with a step of searching the text field of the entity thesaurus for the record corresponding to the designated entity. In FIG. 3, this step provides a reference to the record of index N of the thesaurus.

Using the index of this record, all the portions of textual data associated with this record can be retrieved from the indexing table during an identification step. In FIG. 3, this step provides a reference to the document D1, indicating that this document refers to the designated entity "ABC".
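The two lookups of the identification method (thesaurus first, then indexing table) can be sketched as follows. The data layout, the function name `identify_portions`, the second document "D2" and the use of index 2 for the record "N" are assumptions for illustration.

```python
# Hypothetical data: record 2 stands in for the record "N" of FIG. 3.
thesaurus = {
    1: {"texts": ["XYZ"]},
    2: {"texts": ["ABC", "A.B.C"]},
}
indexing_table = {"D1": [1, 2], "D2": [1]}


def identify_portions(designated, thesaurus, indexing_table):
    """Return the portions of textual data associated with the designated entity."""
    # Step 1: search the text field of the thesaurus for the designated entity.
    record = next(
        (index for index, rec in thesaurus.items() if designated in rec["texts"]),
        None,
    )
    if record is None:
        return []
    # Step 2: retrieve every portion whose row references this record.
    return [portion for portion, refs in indexing_table.items() if record in refs]


print(identify_portions("ABC", thesaurus, indexing_table))  # ['D1']
```

Because the lookup goes through the thesaurus record rather than the raw string, a variant spelling such as "A.B.C" in this sketch resolves to the same documents.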

Thus, by combining the entity thesaurus and the indexing table, it is possible to search textual data associated with a named entity that may have several hierarchical levels, with the search bearing on any hierarchical level of the named entity.

Automatic construction of the thesaurus

As has been seen, the thesaurus is a database made up of a plurality of entity records. For a given entity, each record associates:

- the normalized description of the given entity, in a designation field;

- at least one character string, extracted from the textual data and designating the named entity, in a text field;

- and, where applicable, at least one link to another thesaurus record corresponding to an entity to which the given entity is hierarchically related. This link or these links are placed in the link field.
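The three fields listed above can be sketched as a small record type. The class name `ThesaurusRecord` and the attribute names are hypothetical; the values reproduce the records of index 1 and 2 described for FIG. 1, while the record of index 3, whose content is not detailed in the description, is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusRecord:
    designation: str                                 # normalized form of the entity
    texts: list[str] = field(default_factory=list)   # strings extracted from the data
    links: list[int] = field(default_factory=list)   # indexes of related records


# Extract of the thesaurus of FIG. 1 (record 3 is not detailed in the source).
thesaurus: dict[int, ThesaurusRecord] = {
    1: ThesaurusRecord(
        designation="Surface Verre Interface",
        texts=["Surface Verre Interfaces, Unité Mixte de Recherche"],
        links=[2, 3],  # record 1 is of a lower level than records 2 and 3
    ),
    2: ThesaurusRecord(
        designation="CNRS",
        texts=["CNRS", "C.N.R.S", "Centre National Recherche Scientifique"],
    ),
}

print(thesaurus[2].designation)  # CNRS
```

Storing links as record indexes rather than nested objects keeps each record flat, which maps naturally onto rows of a database table.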

The invention provides a method for automatically building the thesaurus of named entities, a named entity possibly comprising a plurality of entities of different hierarchical levels. The steps of this method are shown schematically in FIG. 4.

The method successively processes portions of source textual data, for example documents from a collection of documents. For each portion, the method comprises a step of extracting, from the portion of source textual data, a character string designating a named entity. The automatic construction method also comprises a step of processing the character string to establish a list comprising at least one entity segment.

These extraction and processing steps are identical in every respect to those described in the detailed presentation of the automatic indexing method and shown in FIG. 2. For the sake of brevity, their description is not repeated here.

To build a compact and consistent thesaurus, and as explained above, the processing step advantageously includes a step of rewriting and/or error correction of the entity segments.

At the end of these steps, a list of segments is therefore available, some of which may already be present in the thesaurus. The automatic thesaurus construction method therefore comprises a step of eliminating from this list of entity segments those already present in the thesaurus. More specifically, each segment of the list of entity segments is searched for in the text field of the thesaurus. If a corresponding record exists in the thesaurus, the segment is not retained in the list of new segments. A pruned list, called the "list of new segments", is formed in this way. In FIG. 4, this list consists of the new segment "ABC".
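The elimination step amounts to a filter over the list of segments. The sketch below assumes a dictionary-based thesaurus and a hypothetical function name `new_segments`; it mirrors the situation of FIG. 4, where "XYZ" is already known and "ABC" is new.

```python
def new_segments(segments, thesaurus):
    """Keep only the segments that appear in no text field of the thesaurus."""
    known = {text for record in thesaurus.values() for text in record["texts"]}
    return [segment for segment in segments if segment not in known]


# Hypothetical thesaurus state before processing the string "XYZ, ABC".
thesaurus = {1: {"designation": "XYZ", "texts": ["XYZ"], "links": []}}

print(new_segments(["XYZ", "ABC"], thesaurus))  # ['ABC']
```

Collecting the known strings in a set first keeps each membership test constant-time, which matters once the thesaurus holds many records.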

For each new segment of this list, the thesaurus is updated.

The update may be the creation of a new record in the thesaurus, during a creation step. Creating a record consists in filling in, in the database forming the thesaurus, the designation and text fields (using the identified entity segment) and, where applicable, the link field.

Alternatively, the update may consist, during an association step, of associating a new segment with a preexisting record. For example, the segment error correction step may reveal that a new segment is not sufficiently similar to a character string of the text field of an existing record to be corrected into that form, but that it is nevertheless similar enough that, in all likelihood, the new segment refers to the same entity as the one corresponding to that record. In this case, it is preferable to associate the new segment with that record rather than to create a new record. Associating a new segment with an existing record consists of adding the character string corresponding to the new entity segment to the text field of the record. In FIG. 4, this association step has led to the text field of record N being updated with the string "ABC".
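The choice between association and creation can be sketched as follows. This is only an illustration under stated assumptions: the description does not specify a similarity measure or thresholds, so the sketch substitutes the ratio from Python's standard `difflib.SequenceMatcher` and two arbitrary threshold values (`CORRECT_THRESHOLD`, `ASSOCIATE_THRESHOLD`); the record layout (dictionaries with `designation` and `texts` keys) is likewise hypothetical.

```python
from difflib import SequenceMatcher

# Assumed thresholds, chosen for illustration only.
CORRECT_THRESHOLD = 0.9    # at or above: segment is treated as an already known form
ASSOCIATE_THRESHOLD = 0.6  # at or above (but below CORRECT_THRESHOLD): associate


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] (stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def update_with_segment(records: list, segment: str):
    """Return (action, record) after updating `records` for a new segment."""
    best, best_score = None, 0.0
    for record in records:
        for text in record["texts"]:
            score = similarity(segment, text)
            if score > best_score:
                best, best_score = record, score

    if best is not None and best_score >= CORRECT_THRESHOLD:
        # Close enough to be corrected into an existing form: nothing to add.
        return "corrected", best
    if best is not None and best_score >= ASSOCIATE_THRESHOLD:
        # Association step: add the string to the text field of the record.
        best["texts"].append(segment)
        return "associated", best
    # Otherwise fall back to the creation step.
    new_record = {"designation": segment, "texts": [segment]}
    records.append(new_record)
    return "created", new_record
```

With these assumed thresholds, a segment such as "ST Gobain" would be associated with a record whose text field contains "Saint Gobain", while an unrelated segment would lead to a new record.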

It cannot be guaranteed that fully automatic processing of textual data will systematically lead to a perfectly coherent thesaurus. In particular, some typographical errors may not be correctable by purely automatic means. Where a highly coherent thesaurus is important, provision can be made for the updating of a new thesaurus record to be first submitted to a user for validation and, where appropriate, editing.

Thus, during the creation step, a user may prefer to provide a standardized form of the entity (designation field) different from the one proposed (which corresponds to the identified entity segment). For example, a user might prefer to use the character string "CNRS" as the content of the designation field rather than the string "C.N.R.S.", which would form a new segment.

According to a second example, and more substantially, a user might prefer to associate a new segment with an existing record rather than create a new record. For example, a user might prefer to associate a new segment "ST Gobain" with a preexisting record whose designation and text fields are "Saint Gobain".

According to a last example, a user might wish to confirm an association of a new segment with an existing record that the method of the invention proposes automatically. In this case, the user could validate the proposal or, on the contrary, choose to associate the new segment with another record of the thesaurus.
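The three validation scenarios above (editing the designation, redirecting a creation to an association, and confirming or changing a proposed association) can be sketched with a validation callback. This is a hypothetical design, not the one disclosed: the `Proposal` type, the `apply_update` function, the validator functions and the representation of the thesaurus as a dictionary mapping each designation to its text field are all assumptions made for illustration.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Proposal:
    """Update proposed automatically, before user validation."""
    action: str                        # "create" or "associate"
    segment: str                       # the new entity segment
    designation: Optional[str] = None  # standardized form, for "create"
    target: Optional[str] = None       # designation of an existing record, for "associate"


Validator = Callable[[Proposal], Proposal]


def apply_update(thesaurus: Dict[str, List[str]], proposal: Proposal,
                 validate: Optional[Validator] = None) -> Proposal:
    """Apply a proposal, letting an optional validator accept or edit it first."""
    final = validate(proposal) if validate is not None else proposal
    if final.action == "create":
        name = final.designation or final.segment
        thesaurus.setdefault(name, []).append(final.segment)
    elif final.action == "associate":
        # Raises KeyError if the chosen target record does not exist.
        thesaurus[final.target].append(final.segment)
    else:
        raise ValueError(f"unknown action: {final.action}")
    return final


def prefer_cnrs(p: Proposal) -> Proposal:
    """Example validator: the user supplies a different standardized form."""
    return replace(p, designation="CNRS")


def redirect_to_saint_gobain(p: Proposal) -> Proposal:
    """Example validator: the user associates instead of creating."""
    return Proposal("associate", p.segment, target="Saint Gobain")
```

Passing no validator corresponds to the fully automatic mode; passing a validator that returns the proposal unchanged corresponds to a simple confirmation by the user.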

The production, indexing and identification methods that have just been presented are intended to be implemented by a computer. The invention therefore also relates to the computer program or programs containing instructions adapted to the implementation of each of the steps of the abovementioned methods when the program is executed on a computer.

The invention finds a particularly advantageous application in the field of processing scientific and technical information (scientific articles, symposium abstracts, conferences, patents, ...), but can also be applied to the processing of information from any application domain, in particular if this information is available in the form of textual data and is likely to incorporate named entities having several hierarchical levels.

Of course, the invention is not limited to the example of implementation described, and variant embodiments can be made without departing from the scope of the invention as defined by the claims.

Thus, although it has been indicated that certain data could be organized in the form of a table, the invention is by no means limited to such a data structure, and it can be implemented independently of the data structure chosen. The data may thus be structured in the form of lists, indexed lists, tables, objects, databases or structures of any form, depending on the need and the circumstances.

Claims

1. Method for the automatic production from textual data of a thesaurus of named entities, a named entity possibly comprising a plurality of entities of different hierarchical levels, the thesaurus forming a database made up of a plurality of entity records, each record combining for a specific entity:

- a standardized form of the determined entity;

- at least one character string extracted from textual data and designating the determined entity;

- and, where applicable, at least one link to another record in the thesaurus corresponding to an entity with which the determined entity is hierarchically linked;

the method for automatic production of the thesaurus comprising:

- a step of extracting, from the textual data, a character string designating a named entity;

- a step of processing the character string to establish a list comprising at least one entity segment;

- a step of eliminating from the list the entity segments already present in the thesaurus, to form a list of new segments;

- a step of updating the thesaurus for each new segment of the list of new segments.

2. Method according to the preceding claim wherein the processing step comprises a step of delimiting at least one entity segment by identifying punctuation marks in the character string.

3. Method according to one of the preceding claims wherein the processing step comprises a step of determining the hierarchical level associated with each entity segment.

4. Method according to one of the preceding claims wherein the processing step comprises a step of rewriting each entity segment.

5. Method according to one of the preceding claims, in which the processing step comprises the identification of standard primers in each entity segment, the standard primers being contained in a processing lexicon.

6. Method according to the preceding claim, in which the processing lexicon associates a standard primer with a hierarchical level and with a rule for rewriting the standard primer in an entity segment.

7. Method according to the preceding claim in which the standard primer is an empty word and the rewrite rule is to delete the empty word in the entity segment.

8. Method according to one of the preceding claims including a step of correcting errors in entity segments.

9. Method according to one of the preceding claims wherein the step of updating the thesaurus comprises a step of creating a new record.

10. Method according to one of claims 1 to 8 wherein the step of updating the thesaurus comprises a step of associating a new segment with a preexisting record.

11. Method according to one of the preceding claims, in which the updating step is subject to the validation and / or to the prior modification of a user.

12. Method for automatic indexing of textual data comprising:

- a step of extracting from the textual data a character string designating a named entity which may include a plurality of entities of different hierarchical levels;

- a step of searching, in a thesaurus of named entities established using a method in accordance with one of the preceding claims, for a matching record for each entity segment of the list;

- a step of linking, in an indexing table, the matching record or a reference to the matching record with the text data or a reference to the text data.

13. Method for automatic identification of textual data using an indexing table obtained according to a method according to the preceding claim, in which the identification method comprises:

- a step of obtaining a designated entity to which the identification must relate;

- a search step, in the entity thesaurus, of the record corresponding to the designated entity;

- a step of identifying, from the indexing table, the text data associated with the record corresponding to the designated entity.

14. Computer program containing instructions adapted to the implementation of each step of a method according to one of claims 1 to 13, when the program is executed on a computer.