WO2006013233A1

WO2006013233A1 - Method and device for automatic processing of a language

Info

Publication number: WO2006013233A1
Application number: PCT/FR2004/001692
Authority: WO
Inventors: Johannes Heinecke; Alain Cozannet
Original assignee: France Telecom
Priority date: 2004-07-01
Filing date: 2004-07-01
Publication date: 2006-02-09

Abstract

The invention concerns a method for automatically processing a language which consists in a syntactic analysis of a written text and a semantic analysis of the text to derive its meaning. The syntactic and semantic analyses are performed simultaneously.

Description

Method and device for automatically processing a language

The invention relates to the parsing of a language within the framework of natural language processing (TALN, from the French acronym). Automatic language processing is conventionally used to allow a computer to understand texts or requests formulated by users, either in writing or orally, in order to launch various services.

Such an analysis, which allows the comprehension of textual or vocal information, is generally necessary in order to shorten long documents without losing important information, to rephrase or paraphrase a text, to automatically translate a text or to search for adequate answers to a specific question, as is the case, for example, in search engines.

The automatic processing of written or vocal textual information conventionally uses a deep syntactic analysis followed by a generation of a semantic representation of the content of the text. Such a representation can then be the basis of an automatic translation into another language, the elaboration of a summary or an automatic classification of the text, etc.

In the state of the art, during the parsing of the text, no semantic information is accessible other than lexical semantic information related to the meaning of the words of the text. Indeed, machine translation tools, other than those based on statistical methods, rely on a syntactic analysis to deduce the syntactic structure of the text, followed by a semantic or ontological analysis to determine the semantic structure of the text. A pivot representation, independent of the source language and the target language, is then developed. From this representation, a semantic representation is developed taking into account the target language, then a syntactic structure from which the necessary lexical forms of the translated text are generated.

Lexical semantic information alone is never sufficient to avoid the generation of semantically incorrect syntactic trees. Indeed, the presence of homonyms in an analyzed text leads to the construction of as many syntactic trees as there are meanings of each term, whereas only one of these trees corresponds to the exact meaning of the text. Thus, the use of a very large lexicon, or the desire to cover a very wide or ambiguous domain, risks generating a very large number of inconsistent results.

Thus, the automatic language processing techniques according to which a syntax analysis followed by a semantic construction is carried out in order to generate an ontologically correct representation of the content of a text have a certain number of disadvantages.

First, the search for completeness leads to the preservation of possibilities of low representativeness. The use of weighting methods or specialized lexicons makes it possible to reduce this disadvantage, but without addressing its causes.

Moreover, lexicons multiply lexical entries, according to more or less relevant criteria of meaning, in order to take into account as many meanings as possible.

In view of the foregoing, the object of the invention is to provide a method and a device for analyzing a language making it possible to attribute to an utterance the meaning or meanings that it conveys, in the context in which it is located.

For this purpose, the invention proposes, according to a first aspect, a method of automatically processing a language by parsing a written text or utterance and semantically analyzing said text to deduce its meaning. According to the invention, the parsing and the semantic analysis are carried out simultaneously.

Simultaneous parsing and semantic analysis make it possible to invalidate an inconsistent syntactic analysis, that is to say one that is asemantic or in conflict with basic ontological rules. These ontological rules are thus used during the linguistic analysis of a text in order to verify or to falsify a syntactic relation immediately after its creation, that is to say after the application of morpho-syntactic rules. The meanings of linguistic structures that contradict an ontological model can then be deleted.

By eliminating syntactic relations that cannot be verified ontologically, it is possible to reduce ambiguities very early in the processing of a text. This reduction in data consequently greatly increases the speed of text processing.

According to another characteristic of the method according to the invention, during the parsing, a syntactic tree is created consisting of a set of nodes each formed of a word or a phrase and each associated with a syntactic category and a syntactic dependency function linking two nodes of said tree. In addition, during the development of the syntax tree, the text is semantically analyzed to validate the elaboration of a branch of the tree between two nodes linked by a syntactic function.

According to another characteristic of the method, during the semantic analysis, an ontology is used which defines the concepts associated with each node and the roles linking these concepts, and a branch of the syntax tree linking two nodes is validated when a role allows a link between the concepts associated with the nodes.

In one implementation, during the semantic analysis, an ontological representation of the text is developed by associating, with each node of each pair of nodes linked by a syntactic function, a set of variables comprising a concept expressing the meaning of the node and an ontological formula linking said nodes, so as to elaborate the set of ontological formulas or representations linking the nodes of the text. The ontological formulas between the nodes linked by a syntactic function can also be combined so as to elaborate a global ontological formula for the text.

In a particular embodiment, automatic language processing comprises the steps of:

- associating, with each node, at least one concept expressing the meaning of the node, by interrogating a first database;

- interrogating a second database in which syntactic rules are stored, in order to establish a syntactic relation linking a main node and a secondary node; and

- attempting to combine the concepts of the main node and the secondary node, the syntactic relation being validated if this attempt succeeds.

Preferably, after validation of the syntactic relationship, the concept of the main node is replaced by the combination of the concepts of the linked nodes.

Similarly, after validation of the syntactic relationship, the concept of the secondary node can be added to an ontological formula of the main node.

According to yet another characteristic of the method according to the invention, provision is furthermore made for a step of morphological analysis of each node of the text to determine its form by consulting a third database in which a lexicon of the nodes is stored.

The subject of the invention is also a device for automatically processing a language, for implementing a method as defined above, comprising a processing module associated with a first database (ON), in which is stored an ontology that defines concepts associated with each node and roles linking these concepts, and with a second database (RS) in which syntactic rules are stored. The processing module comprises means for constructing a syntax tree comprising a set of branches each linking two nodes, each node being formed of a word or phrase and associated with a syntactic category and with a syntactic dependency function, extracted from the second database, connecting the two nodes of the branch. Each branch of the tree is validated when a role allows a link between the concepts associated with the nodes.

According to another characteristic of the invention, this device further comprises a morphological analyzer associated with a third database in which a lexicon is stored.

Other objects, features and advantages of the invention will become apparent on reading the following description, given solely by way of nonlimiting example, and with reference to the appended drawings in which:

- FIG. 1 is a block diagram of a device for automatically processing a language according to the invention;

- FIGS. 2 to 5 illustrate the development of an ontological formula or relationship between nodes;

- FIG. 6 is a flowchart illustrating the main phases of the method according to the invention;

- FIG. 7 is a diagram illustrating an example of an ontology; and

- FIGS. 8 to 11 are diagrams illustrating an example of elaboration of an ontological representation of a text.

FIG. 1 shows a block diagram of a device for automatic processing of a language according to the invention, designated by the general reference numeral 10. In the context of the present description, automatic processing of a language means the syntactic and semantic analysis of a sentence or a text.

This device 10 is intended to develop, from a written or spoken text to be analyzed, a syntax tree using ontological knowledge. As can be seen in this figure, the device 10 essentially comprises a morphological analyzer 12, which receives as input a text TX to be analyzed, and a processing module 14, which performs the actual analysis of the nodes, that is to say the words or phrases of the text TX, in order to elaborate the syntax tree.

The morphological analyzer 12 is associated with a database LX in which a lexicon is stored, in order to carry out a preliminary analysis of the text TX comprising a lexical search, a search for the forms of the nodes, and an identification of the affixes.

The processing module 14 is, in turn, connected to a first database ON, in which are stored the concepts and ontological relations that define the ontology, and to a second database RS, in which are stored syntactic rules for associating each node with a syntactic category and for establishing a syntactic dependency function between two nodes.

With regard to the syntactic analysis of the text, the processing implemented by the processing module 14 is of a conventional type, within the reach of a person skilled in the art. It will therefore not be described in detail below.

It should be noted, however, that it consists of interrogating the database RS in order to establish a syntactic relationship between the nodes of the text, two by two, the ontological rules then being applied to validate each branch of a tree structure thus created.

It is then a question of validating a hypothesis H consisting of a syntactic dependence between two nodes consisting of a directed syntactic function linking a head node and a dependent node.

To do this, each node is associated with a quintuple Q formed by a set of attributes I₁, C, I₂, R and F, such that each quintuple Q is represented by the formula:

Q(I₁, C, I₂, R, F)

in which:

- I₁ designates an identifier of the quintuple;

- C is a concept related to the head node;

- I₂ is another identifier, associated with the concept C in the attributes F and R, if they are indicated;

- R designates a role, i.e. a link or notion of access to the concept C of the head node; and

- F is an ontological formula linked to the node.
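As an illustration only, the quintuple Q could be held in a small record type. The following is a minimal sketch, not the implementation of the invention: the class name `Quintuple`, the field names and the example values are assumptions introduced here, and the choice of strings for every attribute is arbitrary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Quintuple:
    """Hypothetical container for Q(I1, C, I2, R, F)."""
    i1: str                   # I1: identifier of the quintuple
    c: str                    # C: concept related to the head node
    i2: Optional[str] = None  # I2: identifier tied to C inside R and F, if given
    r: Optional[str] = None   # R: role giving access to the concept C
    f: Optional[str] = None   # F: ontological formula linked to the node


# Example values are illustrative; they are not taken from FIG. 8.
q = Quintuple(i1="q1", c="fly")
```

Leaving I₂, R and F optional mirrors the wording "if they are indicated" in the definition above.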

A concept defines the meaning attached to a node. For example, as described later with reference to an exemplary implementation of the invention, a "go" node may be attached to the concept "displacement" or "flight", depending on the stage of processing.

A role defines access to a head node. Thus, for example, an "airplane" dependent node may be linked to a "fly" head node by the "means of transport" role.

The roles between the nodes are deduced from the ontology. For example, if a node A belongs to the category of prepositions and a node B is a nominal group, the role is either named from the lexical information on the preposition obtained during parsing, or derived from the concept related to the dependent node.

Figures 2 to 5 illustrate different scenarios. In these figures, continuous lines correspond to known elements and dashed lines to elements or notions to be discovered. When the nodes and the roles linking them are drawn only with continuous lines, the task is to validate a hypothesis.

In Figures 2 to 5, the circles symbolize ontological concepts or formulas, and the arrows represent a role having the concept as its target. These four cases summarize the communication with the means for elaborating the syntax tree. It should be noted that in some cases the orientation of the role is not known. For example, in Figure 2, the task is to connect two concepts C by a role R to be discovered. With reference to FIG. 3, the task may also be to link two concepts by a known role and thus to validate a hypothesis. It may also be necessary to assign a role R to a known concept (FIG. 4) or, as shown in FIG. 5, to search, given two known role-concept pairs, for a concept accepting these roles with co-domains or targets ("range" in English) compatible with the concepts.
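Three of these cases can be sketched as queries over a set of (domain concept, role, co-domain concept) triples. This is a simplified sketch under stated assumptions: the triple encoding, the function names and the role names are hypothetical, concept hierarchies are ignored, and the case of FIG. 4 is left out because the text does not specify it in enough detail.

```python
# Hypothetical ontology fragment: (domain concept, role, co-domain concept).
TRIPLES = {
    ("displacement", "departs_from", "place"),
    ("displacement", "arrives_at", "place"),
    ("displacement", "means_of_transport", "vehicle"),
}


def find_roles(head, dependent):
    """FIG. 2 case: both concepts are known, the role is to be discovered."""
    return sorted(r for (d, r, c) in TRIPLES if d == head and c == dependent)


def validate(head, role, dependent):
    """FIG. 3 case: concepts and role are all known; check the hypothesis."""
    return (head, role, dependent) in TRIPLES


def find_heads(pairs):
    """FIG. 5 case: find concepts that accept every given (role, concept) pair."""
    domains = {d for (d, _, _) in TRIPLES}
    return sorted(d for d in domains
                  if all((d, r, c) in TRIPLES for (r, c) in pairs))
```

With this toy data, `find_roles("displacement", "place")` yields both roles whose co-domain is "place", which shows why a syntactic constraint (for example the preposition used) is still needed to pick one.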

It should be noted that the ontology in the database ON is modeled so as to satisfy these conditions of use. This modeling deals in particular with the definition of roles, in order to specify whether they form a specific entity constrained by domains and co-domains, and whether a new instance of a role is to be created when this role is specified within a concept. From the lexical point of view, the knowledge associated with words whose meaning corresponds to roles, which is the case of prepositions, can be written as a list of "role name / co-domain" pairs, or may possibly be determinable. If multiple pairs are allowed, both the role name and the co-domain must be written in the ontological object, which adds an interpretive rule to the object: if the concept and the role are both determined, then the concept must be interpreted as the co-domain of the determined role.

The ontological analysis implemented by the processing module 14 during the parsing consists in particular either in finding a role from two end nodes, under the constraint expressed by the syntactic link introduced, or in checking the compatibility of the concepts of the head and dependent nodes, and of a corresponding role, with the ontological rules.

An example embodiment of a method according to the invention will now be described with reference to FIG. 6.

This method begins with a first step 16 during which the morphological analyzer 12 reads the database LX and the processing module 14 reads the databases RS and ON. In the next step 18, the morphological analyzer 12 carries out the morphological analysis of the text TX, which essentially consists in performing a lexical search and identifying the forms, affixes, ... During this step 18, the processing module 14 attaches one or more concepts C or roles R to every word of the sentence.

In the next step 20, the RS syntax rules are applied to the sentence to establish a syntactic relationship between two words or nodes.

For each syntactic relation established between a main node and a secondary node, an attempt is made to validate the syntactic relation by combining each concept of the main node with each concept of the secondary node or, at a subsequent step, with a partial ontological formula or representation (step 22).

If there is no further concept combination to try (step 24) and no concept combination has been validated, the syntactic relation is rejected (step 26), and the procedure returns to the previous phase. If, on the contrary, other concept combinations remain, a concept or the partial ontological formula of the main node and of the secondary node is considered (step 28). In the next step 30, it is checked whether the syntactic relation is validated by the ontology, using the concepts C of the words or the ontological representations R already constructed for the nodes.

After validation, in the next step 32, it is determined whether the main node still has a simple concept or role. If so, this concept is replaced by the combination of the concept of the main node with the concept, the role or the partial ontological formula of the dependent node (step 34). Otherwise, that is to say if the main node is already associated with a partial ontological formula, the concept or partial ontological formula of the dependent node is added to the partial ontological formula of the main node (step 36).

In the next step 38, it is checked whether there are other non-validated syntactic relations, so as to continue processing with other syntactic-ontological rules. If no other rule applies, the sentence cannot be analyzed further, and the representation of the sentence is then delivered (step 40).
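The loop formed by steps 20 to 40 can be sketched as follows. This is a deliberately reduced sketch and not the flowchart of FIG. 6 itself: the function name `analyse`, its parameters and the flat list used as a "partial formula" are assumptions, the first accepted concept pair is kept, and partial formulas already built for the dependent node are not merged.

```python
def analyse(relations, concepts, role_allowed):
    """Validate each proposed syntactic relation against the ontology.

    relations:    (head, dependent) pairs proposed by the syntactic rules (step 20)
    concepts:     node -> list of candidate concepts (attached in step 18)
    role_allowed: callable(head_concept, dependent_concept) -> bool (step 30)
    Returns (formulas, rejected).
    """
    formulas = {}   # head node -> partial ontological formula (list of concepts)
    rejected = []
    for head, dep in relations:
        validated = False
        for hc in concepts[head]:              # steps 22 to 28: try combinations
            for dc in concepts[dep]:
                if role_allowed(hc, dc):       # step 30: ontology check
                    if head not in formulas:   # step 32, then step 34
                        formulas[head] = [hc, dc]
                    else:                      # step 36
                        formulas[head].append(dc)
                    validated = True
                    break
            if validated:
                break
        if not validated:                      # step 26: reject the relation
            rejected.append((head, dep))
    return formulas, rejected                  # step 40: deliver the result
```

Because rejection happens inside the loop, a relation that the ontology cannot support never contributes to a formula, which is the early pruning the method relies on.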

By way of example, the processing implemented within the processing module 14 will now be briefly described for the analysis of the French sentence meaning "I would like to go from Paris to Madrid tomorrow". This sentence is ambiguous: the adverb "tomorrow" can attach either to the verb "would like" or to the verb "go".

The morphological analyzer 12 has previously provided the following information:

word          form                               concepts / roles
I             pronoun, 1st person singular       speaker
would like    modal verb, 1st person singular    wanting
go            verb, infinitive                   to go, to fly
from (de)     preposition                        departure from, to belong, object
to (à)        preposition                        arrives from, to, to, receives, belongs
Paris         proper name                        Paris
Madrid        proper name                        Madrid
tomorrow      adverb                             tomorrow

After morphological analysis, the TX text is processed by the processing module 14. The RS database provides a set of syntactical rules while the ON database provides the concepts, roles, and ontological relationships of the concepts for each node.

Figure 7 shows an example of an ontology usable for processing the French sentence meaning "I would like to go from Paris to Madrid tomorrow". As will be appreciated, the ontology represented in this figure has been greatly simplified for the sake of clarity. In particular, dotted relationships between two concepts C indicate that intermediate concepts have been omitted between the two concepts shown. The figure shows a set of concepts C as well as the roles linking pairs of concepts. A dependent concept can be related to a head concept when this dependent concept specializes a generic notion associated with the head concept. Two concepts can also be linked by a role R.

For example, the head concept "displacement" can be related to the concepts "travel by car", "journey by train" and "fly", insofar as these concepts are specializations of the generic concept "displacement".

Similarly, the concepts "place of arrival" and "place of departure" can be related to the concept "displacement" by the roles "arrives at" and "departs from", respectively.

As indicated above, while the processing module 14 creates the syntax tree, the ontological representation is generated, and the syntactic relations that cannot be validated by the ontology are immediately deleted. For example, while parsing would allow a person to be attached after the preposition "de" to indicate possession, a syntactic branch created in this way would be immediately deleted, since no corresponding path exists in the ontology: the ontology provides that the concept "displacement" can only be associated with the concepts "place of departure" and "place of arrival", by the roles "departs from" and "arrives at", respectively.
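The way a role defined on a generic concept can serve its specializations may be sketched as a walk up the concept hierarchy. The data below is a hypothetical fragment in the spirit of FIG. 7, not a transcription of it; the names `IS_A`, `ROLES`, `ancestors` and `role_allowed` and the role spellings are assumptions.

```python
# Hypothetical hierarchy: concept -> more generic concept.
IS_A = {
    "fly": "displacement",
    "journey by train": "displacement",
    "travel by car": "displacement",
    "Paris": "place",
    "Madrid": "place",
}

# Hypothetical roles: (domain concept, role) -> co-domain concept.
ROLES = {
    ("displacement", "departs_from"): "place",
    ("displacement", "arrives_at"): "place",
}


def ancestors(concept):
    """Yield the concept itself, then each more generic concept."""
    while concept is not None:
        yield concept
        concept = IS_A.get(concept)


def role_allowed(head, role, dependent):
    """True if the head or one of its generic concepts carries the role with a
    co-domain that is the dependent or one of its generic concepts."""
    generics = set(ancestors(dependent))
    return any(ROLES.get((h, role)) in generics for h in ancestors(head))
```

Here `role_allowed("fly", "departs_from", "Paris")` succeeds through the generic concepts "displacement" and "place", whereas a possession role such as "belongs" finds no path and the corresponding branch would be dropped.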

Figure 8 shows some of the quintuples Q used in the elaboration of the ontology.

Referring now to Figure 9, the syntactic analysis determines that the syntactic relation "subject" links the node "I" (secondary node) and the node "would like" (head node). The concepts C associated with these nodes are "speaker" and "will", respectively. The ontological validation is positive (step 22), that is to say that the two concepts can be combined so as to associate with the node "would like" a partial ontological formula containing the ontological formulas "will(w)" and "speaker(s)" and an additional role f(w, s) indicating who is the subject of the verb "want".
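The combination just described can be pictured as joining two lists of terms and appending the linking role. This is only a sketch: representing a formula as a list of strings, and the helper name `combine`, are assumptions made for illustration.

```python
def combine(head_formula, dependent_formula, role, head_var, dependent_var):
    """Join two partial formulas and add the role term linking their variables."""
    return head_formula + dependent_formula + [f"{role}({head_var}, {dependent_var})"]


formula = combine(["will(w)"], ["speaker(s)"], "f", "w", "s")
print(", ".join(formula))  # prints: will(w), speaker(s), f(w, s)
```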

Thereafter, the syntactic relations between "go" and "de", "à" and "tomorrow", between "de" and "Paris", and between "à" and "Madrid" are established and combined in order to obtain a partial ontological formula.

As can be seen in Figure 9, the role "belongs" of the node "de" and the role "est_à" of the node "à" are not retained, because the ontology does not allow a role "belongs" or a role "est_à" starting from the concepts "fly" or "go". A syntactic relation between the words "would like" and "go" is then established. These nodes no longer carry a list of concepts but already a partial ontological formula. After validation of this syntactic relation, the ontological formulas are combined as indicated above with reference to Figure 6 (step 36).

When validated syntactic relations have been established between all the words of the text, we obtain a complete ontological formula (figure 10).

It is possible to create additional syntactic trees if the syntactic rules permit, which would lead to the generation of other ontological formulas. In the example shown, the grammar and the lexicon are very small and do not allow any other solution.

Figure 11 shows the result of the entire process. As can be seen in this figure, the role "belongs" is not validated, because the concept "go" and its more generic concepts do not have such a role in the ontology ON. The syntactic relation between the modal verb "would like" and the adverb "tomorrow" is also removed, because the ontology has no role linking the concept "will" with the concept "tomorrow" or with more generic concepts than "tomorrow".

Claims

1. A method of automatically processing a language by parsing a written text or utterance and semantic analysis of said text to deduce the meaning, characterized in that the parsing and the semantic analysis are performed simultaneously.

2. Method according to claim 1, characterized in that during the parsing, a syntactic tree is formed consisting of a set of nodes each formed of a word or a phrase and each associated with a syntactic category and a syntactic dependency function linking two nodes of said tree, and in that during the development of the syntax tree, the text is semantically analyzed to validate the development of a branch of the tree between two nodes linked by a syntactic function.

3. Method according to claim 2, characterized in that during the semantic analysis, an ontology is used which defines the concepts associated with each node and the roles linking these concepts, and a branch of the syntactic tree linking two nodes is validated when a role allows a link between the concepts associated with the nodes.

4. Method according to any one of claims 1 to 3, characterized in that during the semantic analysis, an ontological representation of the text is developed by associating with each node of each pair of nodes linked by a syntactic function a set of variables including a concept translating the meaning of the node and an ontological formula linking said nodes, so as to elaborate the set of ontological formulas linking the nodes of the text.

5. Method according to claim 4, characterized in that the ontological formulas are combined between the nodes linked by a syntactic function so as to formulate a global ontological formula for the text.

6. Method according to any one of claims 1 to 5, characterized in that it comprises the steps of:

associating with each node at least one concept (C) translating the meaning of the node by interrogating a first database (ON);

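interrogating a second database (RS) in which syntactic rules are stored for establishing a syntactic relation linking a main node and a secondary node; and

attempting to combine the concepts of the main node and the secondary node, the syntactic relation being validated if this attempt succeeds.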

7. Method according to claim 6, characterized in that after validation of the syntactic relationship the concept of the main node is replaced by the combination of the concepts of the linked nodes.

8. Method according to claim 6, characterized in that after validation of the syntactic relationship the concept of the secondary node is added to an ontological formula of the main node.

9. Method according to any one of claims 6 to 8, characterized in that it further comprises a step of morphological analysis of each node of the text to determine its form by consulting a third database in which a lexicon of nodes is stored.

10. Device for automatically processing a language, for implementing a method according to any one of claims 1 to 8, characterized in that it comprises a processing module (14) associated with a first database (ON) in which is stored an ontology which defines concepts associated with each node and roles linking these concepts and a second database (RS) in which syntactic rules are stored, the processing module comprising means for developing a syntax tree comprising a set of branches linking two nodes each formed of a word or a phrase and each associated with a syntactic category and with a dependency function connecting the two nodes of the branch, which are extracted from the second database, each branch of the tree being validated when a role allows a link between the concepts associated with the nodes.

11. Device according to claim 10, characterized in that it further comprises a morphological analyzer (12) associated with a third database (LX) in which a lexicon is stored.