WO2020079109A1

WO2020079109A1 - Device for automatically processing text by computer

Info

Publication number: WO2020079109A1
Application number: PCT/EP2019/078140
Authority: WO
Inventors: Amboise CADE; Henri FAUCHER DE CORN
Original assignee: Meremind
Priority date: 2018-10-18
Filing date: 2019-10-17
Publication date: 2020-04-23
Also published as: FR3087555A1

Abstract

A device for automatically processing text by computer which receives a sentence cut into tokens, and which uses a rectifier, a matching device, and a combiner in a repetitive manner, to produce a tree-shaped structure that describes the syntactic and semantic links of the sentence based on probabilistic, non-monotonic logic.

Description

Device for automatically processing text by computer

The invention relates to the field of automatic processing of text by computer, and more particularly to the field of natural language processing (NLP).

Natural language processing has two main branches: linguistic methods and automatic training methods.

The methods of the first branch are based on the theorization of languages derived from Chomsky's theories. However, this branch has only given rise to largely manual, complex, heuristic-based implementations, and has never produced a computer implementation giving satisfactory results for general use.

The foundations of artificial intelligence based on automatic training were laid in the 1960s. In the last five years, as technology has evolved, and with the explosion of the amount of data available to perform automatic training (machine learning or ML, deep learning or DL, and neural networks or NN), this field has grown exponentially.

In general, automatic training relies on a machine determining a statistical model through training, the parameters of which are fixed by the person who programs the training, and on the basis of a training set. In practice, this means that the designer controls the principles of training and its parameters, as well as the data on which training is based, but not the result, which is called the inference model. Thus, once training is finished, it is the inference model which is used to make predictions on the input data to be processed, and the designer can see it only as a black box. Applied to NLP, these methods are mainly based on the vectorization of words, with models of the Bag-of-words or Word2Vec type, and on treating these vectors as an automatic training problem. These models also rely on a sequence in which syntactic analysis is followed by semantic analysis. Thus, sentences are first parsed syntactically, then a meaning is assigned to the result.

However, the “black box” nature of the inference models produced contradicts the objective pursued in NLP. Indeed, language has a meaning; that is its very foundation. In addition, this meaning is not expressed by syntax alone or by semantics alone, but by a combination of the two. For these reasons, and contrary to what one might think, the solutions of this second branch also rest on a large quantity of heuristics, which have the additional defect of being devoid of perceptible sense or logic, because they are created to satisfy an inference engine whose organization is not understood.

There is therefore a need for a stable automatic text processing device, functional in the most general sense of the term, and which is not based exclusively on heuristics.

The invention improves the situation. To this end, the invention provides a device for automatically processing text by computer, comprising a memory arranged to receive text data to be analyzed in the form of tokens each comprising a character string and a unique token identifier, a concept database associating character strings and concept identifiers, at least some of the concept identifiers being associated with each other, model lexical construction data and model structural construction data, each comprising one or more conditions of application to a characteristic and one or more conclusions constituting elements to be applied to a characteristic, and an observation database associating at least two concept identifiers, a type of relation and an observation value indicating a probability of veracity of the type of relation between the at least two concept identifiers, the device being arranged to work repetitively on a transient comprising lexical characteristics and structural characteristics produced by applying model lexical constructions and model structural constructions, the transient being initialized with lexical characteristics comprising, for each token, the concept identifier which is associated with the character string of the token and whose frequency is the highest in the concept database.

This device also includes:

a rectifier arranged to determine for each lexical characteristic of a transient a list of concept identifiers associated with the concept identifier of this lexical characteristic, to determine a set of observations corresponding to the concept identifiers of the lists thus determined, and to apply a non-monotonic probabilistic logical inference engine to determine the concept identifier of each list such that the observation values associated with these concept identifiers minimize a cost function defined by applying a multivalued logic operator to one or more rules drawn from the content of the characteristics of the transient and instantiated with corresponding observation values, and to replace the concept identifiers of the lexical characteristics of the transient with the concept identifiers thus determined,

a pairer arranged to determine, among the model structural constructions, those whose condition(s) apply to one or more of the characteristics of the transient, and to return the list of structural constructions with the characteristic(s) to which their conditions apply, the device being further arranged to rank the structural constructions associated with each characteristic by frequency of use, the first being the structural construction to be applied to the transient and the others forming a list of options, and

a combiner arranged to execute sequentially the selection of a structural construction to be applied to the transient, the storage of a copy of the transient with the list of options associated with the structural construction, and the application of the structural construction to be applied to the transient to the characteristic or characteristics of the transient to which this structural construction has been associated by the pairer, and to repeat this sequential execution on the transient thus modified with the structural construction to be applied to the next transient.

This device is also arranged to determine, after the execution of the combiner, whether the characteristics of the transient produced define a tree in which all the nodes are linked together and which is devoid of cycles, to return this tree if this is the case, and to repeat the execution of the rectifier, the pairer and the combiner on the last transient produced otherwise, and is also arranged, when the pairer does not return any structural construction to be applied to the transient, to replace the current transient with the copy of the most recent transient, and to execute the combiner with the first construction from the list of options as the structural construction to be applied to the transient. This device is particularly advantageous because it makes it possible to solve the problems described above. Indeed, it is entirely based on the application of an algorithm whose rules are linked to linguistics.
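By way of illustration only, the test described above (a tree in which all the nodes are linked together and which is devoid of cycles) can be sketched as follows. The function name is_tree and the representation of the transient as a list of node names and a list of links are assumptions made for this sketch; they do not appear in the description. The sketch also assumes that every link refers to declared nodes.

```python
from collections import defaultdict


def is_tree(nodes, links):
    """Return True if the undirected graph (nodes, links) is connected and acyclic.

    Hypothetical helper: nodes is a list of names, links a list of (a, b) pairs.
    """
    nodes = list(nodes)
    if not nodes:
        return False
    # A connected graph without cycles on n nodes has exactly n - 1 links.
    if len(links) != len(nodes) - 1:
        return False
    adjacency = defaultdict(set)
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    # Depth-first traversal from an arbitrary node to test connectivity.
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:
        current = stack.pop()
        for neighbour in adjacency[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen == set(nodes)
```

With n - 1 links, connectivity alone is sufficient to exclude cycles, which is why a single traversal is enough.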

In various variants, the device may have one or more of the following characteristics:

- the non-monotonic probabilistic logical inference engine includes an optimizer using the alternating direction method of multipliers (ADMM),

- the device further comprises a filter arranged to determine, for each structural construction to be applied and the associated list of options, a set of rules, to determine a set of observations from the set of rules and the concept identifier(s) associated with the characteristic to which the structural construction to be applied must be applied, and to apply the non-monotonic probabilistic logical inference engine with the set of rules and the set of observations in order to determine the structural construction to be applied to the transient and the list of options,

- the rectifier defines frequency rules from the lexical characteristics of the transient, neighborhood rules from the concept identifiers of the lexical characteristics of the transient and the concept identifiers associated with them in the concept database at a chosen distance, and structural rules drawn from semantic attributes of non-lexical characteristics of the transient linking two lexical characteristics,

- the combiner is arranged to store only the copy of the transient with the list of options associated with the structural construction, and, when the pairer does not return any structural construction to be applied to the transient, to replace the current transient with the copy of the most recent transient, and to execute the combiner with the first construction from the list of options as the structural construction to be applied to the transient, then repeating the application of the rectifier, the pairer and the combiner with the resulting transient,

- the combiner is arranged to store the remaining structural constructions to be applied to the transient at the same time as the copy of the transient with the list of options associated with the structural construction, and, when the pairer does not return any structural construction to be applied to the transient, to replace the current transient with the most recent copy of the transient, and to execute the combiner with the first construction from the list of options as the structural construction to be applied to the transient, as well as the remaining structural constructions, then repeating the application of the rectifier, the pairer and the combiner with the resulting transient,

- the device is arranged to analyze the returned tree and to produce a semantic graph whose nodes are formed by the lexical characteristics and their concept identifiers, and whose links are defined by the semantic attributes of the non-lexical characteristics linking two lexical characteristics, and

- the inference engine is arranged to apply a multivalued logic operator chosen from the group comprising the Łukasiewicz t-norm, the minimum t-norm and the Hamacher product.
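For reference, the three multivalued logic operators named above have standard definitions on truth values between 0 and 1. The following sketch gives these textbook definitions; the function names are chosen for this illustration and are not taken from the description.

```python
def lukasiewicz_t_norm(a: float, b: float) -> float:
    """Lukasiewicz t-norm: max(0, a + b - 1)."""
    return max(0.0, a + b - 1.0)


def minimum_t_norm(a: float, b: float) -> float:
    """Minimum (Goedel) t-norm: min(a, b)."""
    return min(a, b)


def hamacher_product(a: float, b: float) -> float:
    """Hamacher product: a*b / (a + b - a*b), taken as 0 when a = b = 0."""
    if a == 0.0 and b == 0.0:
        return 0.0
    return (a * b) / (a + b - a * b)
```

All three operators coincide with the classical conjunction when the truth values are restricted to 0 and 1.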

The invention also relates to an automatic text processing method implemented by computer, comprising the following operations:

a) receive text data to be analyzed in the form of tokens each comprising a character string and a unique token identifier, a concept database associating character strings and concept identifiers, at least some of the concept identifiers being associated with each other, model lexical construction data and model structural construction data, each comprising one or more conditions of application to a characteristic and one or more conclusions constituting elements to be applied to a characteristic, and an observation database associating at least two concept identifiers, a type of relation and an observation value indicating a probability of veracity of the type of relation between the at least two concept identifiers,

b) initialize a transient which may include lexical characteristics and structural characteristics produced by applying model lexical constructions and model structural constructions, with lexical characteristics comprising, for each token, the concept identifier which is associated with the character string of the token and whose frequency is the highest in the concept database,

c) work repetitively on the transient by repeating the following successive operations:

c1) determine for each lexical characteristic of a transient a list of concept identifiers associated with the concept identifier of this lexical characteristic, determine a set of observations corresponding to the concept identifiers of the lists thus determined, apply a non-monotonic probabilistic logical inference engine to determine the concept identifier of each list such that the observation values associated with these concept identifiers minimize a cost function defined by applying a multivalued logic operator to one or more rules taken from the content of the characteristics of the transient and instantiated with corresponding observation values, and replace the concept identifiers of the lexical characteristics of the transient with the concept identifiers thus determined,

c2) determine, among the model structural constructions, those whose condition(s) apply to one or more of the characteristics of the transient, and return the list of structural constructions with the characteristic(s) to which their conditions apply,

c3) rank the structural constructions associated with each characteristic by frequency of use, the first being the structural construction to be applied to the transient and the others forming a list of options,

c4) execute sequentially the selection of a structural construction to be applied to the transient, the storage of a copy of the transient with the list of options associated with the structural construction, and the application of the structural construction to be applied to the transient to the characteristic or characteristics of the transient with which this structural construction has been associated in operation c2), and repeat this sequential execution on the transient thus modified with the next structural construction to be applied to the transient,

c5) determine, after the execution of operation c4), whether the characteristics of the transient produced define a tree in which all the nodes are linked together and which is devoid of cycles, return this tree if this is the case, and repeat operations c1) to c5) on the current transient otherwise,

c6) if operation c2) does not return any structural construction to be applied to the transient, replace the current transient with the most recent copy of the transient, execute operation c5) with the first construction in the list of options as structural construction to be applied to the transient.

In various variants, the process may have one or more of the following characteristics:

- the application of the non-monotonic probabilistic logical inference engine includes the application of an optimizer using the alternating direction method of multipliers (ADMM),

- the method further comprises, between operation c3) and operation c4):

c7) determine, for each structural construction to be applied and the associated list of options, a set of rules, determine a set of observations from the set of rules and the concept identifier(s) associated with the characteristic to which the structural construction to be applied must be applied, and apply the non-monotonic probabilistic logical inference engine with the set of rules and the set of observations to determine the structural construction to be applied to the transient and the list of options,

- operation c1) includes the definition of frequency rules from lexical characteristics of the transient, neighborhood rules from the concept identifiers of the lexical characteristics of the transient and the concept identifiers associated with them in the concept database at a chosen distance, and from structural rules drawn from semantic attributes of non-lexical characteristics of the transient linking two lexical characteristics,

- operation c4) stores only the copy of the transient with the list of options associated with the structural construction, and, after the execution of operation c1), operations c1) to c7) are repeated on the current transient,

- operation c4) stores the remaining structural constructions to be applied to the transient at the same time as the copy of the transient with the list of options associated with the structural construction, and, after the execution of operation c1), operation c5) is applied with the remaining structural constructions, then by repeating the application of the rectifier, then, operations c1) to c1) are repeated on the current transient, and

- the method further comprises the following operation:

d) analyze the tree returned by operation c6) and produce a semantic graph whose nodes are formed by the lexical characteristics and their concept identifiers, and whose links are defined by the semantic attributes of the non-lexical characteristics linking two lexical characteristics.

Other characteristics and advantages of the invention will appear more clearly on reading the description which follows, taken from examples given by way of illustration and not limitation, drawn from the drawings in which:

- Figure 1 shows a schematic view of a device according to the invention,

- Figure 2 shows an example of implementation of an automatic text processing function by the device of Figure 1,

- Figures 3 to 7 show examples of the implementation of operations in Figure 2,

- Figures 8 to 17 show processing steps of the function of Figure 2 on a simplified example,

- Figure 18 shows a tree returned by the device according to the invention, and

- Figure 19 shows a semantic graph drawn from the tree of Figure 18.

The drawings and the description below essentially contain elements of a definite nature. They can therefore not only serve to better understand the present invention, but also contribute to its definition, if necessary.

This description is likely to involve elements that may be protected by author's rights and/or copyright. The rights holder has no objection to identical reproduction by anyone of this patent document or its description, as it appears in the official records. For the rest, the rights holder fully reserves all rights.

FIG. 1 represents a schematic view of a device according to the invention. The device 2 comprises a memory 4, a rectifier 6, a pairer 8, a filter 10, a combiner 12 and a validator 14.

In the context of the invention, the memory 4 can be any type of data storage suitable for receiving digital data: hard disk, solid-state drive (SSD), flash memory in any form, random access memory, magnetic disk, storage distributed locally or in the cloud, etc. The data calculated by the device can be stored on any type of memory similar to memory 4, or on memory 4 itself. This data can be deleted after the device has completed its tasks, or kept.

In the example described here, the memory 4 receives permanent data which is used to implement the device 2 and can be enriched as it is executed. In the example described here, the memory 4 also receives temporary data or work data, which are generated for the needs of a given execution of the device 2, and which are not kept after this execution. The permanent data and the working data can be stored on the same memory 4 or on separate memories.

The permanent data includes a semantic concept database based at least in part on the WordNet database (see https://wordnet.princeton.edu/) for the English language (other databases may be used for other languages). The permanent data also contains, non-exhaustively, the following data, which will be defined in detail below:

- model lexical constructions for a given language, i.e. a condition based on the lexical category of a word (its part of speech), and a conclusion making it possible to characterize the lexical characteristic by its character string, its part of speech, its unique token identifier and a concept identifier,

- model structural constructions for a given language, that is to say one or more conditions which link a lexical or structural characteristic and another lexical or structural characteristic and reflect the syntactic structuring of the given language, such as the fact that an adjective is placed before or after the noun, or the structure of a subordinate clause of cause, etc.,

- filtering rules, each of which formulates a rule associated with a structural construction, in order to resolve a lexical ambiguity by means of semantics, and

- an observation database, whose entries are quadruplets associating a type of relation, two concept identifiers, and an observation value between 0 and 1 (0 signifying that the proposed relation is false, 1 that it is true, and intermediate values the probability that it is true). For example, for the type of relation Est_Un (A, B) (that is, "A is a B"), the value of the observation Est_Un (Einstein, man) would be 1, that of Est_Un (Einstein, dog) would be 0, and that of Est_Un (Einstein, genius) would be 0.95. Under a "closed world" assumption, if two concepts are not linked, the observations connecting them are initialized to 0. This observation database can be completed from the content of the concept database in order to deduce logical relationships between concepts. Observations could link more than two concept identifiers; however, such a relation can be reformulated as a set of several observations linking the concept identifiers two by two.
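As an illustration, the observation quadruplets and the closed-world default described above can be sketched as follows. The class name ObservationBase, its methods and the use of plain strings as concept identifiers are assumptions made for this sketch and are not taken from the description.

```python
from typing import Dict, Tuple


class ObservationBase:
    """Hypothetical store of (relation, concept A, concept B) -> observation value."""

    def __init__(self) -> None:
        self._values: Dict[Tuple[str, str, str], float] = {}

    def add(self, relation: str, a: str, b: str, value: float) -> None:
        # Observation values are probabilities of veracity, between 0 and 1.
        if not 0.0 <= value <= 1.0:
            raise ValueError("observation value must lie between 0 and 1")
        self._values[(relation, a, b)] = value

    def value(self, relation: str, a: str, b: str) -> float:
        # Closed world: concepts that are not linked yield an observation of 0.
        return self._values.get((relation, a, b), 0.0)


observations = ObservationBase()
observations.add("Est_Un", "Einstein", "man", 1.0)
observations.add("Est_Un", "Einstein", "genius", 0.95)
```

In this sketch, Est_Un (Einstein, dog) is never stored, and its value of 0 results solely from the closed-world default.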

In the context of the invention, the rectifier 6, the pairer 8, the filter 10, the combiner 12 and the validator 14 are elements directly or indirectly accessing memory 4. They can be produced in the form of appropriate computer code executed on one or more processors. By processor is meant any processor suited to the calculations described below. Such a processor can be produced in any known manner, in the form of a microprocessor for a personal computer, a dedicated chip of the FPGA or SoC ("system on chip") type, a computing resource on a grid, a microcontroller, or any other form capable of providing the computing power necessary for the embodiment described below. One or more of these elements can also be produced in the form of specialized electronic circuits such as an ASIC. A combination of processors and electronic circuits can also be envisaged.

FIG. 2 represents an example of the implementation of an automatic text processing function by the device in FIG. 1.

The role of the function in Figure 2 is fundamental. Indeed, it is a loop which calls the rectifier 6, the pairer 8, the filter 10, the combiner 12 and the validator 14 in order to gradually create a semantic graph which represents the meaning of the sentence received as input.
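Purely as an aid to reading, the control flow of such a loop can be sketched as follows. The function run_loop and its parameters are hypothetical: the rectifier, the pairer, the combiner and the validator are passed as plain callables, the filter is omitted, and skipping stored copies whose list of options is exhausted is an assumption of this sketch rather than a statement of the description.

```python
import copy


def run_loop(transient, rectify, pair, combine, is_tree, max_steps=1000):
    """Hypothetical skeleton: rectify, pair, combine, validate, with backtracking."""
    saved = []  # stack of (copy of the transient, remaining options)
    for _ in range(max_steps):
        transient = rectify(transient)
        candidates = pair(transient)  # assumed ranked by frequency of use
        if candidates:
            chosen, options = candidates[0], list(candidates[1:])
            saved.append((copy.deepcopy(transient), options))
        else:
            # Dead end: go back to the most recent copy that still has options.
            while saved and not saved[-1][1]:
                saved.pop()
            if not saved:
                return None
            previous, options = saved.pop()
            transient = copy.deepcopy(previous)
            chosen, options = options[0], options[1:]
            saved.append((previous, options))
        transient = combine(transient, chosen)
        if is_tree(transient):
            return transient
    return None


# Toy usage: the transient is a tuple of applied construction names.
def toy_pair(state):
    return {(): ["x", "a"], ("a",): ["b"]}.get(state, [])


result = run_loop((), lambda s: s, toy_pair, lambda s, c: s + (c,),
                  lambda s: s == ("a", "b"))
```

In the toy usage, the construction "x" leads to a dead end, so the sketch returns to the stored copy and tries the option "a" instead.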

Thus, with this function, the device 2 provides a fundamental building block for NLP because it is almost devoid of heuristics, and therefore offers a repeatable way of creating a language substrate that gives machines access to the semantic understanding of texts. The device 2 finds a particularly effective application in the field of question answering. Indeed, it makes it possible first to analyze a text to establish its semantics, then to analyze a question in the same way in order to match it against this semantics and provide the answer. More generally, the device 2 relies on an existing database of semantic concepts, but is capable of enriching it through its own operation, unlike methods based on automatic training.

The function in Figure 2 begins with an operation 200 with an Init () function. An example of implementation of the Init () function is explained with FIG. 3. The role of the Init () function is to initialize the loop of FIG. 2, and in particular to initialize the main object which is modified by the loops in order to obtain the semantic graph which is the result. This object is here called a “transient”, because it is intended to evolve many times, very quickly, in order to generate the semantic graph.

As will be seen below, a transient is made up of characteristics. Each characteristic can relate either to a particular token, in which case it is called a lexical characteristic, or to a relationship between two lexical characteristics, in which case it is called a structural characteristic. A structural characteristic can itself relate to the relationship between other structural characteristics, which makes it possible to create complex semantic meanings.
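One possible in-memory representation of a transient and its characteristics is sketched below. The class and attribute names (Characteristic, Transient, token_id, relates, fields) are assumptions made for this illustration; the description does not specify a data layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Characteristic:
    name: str
    # Set only for lexical characteristics, which relate to one token.
    token_id: Optional[int] = None
    # Names of the linked characteristics, for structural characteristics.
    relates: Tuple[str, ...] = ()
    # Field values such as LexicalCategory or SemanticClass.
    fields: Dict[str, object] = field(default_factory=dict)

    @property
    def is_lexical(self) -> bool:
        return self.token_id is not None


@dataclass
class Transient:
    characteristics: Dict[str, Characteristic] = field(default_factory=dict)

    def add(self, characteristic: Characteristic) -> None:
        self.characteristics[characteristic.name] = characteristic

    def lexical(self) -> List[Characteristic]:
        return [c for c in self.characteristics.values() if c.is_lexical]

    def structural(self) -> List[Characteristic]:
        return [c for c in self.characteristics.values() if not c.is_lexical]


tst = Transient()
tst.add(Characteristic("balloon", token_id=0))
tst.add(Characteristic("red", token_id=1))
# A structural characteristic linking the two lexical characteristics.
tst.add(Characteristic("group", relates=("balloon", "red")))
```

Because a structural characteristic refers to other characteristics by name, it can equally link other structural characteristics, as the paragraph above indicates.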

This way of working also makes it possible to construct the semantic graph in a way that is both syntactic and semantic, which respects the nature of language. Finally, this also makes it possible to deal with the case of multilingual sentences. Indeed, if part of a sentence is not resolved in a given language, it can be analyzed with the characteristics of another language in order to identify a subset of the sentence in a second language which gives semantic meaning to a sentence that did not have one in the first language. This is completely new in the NLP field, and is completely inaccessible to solutions based on automatic training.

Characteristics, whether lexical or structural, are the result of the application of linguistic objects called constructions. These objects, well known in the field of linguistic analysis, have never found an effective computer application until today.

The constructions are based on a condition(s)/conclusion(s) pair. In other words, a construction is an object which behaves as follows: if lexical or structural characteristics meet the condition(s), then the conclusion(s) of the construction are applied to them. Thus, each time a loop finds constructions which apply to the characteristics of the transient, it completes those characteristics or creates one or more additional characteristics in the transient if these do not already exist. The semantic graph is thus constructed step by step, starting from the words and assigning them the meaning they have in the sentence, then grouping them in nominal groups, verbal groups, then nominal and verbal sentences, etc., until the sentence is fully defined. As will be seen, the function of Figure 2 is particularly powerful because it nevertheless makes it possible, during its execution, to reevaluate the meaning associated with a word and to propagate the consequence of this change throughout the semantic graph.
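The condition/conclusion behaviour described above can be illustrated by the following minimal sketch. It assumes, for simplicity, that a characteristic is a plain dictionary of field values and that a construction holds one dictionary of conditions and one dictionary of conclusions; it only covers the completion of existing characteristics, not the creation of new ones. All names are hypothetical.

```python
def matches(characteristic, conditions):
    """True if the characteristic holds every field value required by the conditions."""
    return all(characteristic.get(key) == value for key, value in conditions.items())


def apply_construction(transient, construction):
    """Complete each matching characteristic with the conclusions it lacks.

    Returns the number of field values added to the transient.
    """
    added = 0
    for characteristic in transient:
        if matches(characteristic, construction["conditions"]):
            for key, value in construction["conclusions"].items():
                if key not in characteristic:
                    characteristic[key] = value
                    added += 1
    return added


transient = [
    {"String": "balloon", "LexicalCategory": "noun"},
    {"String": "red", "LexicalCategory": "adjective"},
]
construction = {
    "conditions": {"LexicalCategory": "noun"},
    "conclusions": {"SemanticClass": "object"},  # illustrative value only
}
first_pass = apply_construction(transient, construction)
second_pass = apply_construction(transient, construction)
```

The second pass adds nothing, which reflects the fact that conclusions are only applied where they do not already exist.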

The conditions and conclusions of the constructions consist of fields whose values are fixed or variable.

Conditions typically relate to one or more of the fields of one or more of the following types:

- Borders, which uses predicates of the same name to define the bounds of a group of elements,

- SubUnits, which receives a list of elements designated by the element,

- LexicalCategory, which defines a lexical category (for example, noun, verb, common noun, proper noun, adverb, article, adjective, etc.),

- ClasseLexicale, which defines a lexical class (for example, transitive verb, intransitive verb, auxiliary, determiner, etc.),

- CategoryPhrasale, which defines groups of words (for example, nominal sentence, verbal sentence, etc.), and

- ClausalCategory, which defines a category of clause and allows groups of words to be grouped together, for example nominal sentences, verbal sentences, etc.

The conclusions typically relate to one or more of the fields of one or more of the following types:

- Referent, which defines a variable identifier which unites several elements at the same level,

- Args, which defines an argument constituting links between elements,

- Parent, which defines the parent element of the current element in the structure,

- Meaning, which defines a relational meaning (for example a temporal - simultaneous relation, time reference point, etc.), and

- SemanticClass, which defines a semantic class.

Conditions and conclusions can have many other fields, such as Form (relating to a form attribute), String (a character string), Precede (a predicate linking two elements), Suit (a predicate linking two elements), Query (indicates that a group of elements includes an interrogative verb, pronoun or adverb), Preposition, TypeDePhrase (for example, nominal sentence, etc.), Number, Date, Person, ValenceSemantique (for example, actor and its element identifier, etc.), SyntacticValence (for example, subject or object and its element identifier), SyntacticFunction (for example adjective, auxiliary, verb, etc.), Time, Voice (active or passive), PassiveForm (yes or no), etc.

For example, the "Compound" type construction can be defined as follows:

{
  "id": "compound-names",
  "score": 0.0,
  "type": "Phrasal",
  "description": "compound name rules",
  "category": "in",
  "group": "stage1",
  "constructionClass": "Classic",
  "locks": [
    {
      "name": "?noun1",
      "comprehension": {
        "name": "?noun1",
        "map": {
          "LexicalCategory": "[name]"
        }
      }
    },
    {
      "name": "?noun2",
      "comprehension": {
        "name": "?noun2",
        "map": {
          "LexicalCategory": "[name]"
        }
      }
    },
    {
      "name": "?compound",
      "comprehension": {
        "name": "?compound",
        "map": {
          "Form": {
            "name": "?v2",
            "map": {
              "Attached": "[[?noun1, ?noun2]]"
            }
          }
        }
      }
    }
  ],
  "conclusions": [
    {
      "name": "?compound",
      "map": {
        "Agreement": "?v5",
        "Args": "[?v9, ?v7]",
        "Referrer": "?ref",
        "SemanticClass": "identify",
        "Parent": "?v10",
        "Meaning": "[[attribute, ?v9, ?v6]]",
        "SubUnits": "[?noun1, ?noun2] // union",
        "CategoryLexical": "[name, compound]"
      }
    }
  ]
}

In this construction, the "?" prefix indicates a variable, which can appear both in the conditions and in the conclusions, and which is used to define how the construction applies.

It should also be noted that the function of Figure 2 has a quasi-recursive nature. This makes it less easy to understand than a conventional sequential function and must be taken into account when reading the following. It is for this reason that Figures 8 to 17 are provided. They do not in themselves provide teaching on the technique of device 2, but they allow a better understanding of how it explores all the possibilities for establishing the semantic graph.

The Init () function begins with an operation 300 in which an array Tk [] is initialized by a Lex () function. The Lex () function performs the lexical analysis of a sentence received as input by device 2, and provides an array Tk [] in which the sentence is cut into standardized tokens. The Tk [] array stores for each token the corresponding character string and a unique identifier for this token in the sentence. This result is also stored in a table Tst_Stack [] which will be described below.

The Lex () function implements a lexical analyzer to produce a sequence of standardized tokens. The notion of standardization refers to the fact that certain words can be written in several forms (for example contractions in English), or that certain characters must be deleted or grouped. Thus the lexical analyzer performs one or more of the following functions:

- removal of unwanted characters (footnote markers, special characters, etc.),

- splitting of a text into sentences (by means of delimiters such as the period, the exclamation mark, etc.),

- grouping of special expressions (dates, etc.),

- expansion of contracted words (for example "don’t" becomes "do not"),

- division of the sentence into standardized tokens for processing by the device 2.

The lexical analyzer is not the subject of the invention and a person skilled in the art knows several solutions in the state of the art for implementing it.
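For illustration only, a very reduced lexical analyzer covering the expansion of contractions and the division into tokens with a unique identifier might look as follows. The function name lex, the contraction table and the tokenization pattern are assumptions of this sketch; any known lexical analyzer could be used instead.

```python
import re

# Hypothetical, deliberately tiny contraction table.
CONTRACTIONS = {"don't": "do not", "can't": "cannot"}


def lex(sentence):
    """Return a list of (token identifier, character string) pairs."""
    # Normalize typographic apostrophes before looking up contractions.
    text = sentence.replace("\u2019", "'")
    for short, full in CONTRACTIONS.items():
        pattern = r"\b" + re.escape(short) + r"\b"
        text = re.sub(pattern, full, text, flags=re.IGNORECASE)
    # Words are kept whole; each punctuation mark becomes its own token.
    strings = re.findall(r"\w+|[^\w\s]", text)
    return list(enumerate(strings))


tokens = lex("I don\u2019t know.")
```

The position of a token in the sentence is used here as its unique identifier, which is one simple way of satisfying the requirement that each token be identified uniquely within the sentence.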

Then, in an operation 310, form predicates are initialized by an SFP () function which receives the table Tk [] as a variable. The SFP () function takes the table Tk [] and produces predicates describing the positions of the tokens relative to each other. Thus, for two tokens [balloon] [red], the SFP () function creates a predicate of type Attached ([balloon], [red]) and a predicate of type Precede ([balloon], [red]). These predicates therefore indicate that the token [balloon] is attached to the token [red] and that it precedes it directly. In the example described here, the Precede () predicate is generated for all tokens downstream of an upstream token. The SFP () function is designed to also generate Borders () predicates, which indicate the start and end indices of a chain of several tokens. The set of predicates thus produced is stored in memory 4, and is accessed to determine whether conditions apply, as described below.
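The generation of form predicates can be sketched as follows. This sketch rests on several assumptions that are not stated in the description: predicates are represented as tuples referring to token identifiers, the predicate generated for all downstream tokens is taken to be Precede, and a single Borders predicate is emitted for the whole chain of tokens. The function name sfp is likewise chosen for the illustration.

```python
def sfp(tokens):
    """Build form predicates from a list of (token identifier, string) pairs."""
    ids = [token_id for token_id, _ in tokens]
    predicates = []
    # Attached: one predicate per pair of directly adjacent tokens.
    for left, right in zip(ids, ids[1:]):
        predicates.append(("Attached", left, right))
    # Precede (assumed): one predicate per upstream/downstream pair.
    for i, upstream in enumerate(ids):
        for downstream in ids[i + 1:]:
            predicates.append(("Precede", upstream, downstream))
    # Borders (assumed form): start and end identifiers of the whole chain.
    if ids:
        predicates.append(("Borders", ids[0], ids[-1]))
    return predicates


predicates = sfp([(0, "balloon"), (1, "red")])
```

For the two tokens [balloon] [red], the sketch yields one Attached predicate, one Precede predicate and one Borders predicate.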

A loop is then launched to analyze each lexical characteristic of the transient in order to initialize the concept identifiers. At this stage, and for the last time in the loop, each characteristic of the transient can be identified with a single concept. Thus, in an operation 320, the transient Tst is unstacked, and in an operation 330, a Find () function determines, for the current token, the syntax identifier in the concept database of memory 4 which has the highest frequency, and stores it in an array Ltk []. For example, if the token is "red", the Find () function will return the syntax identifier associated with the adjective "red" rather than that associated with the color "Red", because the word is used more often as an adjective than as a noun. Simultaneously, the Find () function assigns the most frequent concept identifier among the concept identifiers associated with this syntax identifier. Finally, the other syntax identifiers are stored as options in a table OC []. These options will be stored in the table Tst_Stack [] in an operation 350 described below. When all the tokens have been processed in this way, the Ltk [] array is supplied as an argument to a LexConstr () function in an operation 340. The LexConstr () function returns a list of lexical constructions Cstr [] which will initialize the lexical characteristics of the transient.

This is done in an operation 350 in which a Merge () function receives the list of lexical constructs Cstr [] from operation 340 and the transient Tst, and combines them. Here again, since this is the first operation, the combination is guaranteed, that is to say that the condition of each lexical construction of the list Cstr [] is necessarily fulfilled by a characteristic of the transient Tst, since they were chosen specifically for this. As noted above, optional lexical features are generated and stored in a Tst_Stack [] array. These options can be explored when a problem is identified. This is notably ensured by a function Bck () in an operation 290. This will be described in more detail with the description of the combiner 12 in relation to FIG. 7.

At output, the transient Tst is therefore initialized with the lexical characteristics corresponding to each token, with the unique token identifier and the concept identifier which has been determined to be the most likely. The predicates of type Attached () and Precede () are also stored for subsequent use, and the function ends in an operation 399.
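The frequency-based choice made during initialization can be illustrated as follows. The layout of `CONCEPT_DB`, the frequency figures, and the concept names are invented for the example, and taking the frequency of a syntax identifier as the sum of the frequencies of its concepts is an assumption of this sketch.

```python
# Hypothetical concept database: string -> {syntax_id: {concept_id: frequency}}
CONCEPT_DB = {
    "red": {
        "adjective": {"red_colour_property": 90},
        "noun": {"red_colour": 30},
    },
    "match": {
        "verb": {"to_correspond": 50, "to_pair": 20},
        "noun": {"fire_stick": 40, "sports_game": 25},
    },
}


def find(token):
    """Return (most frequent syntax id, its most frequent concept, other syntax ids)."""
    entries = CONCEPT_DB[token]
    # rank syntax identifiers by total frequency (assumption of this sketch)
    ranked = sorted(entries, key=lambda s: sum(entries[s].values()), reverse=True)
    best_syntax = ranked[0]
    concepts = entries[best_syntax]
    best_concept = max(concepts, key=concepts.get)
    # the remaining syntax identifiers are kept as options for backtracking
    return best_syntax, best_concept, ranked[1:]
```

With these invented figures, "match" is first read as a verb and the noun reading is kept as an option, which is the kind of situation the backtracking of operation 290 is designed to recover from.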

After the initialization operation 200, the loop of Figure 2 begins with an operation 205 in which a Max () function determines whether an output condition related to an excessive number of loop executions is fulfilled. This avoids getting stuck in a too long calculation loop (for example beyond 1000 iterations). When this condition is met, the function of Figure 2 ends in operation 299 with an error. Alternatively, operation 205 can be omitted.

Then a new loop begins. This loop begins in operation 210 with the execution of a Sem () function. In the example described here, the Sem () function is implemented by the rectifier 6, and FIG. 4 gives an exemplary embodiment. From a general point of view, the goal of the Sem () function is to analyze the current transient, which has just been enriched by the previous loop, and to determine whether one or more of the concept identifiers of the lexical characteristics should be changed in view of the structural characteristics of the transient. In other words, the Sem () function "shakes" the bag of concept identifiers available for each lexical characteristic, in order to determine whether a different concept would give more meaning, from a semantic point of view, to the sentence described by the transient at this stage.

Thus, in an operation 400, the rectifier 6 creates a Concept table [] which collects all the concept identifiers of the lexical characteristics of the current transient Tst.

Then, in an operation 410, an Extrap () function determines, for each of these concept identifiers, the list of concept identifiers which are linked to it in the concept database, and stores each list in an entry of a table Candid []. Based on the Candid [] array, an Observ () function collects all the observations related to each of the concepts in each list of the Candid [] array and groups them into an Obs [] array in an operation 420.

Finally, in an operation 430, an Infer () function uses the Candid [] table and the Obs [] table to modify the current transient, and the function ends with an operation 499.

More precisely, the function Infer () uses a non-monotonic probabilistic logical inference engine which calculates a multitude of cost functions as a function of the observations of the table Obs [], one for each combination formed by taking one concept identifier from each list of the table Candid []. In other words, combining the lists creates a combinatorial set of concept identifiers, and the observations associated with each concept identifier are used to calculate a cost function for each combination.

The cost function is constructed by determining a plurality of rules from the transient. These rules are then evaluated on the basis of the observations by applying a multivalued logic operator which makes it possible to linearize the problem. In the preferred version of the invention, this operator is the Łukasiewicz t-norm. Alternatively, the operator could be the minimum t-norm or the Hamacher product.
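For reference, the three operators named above (usually called t-norms in the fuzzy-logic literature) have the following standard definitions on truth values in [0, 1]. The function names are chosen for this sketch, and the residual implication is added only to show how a rule of the form body => head can be scored; the description does not state that the device scores rules in exactly this way.

```python
def lukasiewicz(a, b):
    """Lukasiewicz t-norm: max(0, a + b - 1)."""
    return max(0.0, a + b - 1.0)


def minimum(a, b):
    """Minimum (Goedel) t-norm."""
    return min(a, b)


def hamacher_product(a, b):
    """Hamacher product: ab / (a + b - ab), with value 0 when a = b = 0."""
    if a == 0.0 and b == 0.0:
        return 0.0
    return (a * b) / (a + b - a * b)


def lukasiewicz_implication(body, head):
    """Residuum of the Lukasiewicz t-norm: truth value of body => head."""
    return min(1.0, 1.0 - body + head)
```

The Łukasiewicz operators are piecewise linear in their arguments, which is consistent with the remark that the chosen operator makes it possible to linearize the problem.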

These rules fall into three categories.

The first category of rules includes so-called frequency rules. They are based on the frequency of the concept identifier which is associated with the lexical characteristics present in the transient. These rules are expressed in the form CaractLex (Group (Car)) => Concept (Group (Car)). There is one rule per group of concept identifiers produced by the Extrap () function.

The second category of rules includes so-called neighborhood rules, which are based on the links between the concept identifiers of the lexical characteristics in the concept database. For these rules, the concept database is explored from each group of concept identifiers in search of "neighboring" concept identifiers in the other groups of concept identifiers, at a chosen distance. For example, if a first group contains the concept "Fred", and a second contains the concept "hold", then these concepts are at a distance of two concepts in the concept database. Indeed, "Fred" is a proper noun, associated with the concept "human being", and the concept "human being" is itself linked to the "hold" capacity since human beings hold objects. Thus, when a connection is found in the concept database between two concept identifiers of two distinct groups at the chosen distance, a neighborhood rule is created. These rules are expressed in the form:

CaractLex (Group (Car1)) & CaractLex (Group (Car2)) & Link (Group (Car1), Group (Car2)) => Concept (Group (Car2))

Finally, the third category of rules includes so-called structural rules, because they stem from the semantic links established between the lexical characteristics within the transient. These rules are therefore drawn from the semantic attributes of the characteristics resulting from the structural constructions which link two lexical characteristics together. For example, if it has been identified that the lexical characteristic associated with the chain "Fred" is linked to the lexical characteristic associated with the chain "holds" by a semantic attribute of type "actor", then a corresponding rule replaces the second category rule which linked these two lexical characteristics. These rules are therefore expressed in the form:

CaractLex (Group (Car1)) & CaractLex (Group (Car2)) & AttSem (Group (Car1), Group (Car2)) => Concept (Group (Car2))

The cost function instantiates these rules with the observations chosen as a function of the combinatorics of concept identifiers of each group resulting from the Extrap () function. The optimization of this cost function makes it possible to determine the combination of each concept identifier of each list which offers the best semantics for the current transient, the lexical characteristics of which are thus updated with the new concept identifiers which are considered to be more relevant.
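To make the combinatorial nature of this optimization concrete, the sketch below enumerates every combination of one concept identifier per group and scores it with a Łukasiewicz-style penalty. It is an exhaustive search written for illustration only, not the inference engine of the device; the penalty, the data layout, and all names and values are assumptions.

```python
from itertools import product


def implication_penalty(body, head):
    """Distance to satisfaction of body => head under Lukasiewicz logic."""
    return max(0.0, body - head)


def best_assignment(candidates, freq, link_obs):
    """candidates: dict group -> list of concept ids.
    freq: dict concept -> observation value in [0, 1] (frequency rules).
    link_obs: dict (concept_a, concept_b) -> value in [0, 1] (neighborhood rules)."""
    groups = list(candidates)
    best, best_cost = None, float("inf")
    for combo in product(*(candidates[g] for g in groups)):
        # frequency rules: a rarely observed concept is penalized
        cost = sum(implication_penalty(1.0, freq[c]) for c in combo)
        # neighborhood rules: chosen concepts should be linked to each other
        for i in range(len(combo)):
            for j in range(i + 1, len(combo)):
                obs = link_obs.get((combo[i], combo[j]), 0.0)
                cost += implication_penalty(1.0, obs)
        if cost < best_cost:
            best, best_cost = dict(zip(groups, combo)), cost
    return best, best_cost
```

With hypothetical values in which "to_grasp" and "fire_stick" are strongly linked, the search prefers that pair even though "to_correspond" is the more frequent reading of "match" taken alone. The number of combinations grows as the product of the list lengths, which is why an exhaustive search of this kind does not scale.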

It therefore appears that, when the transient contains only lexical constructions, there are only first category and second category rules, and the cost function is based on the co-occurrence of concept identifiers in the database of concept data. Then, as the structural constructions add semantic links between the lexical characteristics in the transient, third category rules, much more discriminating, are introduced in the cost function and will strongly constrain it.

The Applicant has discovered that the use of a non-monotonic probabilistic logical inference engine makes it possible for the first time to offer a satisfactory result for implementing a method based on linguistic constructions. Indeed, the Sem () function, thanks to the semantic adjustment it offers each time the loop is executed, is fundamental in obtaining a favorable result.

The Applicant has also discovered that it is particularly advantageous to use an inference engine including an optimizer using the alternating direction method of multipliers (ADMM). Indeed, the use of such an optimizer makes it possible to reduce the computation time by linearizing the problem, whereas the underlying problem is of NP type, that is to say a combination of all the variants of each list with one another, multiplied by the quantity of observations for each member of each list.

Once the semantic adjustment has been made, the processing continues with a loop which tests each of the model structural constructions on all the characteristics of the transient and determines which ones are likely to apply. For this, in an operation 220, a ConStr [] array of model structural constructions is unstacked, and in an operation 230, the structural construction c resulting from operation 220 is tested against all the characteristics of the current transient in a Match () function.

The Match () function is executed by the matcher 8, and FIG. 5 shows an example of implementation of this function. In general, the Match () function analyzes each of the conditions of the structural construction c and constructs the tuples of characteristics which satisfy these conditions. Thus, in an operation 500, the list L [] of the conditions of construction c is retrieved by means of a Locks () function. Then, a loop is launched in which this list is unstacked in an operation 510 and the characteristics of the current transient are each compared to the current condition l. For this, the transient is unstacked in an operation 520 and the corresponding characteristic f is compared with the current condition l in an operation 530. If the characteristic f does not satisfy condition l, then the following characteristic is tested by repeating operation 520. If the characteristic f satisfies the condition, then an AddFt () function is executed in an operation 540. The AddFt () function adds to an array m [] all the groups of characteristics which satisfy a condition of the structural construction. Thus, when a characteristic satisfies condition l, the function AddFt () determines those of the groups of the array m [] which are compatible with this characteristic taking into account all the conditions of the structural construction, and adds the characteristic f to all compatible groups of characteristics. After operation 540, or if the test of operation 530 is negative, the loop resumes with the following characteristic in operation 520. When all the characteristics have been tested for the current condition l, the loop is repeated with the following condition by repeating operation 510. When all the conditions have been tested, in an operation 550, a function Rem () reduces the table m [] by checking the groups of characteristics produced and keeping only those which are complete, i.e. which fulfill all the conditions of the structural construction. These groups therefore form tuples of characteristics to which the structural construction is likely to apply, then the function ends in an operation 599.

Note that the function of Figure 5 can be implemented in many ways. For example, the loop could be arranged so as to skip the test of a characteristic as soon as it is detected that it does not satisfy a condition, for example by testing the value associated with it in the table m [] at the start of the loop. Other variants may be considered.
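One such variant is sketched below. Rather than growing groups incrementally as AddFt () does, it first collects, for each condition, the characteristics that satisfy it, then keeps only the complete tuples. Conditions are modeled as plain Python predicates and `compatible` is a hypothetical joint check; neither the encoding nor the names come from the description.

```python
from itertools import product


def match(conditions, features, compatible=lambda group: True):
    """conditions: list of predicates f -> bool, one per slot of the construction.
    features: list of characteristics of the current transient.
    compatible: hypothetical joint check applied to each complete tuple."""
    # for each condition, the characteristics that satisfy it
    slots = [[f for f in features if cond(f)] for cond in conditions]
    tuples = []
    # product() yields nothing if any slot is empty, so only complete
    # groups (one characteristic per condition) are ever produced
    for group in product(*slots):
        distinct = len(set(map(id, group))) == len(group)
        if distinct and compatible(group):
            tuples.append(group)
    return tuples
```

For example, a construction with the two conditions "is an adjective" and "is a noun" applied to the characteristics of "small", "match" (noun) and "Fred" (proper noun) yields the single tuple (small, match).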

The characteristics which correspond to construction c are then stored in a table Con [] in an operation 235, then the loop resumes with operation 210. Once all the model structural constructions have been tested, the table Con [] is tested in an operation 240. If the table Con [] contains constructions, then operations 250 to 270 analyze these constructions, choose the most relevant ones, and establish a list of options in case the chosen constructions lead to a dead end in the following loops. If this table is empty, then no structural construction can be applied to the characteristics of the transient. Since this test is inside a loop, this means that the semantic graph has not been fully resolved, and the fact that the table Con [] is empty indicates that it cannot be completed from the current transient. It will therefore be necessary to explore the options established in the previous loop or loops. This is done in operation 290.

In operation 250, a function Ord () processes the table Con [] and produces two tables C2M [] and OC []. The table C2M [] contains the list of the most probable structural constructions, while the table OC [] contains the list of options. More precisely, the Ord () function initializes a first list by removing the first tuple from the array Con []. Then it iterates through all the other tuples in the array Con [] and, each time a tuple concerns a characteristic of the first tuple, it introduces it into the first list and removes it from the table Con []. Once all the tuples have been browsed, the operation is repeated with the rest of the table Con [], until it is empty. This results in a C2M [] table containing the tuples which were used to generate the lists, and an OC [] table which contains the tuples which have been progressively added to each list, as options to the corresponding tuple of table C2M [].
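The grouping performed by Ord () can be sketched as follows, assuming that each entry of Con [] is a pair (construction name, identifiers of the characteristics concerned) and that Con [] is already ordered by decreasing frequency of use. The names and the data layout are illustrative.

```python
def ord_constructions(con):
    """con: list of (construction_name, feature_ids), most frequent first.
    Returns (c2m, oc) where oc[i] lists the options attached to c2m[i]."""
    remaining = list(con)
    c2m, oc = [], []
    while remaining:
        head = remaining.pop(0)            # first tuple starts a new list
        head_ids = set(head[1])
        # tuples sharing a characteristic with the head become its options
        options = [t for t in remaining if head_ids & set(t[1])]
        remaining = [t for t in remaining if not (head_ids & set(t[1]))]
        c2m.append(head)
        oc.append(options)
    return c2m, oc
```

Keeping `oc` aligned index by index with `c2m` makes it straightforward to retrieve the options of a given construction when it later has to be reconsidered.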

In operation 260, the C2M [] and OC [] arrays are processed in a Filt () function by filter 10 to determine if there is reason to believe that the frequency choice was not the right one. More precisely, in the case of tables C2M [] and OC [], we find ourselves in a situation where several structural constructions correspond to the same characteristic. In other words, there is a lexical ambiguity, and the Filt () function will try to resolve it by a semantic analysis. Figure 6 shows an example of implementation of the Filt () function.

The Filt () function is a loop which analyzes each construction of the C2M [] array by unstacking it in an operation 610. Then, in an operation 620, an Opt () function generates an array N [] which receives the current construction c and all the options corresponding to it in the OC [] table. Then, in an operation 620, a Rules () function generates a table R [] of analysis rules. For example, in the case where two nominal groups are separated by a comma, it is necessary to determine whether this is a list or an apposition. For this, the structural constructions corresponding to the list on the one hand and to the apposition on the other hand are translated into two rules which are stored in the table R [] with the concept identifiers attached to the characteristic concerned. In the example described here, these rules are drawn from the filtering rules in memory 4. More precisely, the rules define predicates between the characteristics. However, if there is no rule for any of the constructions of the array N [], then nothing is introduced in the array R [] for the construction c. Then, in an operation 630, an observation table Obs [] is generated by an Observ () function in order to determine the observations related to the concept identifier of the characteristic concerned by the construction c and to each of the rules of the table R []. Finally, in an operation 640, the inference engine is again applied to the array R [] and the array Obs []. If the array R [] is empty, then nothing is done and the order established with the arrays C2M [] and OC [] is maintained. Otherwise, the inference engine makes it possible to determine which of the constructions is the most semantically relevant. This results in a table Con2 [] of the constructions to be applied to the current transient and a table OC [] of options. When all the constructions in table C2M [] have been processed, the function ends in operation 699.

Alternatively, the Filt () function could be omitted. Once the Filt () function has been executed, the constructions chosen from the table Con2 [] are applied to the current transient in an operation 270 by the combiner 12 in a Merge () function. Figure 7 represents an example of implementation of the function Merge ().

Again, the table Con2 [] is unstacked in an operation 700. Then, in an operation 710, the table OC [] is also unstacked in order to store the corresponding options. Then, in an operation 720, the current transient is stored in the table Tst_Stack [] together with the options of operation 710. This operation is crucial because it is what makes it possible to traverse the options in the most complete and efficient way in operation 290. A transient coupled to a list of construction options is thus generated before each application of a construction, providing as many fallback solutions in the event of failure. In other words, the transients introduced in the table Tst_Stack [] in operation 720 are all different from each other since, between two applications of operation 720, the transient is modified in operation 730. Thus, the table Tst_Stack [] contains the detail of all the constructions applied to the transient, one by one, and ordered in time by the very nature of the loop.

Finally, in an operation 730, the conclusions of the construction unstacked in operation 700 are applied to the current transient. If these conclusions apply to an existing characteristic, then that characteristic is updated. Otherwise, a new characteristic is created in the transient. When all the constructions have been applied to the transient, the Merge () function ends in an operation 799. Once the Merge () function is finished, an operation 280 determines with a Goal () function whether the sentence has been fully resolved and the semantic graph is finished. For this, the Goal () function determines whether all the tokens have been assigned a concept identifier, whether all the characteristics define a tree with all the branches connected to each other, i.e. whether it is possible to reach all the characteristics of the tree from each characteristic, and finally whether there is no cycle in the generated structure. If this is the case, then the function of FIG. 2 ends in operation 299. Otherwise, the loop resumes with operation 205 to process the current transient.
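The three tests attributed to the Goal () function can be illustrated with the sketch below, which treats the characteristics as nodes and their links as undirected edges. It relies on the standard fact that a connected graph with n nodes and exactly n - 1 edges contains no cycle; the data layout and names are assumptions of the example.

```python
def goal(concepts, nodes, edges):
    """concepts: dict token_id -> concept id, or None if not yet assigned.
    nodes: iterable of characteristic ids; edges: list of (parent, child) pairs."""
    if any(c is None for c in concepts.values()):
        return False                      # a token still lacks a concept
    nodes = set(nodes)
    if not nodes or len(edges) != len(nodes) - 1:
        return False                      # a tree on n nodes has n - 1 edges
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    # depth-first traversal: every node must be reachable from any start node
    seen, todo = set(), [next(iter(nodes))]
    while todo:
        n = todo.pop()
        if n not in seen:
            seen.add(n)
            todo.extend(adjacency[n] - seen)
    return seen == nodes
```

Checking the edge count first lets the function reject a structure containing a cycle without any traversal.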

Operation 290 consists of unstacking the table Tst_Stack [] supplied by the execution of operation 720, and resuming the loop in operation 270 using the first option.
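The interplay between the snapshot stored in operation 720 and the fallback of operation 290 can be sketched as follows. The sketch assumes that a snapshot whose option list is exhausted is discarded in favor of an older one, and that the remaining options of a snapshot are pushed back for later use; both points, like the function names, are assumptions rather than details given in the description.

```python
import copy


def apply_with_snapshot(transient, chosen, options, apply, stack):
    """Store a copy of the transient with its options (operation 720),
    then apply the chosen construction (operation 730)."""
    stack.append((copy.deepcopy(transient), list(options)))
    return apply(transient, chosen)


def backtrack(stack):
    """Operation 290: restore the most recent snapshot that still has an option.
    Returns (transient, construction_to_apply) or None if nothing is left."""
    while stack:
        snapshot, options = stack.pop()
        if options:
            first, rest = options[0], options[1:]
            # keep the snapshot available for its remaining options
            stack.append((copy.deepcopy(snapshot), rest))
            return snapshot, first
    return None
```

In the example sentence, this is what allows the device to return to the transient preceding the choice "match, verb" and to retry with the option "match, noun".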

In the foregoing, it therefore appears that the function of FIG. 2 is a systematic algorithm based on known data and almost devoid of heuristics (the filtering rules could be described as heuristics, but they are optional). This demonstrates the repeatable nature of the processing of the device 2. In addition, the processing of the device 2 is both syntactic, through the application of structural constructions, and semantic, through the use of the rectifier 6 and the filter 10. This completely new approach, made possible by the use of a non-monotonic probabilistic logical inference engine, makes it possible to apply a model based on constructions and produces a semantic understanding.

Figures 8 to 17 are given as an example to help better understand the operation of the loop in Figure 2, and in particular operation 290, on the sentence "Fred holds a small match".

In a first step represented in FIG. 8, the device 2 initializes the transient with the lexical constructions in the order of their frequency in the concept database: "holds, verb", "small, adjective", "a, article", "match, verb" and "Fred, proper noun". As a variant, the lexical constructions could be ordered by their position in the sentence, or in an order chosen so as to determine a degree of confidence in the choice of the first lexical constructions, and to place towards the end of the transient the lexical constructions for which the degree of confidence is the lowest.

In the next loop, shown in Figure 9, an adjective phrase is created above "small", a noun phrase above "Fred", and two verb phrases above "holds" and "match" respectively.

In the loop shown in Figure 10, a clause links the noun phrase of "Fred" and the verb phrase of "holds", then the subsequent loops fail. This failure is traced back through the transients of the table Tst_Stack [] until it is determined that the error relates to the lexical construction of "match". As "Fred" is of lower rank than "match", the transient is reduced to that of Figure 11. "match, common noun" and "Fred, proper noun" are then tested as an option, and Figures 12 to 15 show the development of the constructions on this basis.

Finally, with Figure 16 the verb phrase is concluded, and the structure is closed with an affirmative clause represented in Figure 17. In order to better understand the rules of the first, second and third categories, these will be spelled out for the example of Figures 16 and 17.

In Figure 16, only one semantic link is established between the lexical characteristics: the attribute link between "small" and "match". Thus, the set of rules produced will be:

CaractLex (Group (Fred)) => Concept (Group (Fred))

CaractLex (Group (hold)) => Concept (Group (hold))

CaractLex (Group (a)) => Concept (Group (a))

CaractLex (Group (small)) => Concept (Group (small))

CaractLex (Group (match)) => Concept (Group (match))

CaractLex (Group (Fred)) & CaractLex (Group (hold)) & Link (Group (Fred), Group (hold)) => Concept (Group (hold))

CaractLex (Group (hold)) & CaractLex (Group (match)) & Link (Group (hold), Group (match)) => Concept (Group (match))

CaractLex (Group (small)) & CaractLex (Group (match)) & Attrib (Group (small), Group (match)) => Concept (Group (match))

In Figure 17, two other semantic links are established between the lexical characteristics: the actor link of "Fred" on "hold" and the theme link between "hold" and "match".

Thus, two third category rules are added and replace the second category rules in Figure 16:

CaractLex (Group (Fred)) & CaractLex (Group (hold)) & Actor (Group (Fred), Group (hold)) => Concept (Group (hold))

CaractLex (Group (hold)) & CaractLex (Group (match)) & Theme (Group (hold), Group (match)) => Concept (Group (match))
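The way the rule sets of Figures 16 and 17 are derived can be reproduced with a small sketch. Rules are encoded here as tuples (kind, body groups, head group), and a structural rule is taken to replace the neighborhood rule on the same pair of groups; the encoding and the function name are assumptions of the example.

```python
def build_rules(groups, links, sem_attrs):
    """groups: list of group names.
    links: set of (g1, g2) pairs found in the concept database.
    sem_attrs: dict (g1, g2) -> semantic attribute name from the transient."""
    # first category: one frequency rule per group
    rules = [("frequency", (g,), g) for g in groups]
    # third category: one structural rule per semantic attribute
    for (g1, g2), attr in sem_attrs.items():
        rules.append((attr, (g1, g2), g2))
    # second category: neighborhood rules, unless replaced by a structural rule
    for g1, g2 in sorted(links):
        if (g1, g2) not in sem_attrs:
            rules.append(("neighborhood", (g1, g2), g2))
    return rules
```

Called with the five groups of the example, the two links (Fred, hold) and (hold, match), and the single attribute Attrib on (small, match), the function returns the eight rules of Figure 16. Adding the Actor and Theme attributes yields the rule set of Figure 17, in which no neighborhood rule remains.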

FIG. 18 represents the complete transient associated with the example of FIG. 17, and FIG. 19 represents the semantic graph which corresponds to it. Thus, it appears that the lexical characteristics are linked together by semantic attributes in non-lexical characteristics. This is shown in Figure 18 by the binding of elements of type ?ArgXX and ?RefYY or of type ?ArgXX and ?ArgYY.

Thus, the following links appear:

Actor ?Arg27 ?Ref28, Theme ?Arg27 ?Ref28, Attr ?Arg15 ?Arg17, NonIdentifiableReferent ?Arg2 ?Arg17.

FIG. 19 translates these semantic links, and makes it possible to represent the semantic graph which describes the meaning of the sentence, as produced by the device 2. The device 2 therefore produces a final transient which has a tree structure containing all the syntactic and semantic links of the entire sentence. This tree makes it possible to produce a semantic graph which gives the meaning of the sentence.

This is fundamental because it makes it possible to create automatically, without human intervention, a semantic description layer of a text, which therefore becomes queryable. In addition, by the very nature of the device 2, this semantic description layer can be enriched incrementally, by providing new sentences, without having to redo all of the training. Moreover, because this layer can be queried, it makes it possible to analyze what the device 2 has understood, which makes it much easier to work with.

Claims

1. Device for automatic text processing by computer, comprising a memory (4) arranged to receive text data to be analyzed in the form of tokens each comprising a character string and a unique token identifier, a concept database associating character strings and concept identifiers, at least some of the concept identifiers being associated with each other, model lexical construction data and model structural construction data, each comprising one or more conditions of application to a characteristic and one or more conclusions constituting elements to be applied to a characteristic, and an observation database associating at least two concept identifiers, a type of relation and an observation value indicating a probability of veracity of the type of relation between the at least two concept identifiers, the device (2) being arranged to work repetitively on a transient comprising lexical characteristics and structural characteristics produced by applying model lexical constructions and model structural constructions, the transient being initialized with lexical characteristics comprising for each token a concept identifier whose frequency is the highest in the concept database and which is associated with the character string of the token, the device (2) further comprising:

a rectifier (6) arranged to determine for each lexical characteristic of a transient a list of concept identifiers associated with the concept identifier of this lexical characteristic, to determine a set of observations corresponding to the concept identifiers of the lists thus determined, and to apply a non-monotonic probabilistic logical inference engine to determine the concept identifier of each list such that the observation values associated with these concept identifiers minimize a cost function defined by applying a multivalued logic operator to one or more rules taken from the content of the characteristics of the transient and instantiated with corresponding observation values, and to replace the concept identifiers of the lexical characteristics of the transient by the concept identifiers thus determined,

a matcher (8) arranged to determine among the model structural constructions those whose condition or conditions apply to one or more of the characteristics of the transient, and to return the list of structural constructions with the characteristic or characteristics to which their conditions apply, the device (2) being further arranged to classify the structural constructions associated with each characteristic by frequency of use, the first being the structural construction to be applied to the transient and the others forming a list of options, and

a combiner (12) arranged to execute sequentially the selection of a structural construction to be applied to the transient, the storage of a copy of the transient with the list of options associated with the structural construction, and the application of the structural construction to be applied to the transient to the characteristic or characteristics of the transient with which this structural construction has been associated by the matcher (8), and to repeat this sequential execution on the transient thus modified with the next structural construction to be applied to the transient,

the device (2) being further arranged to determine, after the execution of the combiner (12), whether the characteristics of the transient produced define a tree of which all the nodes are linked together and which is devoid of cycles, to return this tree if that is the case, and otherwise to repeat the execution of the rectifier (6), the matcher (8) and the combiner (12) on the last transient produced,

the device (2) being further arranged, when the matcher (8) does not return any structural construction to be applied to the transient, to replace the current transient with the copy of the most recent transient, and to execute the combiner (12) with the first construction from the list of options as a structural construction to be applied to the transient.

2. Device according to claim 1, in which the non-monotonic probabilistic logical inference engine comprises an optimizer using the algorithm of the multipliers with alternating directions.

3. Device according to claim 1 or 2, further comprising a filter (10) arranged to determine, for each structural construction to be applied and the associated list of options, a set of rules, to determine a set of observations from the set of rules and the concept identifier(s) associated with the characteristic to which the structural construction is to be applied, and to apply the non-monotonic probabilistic logical inference engine with the set of rules and the set of observations to determine the structural construction to be applied to the transient and the list of options.

4. Device according to one of the preceding claims, in which the rectifier (6) defines frequency rules from lexical characteristics of the transient, neighborhood rules from concept identifiers of the lexical characteristics of the transient and concept identifiers associated with them in the concept database at a chosen distance, and structural rules taken from semantic attributes of non-lexical characteristics of the transient linking two lexical characteristics.

5. Device according to one of claims 1 to 4, wherein the combiner (12) is arranged to store only the copy of the transient with the list of options associated with the structural construction, and in that, when the matcher (8) does not return any structural construction to be applied to the transient, to replace the current transient with the most recent copy of the transient, and to execute the combiner (12) with the first construction from the list of options as a structural construction to be applied to the transient, then by repeating the application of the rectifier (6), the matcher (8) and the combiner (12) with the resulting transient.

6. Device according to one of claims 1 to 4, in which the combiner (12) is arranged to store the remaining structural constructions to be applied to the transient at the same time as the copy of the transient with the list of options associated with the structural construction, and in that, when the matcher (8) does not return any structural construction to be applied to the transient, to replace the current transient with the most recent copy of the transient, and to execute the combiner (12) with the first construction from the list of options as a structural construction to be applied to the transient as well as the remaining structural constructions, then repeating the application of the rectifier (6), the matcher (8) and the combiner (12) with the resulting transient.

7. Device according to one of the preceding claims, arranged to analyze the returned tree and to produce a semantic graph whose nodes are formed by the lexical characteristics and their concept identifier, and whose links are defined by the semantic attributes of non-lexical characteristics linking two lexical characteristics together.

8. Device according to one of the preceding claims, in which the inference engine is arranged to apply a multivalued logic operator chosen from the group comprising the Łukasiewicz t-norm, the minimum t-norm and the Hamacher product.

9. A computer-implemented automatic text processing method, comprising the following operations:

a) receive text data to be analyzed in the form of tokens each comprising a character string and a unique token identifier, a database of concepts associating character strings and concept identifiers, at least some of the concept identifiers being associated with each other, model lexical construction data and model structural construction data, each comprising one or more conditions of application to a characteristic and one or more conclusions constituting elements to be applied to a characteristic, and an observation database associating at least two concept identifiers, a type of relation and an observation value indicating a probability of veracity of the type of relation between the at least two concept identifiers,

b) initialize a transient that may include lexical and structural characteristics produced by applying model lexical constructions and model structural constructions, with lexical characteristics comprising, for each token, the concept identifier that is associated with the character string of the token and whose frequency is the highest in the database of concepts,

c) work repetitively on the transient by repeating the following successive operations:

c1) determine, for each lexical characteristic of the transient, a list of concept identifiers associated with the concept identifier of this lexical characteristic, determine a set of observations corresponding to the concept identifiers of the lists thus determined, apply a non-monotonic probabilistic logic inference engine to determine the concept identifier of each list such that the observation values associated with these concept identifiers minimize a cost function defined by applying a multivalued logic operator to one or more rules drawn from the content of the characteristics of the transient and instantiated with corresponding observation values, and replace the concept identifiers of the lexical characteristics of the transient with the concept identifiers thus determined,

c2) determine, among the model structural constructions, those whose condition or conditions apply to one or more of the characteristics of the transient, and return the list of structural constructions with the characteristic or characteristics to which their conditions apply,

c3) rank the structural constructions associated with each characteristic by frequency of use, the first being the structural construction to be applied to the transient and the others forming a list of options, and

c4) execute sequentially the selection of a structural construction to be applied to the transient, the storage of a copy of the transient with the list of options associated with the structural construction, and the application of the structural construction to the characteristic or characteristics of the transient with which this structural construction has been associated by the matcher (8), and repeat this sequential execution on the transient thus modified with the next structural construction to be applied to the transient,

c5) determine, after the execution of the combiner (12), whether the characteristics of the resulting transient define a tree in which all the nodes are linked together and which is free of cycles, return this tree if so, and otherwise repeat operations c1) to c5) on the current transient.
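The termination test of operation c5) asks whether the characteristics form a structure in which all nodes are linked and no cycle exists. A minimal sketch of such a check follows, assuming (hypothetically) that the transient has already been reduced to a list of node identifiers and a list of undirected links between them; the function name `is_tree` and this representation are not taken from the application.

```python
from collections import deque

def is_tree(nodes, edges):
    """Return True when the undirected graph (nodes, edges) is connected
    and free of cycles, i.e. when it is a tree."""
    if not nodes or len(edges) != len(nodes) - 1:
        # A tree on n nodes has exactly n - 1 links.
        return False
    neighbours = {n: [] for n in nodes}
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    # Breadth-first search from an arbitrary node to test connectivity.
    start = nodes[0]
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbours[current]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) == len(nodes)

# Hypothetical characteristics: one root linked to two lexical characteristics.
linked = is_tree(["S", "cat", "sleeps"], [("S", "cat"), ("S", "sleeps")])
```

A connected graph with exactly n - 1 links cannot contain a cycle, so the edge count plus one traversal is enough to cover both conditions.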

10. The method of claim 9, wherein the application of the non-monotonic probabilistic logic inference engine comprises the application of an optimizer using the alternating direction method of multipliers (ADMM).
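To recall what the alternating direction method of multipliers does, the sketch below applies its scaled form to a deliberately small toy problem: minimize 0.5 (x - a)^2 + lam |z| subject to x = z. The problem, the function names and the parameter values are assumptions chosen for illustration; the cost function actually minimized by the inference engine is not reproduced here.

```python
def soft_threshold(v, k):
    """Proximal operator of k * |z|."""
    if v > k:
        return v - k
    if v < -k:
        return v + k
    return 0.0

def admm_toy(a, lam, rho=1.0, iterations=200):
    """Scaled-form ADMM for: minimize 0.5*(x - a)**2 + lam*|z|  s.t.  x = z."""
    z = 0.0
    u = 0.0  # scaled dual variable
    for _ in range(iterations):
        # x-update: minimize the quadratic term plus the penalty (rho/2)(x - z + u)^2.
        x = (a + rho * (z - u)) / (1.0 + rho)
        # z-update: proximal step on the absolute-value term.
        z = soft_threshold(x + u, lam / rho)
        # Dual update: accumulate the constraint violation x - z.
        u = u + x - z
    return z

solution = admm_toy(a=3.0, lam=1.0)
```

For this toy problem the exact minimizer is the soft-thresholded value of `a`, which gives a simple way to check that the iterations converge.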

11. The method according to claim 9 or 10, further comprising, between operation c3) and operation c4):

c7) determine, for each structural construction to be applied and the associated list of options, a set of rules, determine a set of observations from the set of rules and the concept identifier(s) associated with the characteristic to which the structural construction to be applied must be applied, and apply the non-monotonic probabilistic logic inference engine with the set of rules and the set of observations to determine the structural construction to be applied to the transient and the list of options.

12. Method according to one of claims 9 to 11, in which operation c1) comprises the definition of frequency rules from lexical characteristics of the transient, neighborhood rules from the concept identifiers of the lexical characteristics of the transient and the concept identifiers associated with them in the concept database at a chosen distance, and structural rules derived from semantic attributes of non-lexical characteristics of the transient linking two lexical characteristics to each other.

13. Method according to one of claims 9 to 12, in which operation c4) stores only the copy of the transient with the list of options associated with the structural construction, and in which, after the execution of operation c1), operations c1) to c7) are repeated on the current transient.

14. Method according to one of claims 9 to 12, in which operation c4) stores the remaining structural constructions to be applied to the transient at the same time as the copy of the transient with the list of options associated with the structural construction, and in which, after the execution of operation c1), operation c5) is applied with the remaining structural constructions, the application of the rectifier (6) is then repeated, and operations c1) to c7) are then repeated on the current transient.

15. Method according to one of claims 9 to 14, further comprising the following operation: