WO2016068690A1 - Method and system for automated semantic parsing from natural language text - Google Patents

Method and system for automated semantic parsing from natural language text Download PDF

Info

Publication number
WO2016068690A1
WO2016068690A1 (PCT/MY2015/050120)
Authority
WO
WIPO (PCT)
Prior art keywords
verb
semantic
subgraph
linguistic
identified
Prior art date
Application number
PCT/MY2015/050120
Other languages
French (fr)
Inventor
Benjamin Chu
Qiang Simon LIU
Dickson Lukose
Original Assignee
Mimos Berhad
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mimos Berhad filed Critical Mimos Berhad
Publication of WO2016068690A1 publication Critical patent/WO2016068690A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/268Morphological analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis


Abstract

The present invention discloses a semantic parsing method for use in natural language processing of an input, the method comprising: performing an entity recognition for extraction of at least one entity (102); performing a coreference resolution to resolve referents (103); and performing a semantic analysis (104) to generate semantic structures. In one embodiment, the semantic analysis (104) comprises: performing a semantic pre-processing (104A) for deriving at least one main root verb for retrieval of at least one corresponding linguistic structure; and performing semantic filtering (104B) for selecting the best linguistic structure and merging of semantic structures to represent the input.

Description

METHOD AND SYSTEM FOR AUTOMATED SEMANTIC PARSING FROM
NATURAL LANGUAGE TEXT
FIELD OF INVENTION
[0001] The present invention generally relates to natural language processing, and more particularly to a method for generating semantic structures independent of any syntax structures.
BACKGROUND OF INVENTION
[0002] Semantics, commonly defined as the implied meaning of a particular subject, is a crucial component in understanding and interpreting subject matter expressed in natural language texts. To understand the meaning of natural language texts, semantic parsing or syntactic analysis is typically performed. In general, semantic parsing involves linguistic-based processing of text and transforming it into a conceptual representation of its meaning. [0003] One of the primary setbacks of semantic parsing and processing is the presence of semantic variations and semantic ambiguities within natural language texts, which, if not interpreted accurately, can lead to the creation of multiple ambiguous meaning representations. At present, the existing methods and techniques have partially evolved but essentially include the use of Syntactic Analysis, and thus rely heavily on the presence of syntax structures. In addition, the syntactic approach entails manipulations which
[0004] Hence, it would be highly desirable to have a method and system that can provide accurate representations of the semantic meaning of a text, independent of any syntax structures. SUMMARY [0005] In one aspect, there is disclosed a semantic parsing method for use in natural language processing of an input; the method comprising: performing an entity recognition for extraction of at least one salient entity; performing a coreference resolution to resolve referents; and performing a semantic analysis to generate semantic structures; wherein the semantic analysis comprises: performing a semantic pre-processing for deriving at least one main root verb for retrieval of at least one corresponding linguistic structure; and performing semantic filtering for selecting the best linguistic structure and merging of semantic structures to represent the input.
[0006] In one embodiment, performing a semantic pre-processing further comprises: extracting at least one token of lexical baseforms from the input and generating a vector list; identifying at least one verb type from the vector list; if the verb is an auxiliary verb type, discarding all auxiliary words, extracting the verb as it is and identifying a least important weight for the verb; if the verb is a lexical verb, transforming the verb into its lexical form, extracting the verb and identifying the least important weight; if the verb is a dynamic or stative verb, transforming the verb into its lexical form, extracting the verb and identifying the more important weight; searching all possible definitions from a linguistic resource and identifying a polysemy count for each verb; and, from all the identified verbs, identifying a maximum weight verb. [0007] In another embodiment, for finite, non-finite, regular, irregular, transitive and intransitive verbs, the method transforms the verb into its lexical form by performing inflection and extracts the verb, whereby the least important weight is identified.
[0008] In a further embodiment, the verb with a maximum weight is selected as the main root verb.
[0009] In yet a further embodiment, in the event that there is a plurality of main verbs identified, the method proceeds with selecting a main verb based on the highest polysemy count. [0010] In yet a further embodiment, in the event that the main root verb is identified, the method further comprises: retrieving all possible candidate linguistic structures from at least one linguistic structure repository based on the main root verb; and performing a semantic graph matching for each of the linguistic structures with the input semantic structure.
[0011] In another embodiment, the semantic filtering further comprises: identifying at least one subgraph attached to each verb identified and selected; checking whether all identified subgraph(s) are processed; if at least one subgraph is not processed, selecting said subgraph and iterating through all concepts from the input; checking whether each concept conforms to a predefined semantic constraint for each of the concepts in the subgraph; if all concepts conform, adding a subgraph count; and merging the concepts and producing at least one new subgraph. [0012] In yet a further embodiment, in the event that the predefined semantic constraints are not met, the method reverts to checking whether all subgraphs have been processed and repeating the preceding steps.
[0013] In another embodiment, the method further comprises consolidating and merging all subgraph counts upon completion of the iteration.
[0014] In yet a further embodiment, the method further comprises: if all subgraphs are processed, selecting a linguistic structure with the highest subgraph match count; and returning a merged semantic structure to represent the input based on the highest match count.
BRIEF DESCRIPTION OF DRAWINGS
[0015] The invention will be better understood by reference to the description below taken in conjunction with the accompanying drawings herein:
[0016] FIG. 1 shows the overall process flow of the method for use in natural language processing in accordance with an embodiment of the present invention; [0017] FIG. 2 shows the process flow for the semantic pre-processing in accordance with an embodiment of the present invention; [0018] FIG. 3 shows the process flow for semantic filtering in accordance with an embodiment of the present invention; [0019] FIG. 4 shows the process of matching and merging of the concepts from the input to the linguistic structure in accordance with an embodiment of the present invention.
DETAILED DESCRIPTION
[0020] In line with the above summary, the following description of a number of specific and alternative embodiments is provided to aid understanding of the inventive features of the present invention. It shall be apparent to one skilled in the art, however, that this invention may be practiced without such specific details. Some of the details may not be described at length so as not to obscure the invention. For ease of reference, common reference numerals will be used throughout the figures when referring to the same or similar features common to the figures.
[0021] The present invention provides a method for generating semantic structures to represent the meaning of natural language texts without relying on any form of syntax structures. In one embodiment, the present invention resolves issues associated with complex syntactic structures, whereby the present invention entirely eliminates the use of syntactic analysis. In a further embodiment, the present invention utilizes at least one set of linguistic resources, a knowledge base and a series of semantic parsing processes to automatically generate semantic structures.
[0022] FIG. 1 depicts the overall process of the method for generating semantic structures in accordance with an embodiment of the present invention. The process starts and proceeds to 101, during which an input, which can be in the form of a sentence containing text, is subjected to a named entity recognition (NER) process at step 102 to extract salient entities. During the entity recognition process, information and input for extraction can be obtained from a knowledge base 60. Similarly, with the aid of the knowledge base 60, and upon completion of the entity recognition step with at least one entity successfully extracted, the process proceeds to 103 for coreference resolution to resolve referents based on corresponding noun antecedents and precedents. The semantic analysis process follows at 104, whereby the semantic analysis process comprises a semantic pre-processing 104A and then a semantic filtering 104B. Upon completion of the semantic analysis, the output from the semantic analysis, which includes semantic structures, is generated at 105, and the overall process ends at 106. Accordingly, the linked data in relation to the entity recognition and coreference resolution processes are stored at 80.
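For illustration only, the overall flow of FIG. 1 can be summarised as a short Python sketch. The function names, the toy heuristics and the knowledge_base parameter are assumptions introduced here for readability; the patent does not disclose component implementations at this level of detail.

    def recognize_entities(sentence, knowledge_base=None):
        # Step 102: extract salient entities, optionally aided by knowledge base 60.
        return [tok for tok in sentence.split() if tok.istitle()]  # toy heuristic

    def resolve_coreferences(sentence, entities):
        # Step 103: resolve referents against noun antecedents; this toy version
        # links any pronoun found in the sentence to the first extracted entity.
        pronouns = {"he", "she", "him", "her", "they"}
        return {tok: entities[0]
                for tok in sentence.lower().split()
                if tok in pronouns and entities}

    def semantic_analysis(sentence, entities, referents):
        # Step 104: semantic pre-processing (104A) and semantic filtering (104B);
        # sketches of these two stages are given later in this description.
        return {"input": sentence, "entities": entities, "referents": referents}

    def parse(sentence, knowledge_base=None):
        entities = recognize_entities(sentence, knowledge_base)   # step 102
        referents = resolve_coreferences(sentence, entities)      # step 103
        return semantic_analysis(sentence, entities, referents)   # steps 104-105

    print(parse("John bought Mary a Ferrari"))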
[0023] In accordance with an embodiment of the present invention, the semantic preprocessing 104A is performed primarily for generating and deriving the main root verb from the sentence, in order to retrieve all corresponding linguistic structures. The semantic filtering 104B then aids in selecting the best linguistic structure that can be applied to return the merged semantic structure and thus to represent the sentence.
[0024] The semantic pre-processing 104A of the semantic analysis 104 will now be described with reference to FIG. 2 in accordance with an embodiment of the present invention. Upon initiation, the process proceeds to extract all tokens of lexical baseforms from the text, which are incorporated into a vector list at 200. From the vector list, at least one verb type is identified at 201. During the identification of at least one verb, there can be three types of verbs to be identified; these are, but are not limited to: auxiliary verbs, lexical verbs and dynamic or stative verbs. In the event that the verb is an auxiliary verb at 202, all auxiliary words are discarded and the verb is extracted as it is, whereby the least important weight within the extracted verbs is identified and may be assigned as We at 203. In the event that the identified verb is a lexical verb at 204, the verb is transformed into its lexical form by performing inflection and extracting the verb, whereby the least important weight is identified and may be assigned as Wb at 205. In the event that a dynamic or stative verb is identified at 206, the verb is transformed into its lexical form by performing inflection and extracting the verb, whereby a more important weight is identified and may be assigned as Wa at 207. In the event that a finite, non-finite, regular, irregular, transitive or intransitive verb is identified, the verb is transformed into its lexical form by performing inflection and extracting the verb, whereby the least important weight Wc is identified at 208. [0025] In one embodiment, for each of the verbs identified, all possible definitions are searched and retrieved from the linguistic resource or repository 70 and a polysemy count Pi is assigned at 209. From the assigned weights We, Wb, Wa and Wc, the verb with the higher or maximum weight is selected at 210, whereby the selected verb is considered as the main root verb. In the event that there is more than one main verb within the sentence at 211, the verb with the higher polysemy count Pi is selected at 212. With the selected main root verb, all possible candidate linguistic structures from the repository are retrieved at 213.
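A minimal Python sketch of the verb-weighting and main-root-verb selection described above (steps 201 to 212) could look as follows. The verb-type lexicon, the numeric weight values for We, Wb, Wa and Wc and the polysemy counts are assumptions made for the example sentence; the description ranks weights only as "least" and "more" important and does not fix concrete values.

    VERB_TYPES = {        # hypothetical lexicon keyed by lexical base form
        "be": "auxiliary",
        "have": "auxiliary",
        "seem": "stative",
        "buy": "dynamic",
    }
    WEIGHTS = {           # toy values: We (203), Wb (205), Wa (207), Wc (208)
        "auxiliary": 1, "lexical": 1, "other": 1,   # least important weights
        "dynamic": 3, "stative": 3,                 # more important weight Wa
    }
    POLYSEMY = {"be": 13, "have": 19, "buy": 5}     # stand-in for linguistic resource 70

    def main_root_verb(base_forms):
        candidates = []
        for verb in base_forms:
            vtype = VERB_TYPES.get(verb)
            if vtype is None:
                continue                                        # not a verb in this toy lexicon
            weight = WEIGHTS.get(vtype, WEIGHTS["other"])       # steps 202-208
            polysemy = POLYSEMY.get(verb, 0)                    # step 209: Pi
            candidates.append((weight, polysemy, verb))
        if not candidates:
            return None
        # Step 210: take the maximum-weight verb; step 212: break ties between
        # several main verbs using the higher polysemy count.
        return max(candidates)[2]

    print(main_root_verb(["john", "buy", "mary", "ferrari"]))   # -> "buy"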
Upon retrieving all linguistic structures from the repository, semantic graph matching is performed for each of the linguistic structures with the sentence or input semantic structure at 214. [0026] The overall process then continues to step 104B, the semantic filtering step.
The semantic filtering step 104B will now be described in accordance with an embodiment of the present invention with reference to FIG. 3. Following from the semantic pre-processing step, all subgraphs that are attached to the verb in each of the candidate linguistic structures are accordingly identified at 300. Next, all subgraphs are checked to determine whether they have all been processed at 301; if a NO response is received, the next subgraph is selected at 302. Then, for each identified and selected subgraph, iteration is performed through all the concepts from the sentence at 303. Each concept is checked to determine whether it conforms to a predefined semantic constraint for each of the concepts in the subgraph at 304. If a NO response is received at 304 when determining whether the constraints are met, the process reverts to checking whether all the subgraphs have been processed at 301. Steps 302 onwards may then be repeated. In the event that a YES response is received when checking whether all semantic constraints are satisfied at 304, the subgraph count is incremented, for instance SGi = 1, at 305. The concepts are then merged and at least one new subgraph is produced at 306. Steps 302 to 306 may then be repeated subject to the number of subgraphs identified. In one embodiment, all subgraph counts are consolidated and merged thereafter upon completion of the iteration at 306A.
[0027] In one embodiment, in the event that a YES response is received when checking whether all subgraphs are processed at 301, the linguistic structure with the highest subgraph match count is selected at 301A from the consolidated and merged subgraphs, and the merged graph, being the finalized and merged semantic structure 90, is returned to represent the input at 301B, thus ending the process at 307.
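As an illustrative sketch of the filtering loop of FIG. 3, the following Python fragment treats each candidate linguistic structure as a list of subgraphs written as (verb, relation, concept constraint) triples, and assumes the input supplies, for each instance, the set of concepts it conforms to. These data structures and the conforms() test are assumptions made for the worked example only, not a prescribed implementation.

    def conforms(instance_concepts, constraint):
        # Step 304: semantic constraint check, e.g. "John" conforms to "animate".
        return constraint in instance_concepts

    def match_and_merge(structure, instances):
        count, merged, used = 0, [], set()
        for verb, relation, constraint in structure:              # steps 301/302
            for instance, concepts in instances.items():          # step 303
                if instance not in used and conforms(concepts, constraint):
                    count += 1                                     # step 305: SGi
                    merged.append((verb, relation, instance))      # step 306
                    used.add(instance)
                    break
        return count, merged

    def best_structure(structures, instances):
        # Steps 306A/301A/301B: consolidate counts and keep the structure with
        # the highest subgraph match count, returning its merged structure.
        return max((match_and_merge(s, instances) for s in structures),
                   key=lambda result: result[0])

    structures = [
        [("buy", "agnt", "animate"), ("buy", "thme", "entity")],              # #1
        [("buy", "agnt", "animate"), ("buy", "thme", "entity"),
         ("buy", "benf", "animate")],                                         # #2
    ]
    instances = {"John": {"male-person", "person", "animate", "entity"},
                 "Ferrari": {"car", "entity"},
                 "Mary": {"female-person", "person", "animate", "entity"}}

    count, merged = best_structure(structures, instances)
    print(count, merged)   # 3 subgraphs matched, i.e. structure #2 is selected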
[0028] Referring back to FIG. 1, the finalized semantic structure 90 that suitably and accurately represents the input is generated and the overall method ends.
[0029] An example of a sentence subjected to the semantic pre-processing and filtering steps in accordance with an embodiment of the present invention is shown as EXAMPLE 1 below: EXAMPLE 1 [0030] Sentence: John bought Mary a Ferrari
[0031] During the pre-processing step, tokenized lexical baseforms can be extracted; these include "John", "buy", "Mary" and "Ferrari", where prepositions or stopwords like the word "a" will be excluded from the semantic analysis.
[0032] The method then identifies the main root verb, which for this example is "buy"; this verb is a dynamic verb and hence the more important weight, Wa, is assigned. [0033] All possible linguistic structures for the identified main root verb are extracted. A linguistic structure defines how a verb is used in a certain way, or it is a structure pre-defined with fixed attachments (relations). The matching and merging process of the concepts from the sentence or input to the linguistic structure is simplified in a linear form as shown in FIG. 4.
[0034] Linguistic Structure #1: [animate]<-(agnt)<-[buy]->(thme)->[entity]
[0035] For linguistic structure #1, the structure can be explained as follows: the agent of the action "buy" is an animate being (e.g. a person), and the theme of the action refers to an entity.
[0036] Linguistic Structure #2: [animate]<-(agnt)<-[buy]-{(thme)->[entity];
(benf)->[animate];} [0037] For linguistic structure #2, the structure can be explained in a similar way to linguistic structure #1, whereby the agent of the action "buy" is an animate being, the theme of the action refers to an entity and, in addition, the beneficiary of the action refers to an animate being (e.g. a person).
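One possible encoding of these example structures is a repository keyed by the main root verb, with each structure stored as a list of (verb, relation, concept) triples. This encoding is an assumption for illustration; the patent does not prescribe a storage format for the linguistic structure repository consulted at step 213.

    LINGUISTIC_STRUCTURES = {
        "buy": [
            # Linguistic Structure #1: [animate]<-(agnt)<-[buy]->(thme)->[entity]
            [("buy", "agnt", "animate"), ("buy", "thme", "entity")],
            # Linguistic Structure #2 adds the beneficiary relation
            [("buy", "agnt", "animate"), ("buy", "thme", "entity"),
             ("buy", "benf", "animate")],
        ],
    }

    def candidate_structures(main_root_verb):
        # Step 213: retrieve all candidate structures for the selected main root verb.
        return LINGUISTIC_STRUCTURES.get(main_root_verb, [])

    for i, structure in enumerate(candidate_structures("buy"), start=1):
        print(f"Structure #{i} has {len(structure)} subgraph(s)")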
[0038] In this example, there can be many possible linguistic structures which can be extracted from the linguistic resources for a particular verb. For instance, in structure #2, there are 3 subgraphs to be identified from the linguistic structure. [0039] In the iteration step, the method then iterates through all the concepts/instances from the sentence and performs matching against the concept nodes for each of the subgraphs (the subgraphs are determined from the linguistic structure before the matching process). For instance, in this example, the iteration can be in the following form: [0040] Subgraph #1: [animate]<-(agnt)<-[buy]
Subgraph #2: [buy] -> (thme) -> [entity]
Subgraph #3: [buy] -> (benf) -> [animate]
[0041] Next, the method proceeds to iterate through all concepts/instances and performs a semantic constraint check during the matching process. For instance, for "John", the method checks whether this conforms to the first concept "animate" in the linguistic structure. From the knowledge base hierarchy, it can be known that John is an instance of the "male person" concept, whereby the concept "person" is of a lower order than (i.e. is subsumed by) the "animate" concept. Upon completion of the conformity check, the instance "John" can be matched to the first node of Subgraph #1.
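A small sketch of this conformity check against the knowledge base hierarchy follows; the hierarchy fragment and instance assignments are hypothetical stand-ins for knowledge base 60.

    HIERARCHY = {            # child -> parent, hypothetical fragment of knowledge base 60
        "male-person": "person",
        "female-person": "person",
        "person": "animate",
        "animate": "entity",
        "car": "entity",
    }
    INSTANCE_OF = {"John": "male-person", "Mary": "female-person", "Ferrari": "car"}

    def subsumed_by(concept, constraint):
        # Walk up the hierarchy until the constraint (or the root) is reached.
        while concept is not None:
            if concept == constraint:
                return True
            concept = HIERARCHY.get(concept)
        return False

    def conforms(instance, constraint):
        return subsumed_by(INSTANCE_OF.get(instance), constraint)

    print(conforms("John", "animate"))     # True  -> fills [animate]<-(agnt)<-[buy]
    print(conforms("Ferrari", "animate"))  # False -> cannot fill the agent slot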
[0042] Proceeding from the above, a new semantic structure (a subgraph) is produced: [male-person: "John"]<-(agnt)<-[buy]. The overall method continues until all concepts are iterated and the semantic constraints are checked. Upon completion of the second iteration, another semantic structure can be produced, such as: [buy]->(thme)->[car: "Ferrari"].
[0043] Upon completion of the overall method, there can be an instance where the total number of subgraphs matched for one linguistic structure is higher than for the other linguistic structures. For instance, in the event that Linguistic Structure #1 as discussed in the preceding paragraph is selected, the total number of subgraphs matched is two, whereas for Linguistic Structure #2 the total number of matched subgraphs is three. Accordingly, checking against all possible linguistic structures is required so as to determine which structure has the highest number of matched subgraphs. Next, based on the best linguistic structure selected, a final and merged semantic structure can be produced, as per below: Subgraph #1:
[male-person: "John"] -> (agnt)<-[buy]
Subgraph#2:
[buy]->(thme)->[car:"Ferrari"]
Subgraph#3:
[buy]->(benf)->[female-person:"Mary"]
[0044] From the above, the three subgraphs can be merged to produce a final semantic structure as shown below: Merged Semantic Structure:
[male-person: "John"]<-(agnt)<-[buy]-{->(thme)->[car. "Ferrari"];
->(benf)->[female-person: "Mary"]}.
[0045] Accordingly, in EXAMPLE 1, the final merged semantic structure produced represents the meaning of the text that "John bought a Ferrari for Mary" without having to deal with the complexities of syntactic analysis or analysing the syntax structure. Perceptibly, the different possible variations of meanings and representations of a sentence, which can eventually cause ambiguities, can be avoided with the use of the method of the present invention.
[0046] As would be apparent to a person having ordinary skill in the art, the afore-described methods may be provided in many variations, modifications or alternatives to existing methods and systems. The principles and concepts disclosed herein may also be implemented in various manners which may not have been specifically described herein but which are to be understood as encompassed within the scope of the following claims.

Claims

1. A semantic parsing method for use in natural language processing of an input; the method comprising: performing an entity recognition for extraction of at least one salient entity (102);
performing a coreference resolution to resolve referents (103); and performing a semantic analysis (104) to generate semantic structures; wherein the semantic analysis (104) comprises: performing a semantic pre-processing (104A) for deriving at least one main root verb for retrieval of at least one corresponding linguistic structure; and performing semantic filtering (104B) for selecting the best linguistic structure and merging of semantic structure to represent the input.
2. The semantic parsing method as claimed in Claim 1 wherein performing a semantic pre-processing (104A) further comprises: extracting at least one token of lexical baseforms from the input and generating a vector list (200);
identifying at least one verb type from the vector list (201);
if the verb is an auxiliary verb type (202), discarding all auxiliary words, extracting the verb as it is and identifying a least important weight verb (203); if the verb is a lexical verb (204), transforming the verb into a lexical form, extracting the verb and identifying the least important weight (205); if the verb is a dynamic or stative verb (206), transforming the verb into its lexical form, extracting the verb and identifying the more important weight (207); searching all possible definitions from a linguistic resource and identifying a polysemy count for each verb (209); and
identifying a maximum weight verb (210) from all the identified verbs.
3. The method as claimed in Claim 2, wherein in the event that the verb is identified as one of the following: finite, non-finite, regular, irregular, transitive and intransitive verbs; the method transforms the verb into its lexical form by performing inflection and extracts the verb; whereby the least important weight is identified (208).
4. The method as claimed in Claim 2 wherein the verb with a maximum weight is selected as the main root verb.
5. The method as claimed in Claim 2 wherein in the event that there is a plurality of main verbs identified (211); the method proceeds with selecting a main verb based on the highest polysemy count (212).
6. The method as claimed in Claim 2 wherein in the event that the main root verb is identified, the method further comprises: retrieving all possible candidate linguistic structures from at least one linguistic structure repository (213) based on the main root verb; and performing a semantic graph matching for each of the linguistic structures with the input semantic structure (214).
7. The method as claimed in Claim 1 wherein the semantic filtering (104B) further comprises: identifying at least one subgraph attached to each verb identified and selected (300);
checking whether all identified subgraph(s) are processed (301);
if at least one subgraph is not processed, selecting said subgraph (302) and iterating through all concepts from the input (303);
checking if each concept is conformed to a predefined semantic constraint to each of the concepts in the subgraph (304); if all concepts are conformed, adding a subgraph count (305); and merging the concepts and producing at least one new subgraph (306).
8. The method as claimed in Claim 7, wherein in the event that the predefined semantic constraints are not met, the method reverts to checking whether all subgraphs have been processed and repeating steps (302) to (306).
9. The method as claimed in Claim 7, the method further comprising consolidating and merging all subgraph counts upon completion of iteration (306A).
10. The method as claimed in Claim 7, wherein the method further comprises: if all subgraphs are processed, selecting a linguistic structure with the highest subgraph match count (301A); and
returning a merged semantic structure to represent the input based on the highest match count (301B).
PCT/MY2015/050120 2014-10-27 2015-10-12 Method and system for automated semantic parsing from natural language text WO2016068690A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
MY2014003046 2014-10-27
MY2014003046 2014-10-27

Publications (1)

Publication Number Publication Date
WO2016068690A1 (en) 2016-05-06

Family

ID=55857891

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/MY2015/050120 WO2016068690A1 (en) 2014-10-27 2015-10-12 Method and system for automated semantic parsing from natural language text

Country Status (1)

Country Link
WO (1) WO2016068690A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106599305A (en) * 2016-12-29 2017-04-26 中南大学 Crowdsourcing-based heterogeneous media semantic meaning fusion method
CN110717017A (en) * 2019-10-17 2020-01-21 腾讯科技(深圳)有限公司 Method for processing corpus
CN112836499A (en) * 2019-11-23 2021-05-25 中国科学院长春光学精密机械与物理研究所 Method for constructing PCB fault diagnosis rule base, electronic equipment and storage medium
US11657229B2 (en) 2020-05-19 2023-05-23 International Business Machines Corporation Using a joint distributional semantic system to correct redundant semantic verb frames

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH01113869A (en) * 1987-10-28 1989-05-02 Hitachi Ltd Japanese sentence analyzing system
JPH0869466A (en) * 1994-08-30 1996-03-12 Sumitomo Electric Ind Ltd Natural language analyzing device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH01113869A (en) * 1987-10-28 1989-05-02 Hitachi Ltd Japanese sentence analyzing system
JPH0869466A (en) * 1994-08-30 1996-03-12 Sumitomo Electric Ind Ltd Natural language analyzing device

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106599305A (en) * 2016-12-29 2017-04-26 中南大学 Crowdsourcing-based heterogeneous media semantic meaning fusion method
CN106599305B (en) * 2016-12-29 2020-03-31 中南大学 Crowdsourcing-based heterogeneous media semantic fusion method
CN110717017A (en) * 2019-10-17 2020-01-21 腾讯科技(深圳)有限公司 Method for processing corpus
CN110717017B (en) * 2019-10-17 2022-04-19 腾讯科技(深圳)有限公司 Method for processing corpus
CN112836499A (en) * 2019-11-23 2021-05-25 中国科学院长春光学精密机械与物理研究所 Method for constructing PCB fault diagnosis rule base, electronic equipment and storage medium
CN112836499B (en) * 2019-11-23 2022-11-22 中国科学院长春光学精密机械与物理研究所 Method for constructing PCB fault diagnosis rule base, electronic equipment and storage medium
US11657229B2 (en) 2020-05-19 2023-05-23 International Business Machines Corporation Using a joint distributional semantic system to correct redundant semantic verb frames

Similar Documents

Publication Publication Date Title
EP2664997B1 (en) System and method for resolving named entity coreference
CN106528532B (en) Text error correction method, device and terminal
US10810372B2 (en) Antecedent determining method and apparatus
US11544459B2 (en) Method and apparatus for determining feature words and server
US20040148154A1 (en) System for using statistical classifiers for spoken language understanding
RU2610241C2 (en) Method and system for text synthesis based on information extracted as rdf-graph using templates
US20040148170A1 (en) Statistical classifiers for spoken language understanding and command/control scenarios
de Araújo et al. Re-bert: automatic extraction of software requirements from app reviews using bert language model
CN111444330A (en) Method, device and equipment for extracting short text keywords and storage medium
CN106570180A (en) Artificial intelligence based voice searching method and device
US10223349B2 (en) Inducing and applying a subject-targeted context free grammar
WO2016068690A1 (en) Method and system for automated semantic parsing from natural language text
Jayan et al. A hybrid statistical approach for named entity recognition for malayalam language
CN107526721A (en) A kind of disambiguation method and device to electric business product review vocabulary
CN111723192B (en) Code recommendation method and device
Dunn Frequency vs. association for constraint selection in usage-based construction grammar
CN107480197B (en) Entity word recognition method and device
KR102026967B1 (en) Language Correction Apparatus and Method based on n-gram data and linguistic analysis
Yuwana et al. On part of speech tagger for Indonesian language
CN111858894A (en) Semantic missing recognition method and device, electronic equipment and storage medium
KR102567896B1 (en) Apparatus and method for religious sentiment analysis using deep learning
Fahrni et al. HITS'Monolingual and Cross-lingual Entity Linking System at TAC 2012: A Joint Approach.
CN111814025A (en) Viewpoint extraction method and device
KR100574887B1 (en) Apparatus And Method For Word Sense Disambiguation In Machine Translation System
Cattle et al. Srhr at semeval-2017 task 6: Word associations for humour recognition

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15854036

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15854036

Country of ref document: EP

Kind code of ref document: A1