CN115186671A - Method for mapping noun phrases to descriptive logic concepts based on extension - Google Patents

Method for mapping noun phrases to descriptive logic concepts based on extension

Info

Publication number
CN115186671A
Authority
CN
China
Prior art keywords
concept
word
noun
noun phrases
concepts
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210530158.4A
Other languages
Chinese (zh)
Inventor
瞿裕忠
宋鼎
丁文韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University
Priority to CN202210530158.4A
Publication of CN115186671A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/268 Morphological analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/027 Frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

A method for mapping noun phrases to description logic concepts based on extension. The method first enumerates all text segments of a noun phrase and generates a mapping table from the text segments to resources in a knowledge base; it then generates a parsing order from the word segmentation, part-of-speech tags and syntax tree of the noun phrase; finally, following the parsing order and starting from the EL++ top concept ⊤, it repeatedly refines the current concept with the basic concepts generated from the indexed resources until all words have been parsed, yielding the description logic concept to which the noun phrase is mapped. By analysing the syntax tree, the invention can automatically process complex noun phrases with implicit relations and generate high-quality description logic concepts.

Description

Method for mapping noun phrases to descriptive logic concepts based on extension
Technical Field
The invention belongs to the field of computer technology, relates to natural language processing and knowledge graph technology, and is a method for mapping noun phrases to description logic concepts based on extension.
Background
Having computers understand natural language has always been a goal pursued by researchers in natural language processing. The semantic parsing task aims to convert natural language text into a meaning representation language that a computer can understand, and it is one of the hardest problems in the field. Because natural language is complex and ambiguous, the task has attracted many researchers since it was introduced. The rise of knowledge graphs has made the work of connecting natural language with knowledge graphs even more significant.
In natural language, a noun phrase is a phrase whose grammatical function is equivalent to that of a noun. Noun phrases appear widely in all kinds of corpora, so understanding them is important, and a good noun phrase parser can also serve as a component of other natural language processing work. At present, however, semantic parsing work, and the knowledge base question answering (KBQA) work built on it, usually takes sentences or passages as the unit of study, and little research targets noun phrases specifically. Relation information in noun phrases is often implicit. For example, the semantics of "American songwriters" is "songwriters who were born in the United States" or "songwriters whose citizenship is the United States". Humans understand this easily, but a computer cannot obtain nationality or birthplace information directly from the phrase text. To save the labor of annotating training data, some work uses the extension as training data for weakly supervised learning. Extension is defined in contrast to intension: the extension of a phrase consists of the things the phrase applies to. For the question answering task, the extension is the set of answer entities of the question. Other work uses the extension to supplement the training data, with statistical indicators based on extension information serving as training features that help determine implicit relations during training. However, semantic parsing based on supervised or weakly supervised learning needs a training data set of a certain size to train a model, and no authoritative, public supervised-learning data set aimed specifically at noun phrases has yet emerged.
How to understand phrases specifically with a more lightweight approach deserves discussion and research.
On the other hand, some related work has emerged on the task of mapping noun phrases to knowledge graphs using extension. This work studies Wikipedia categories: because the set of entities that a category describes is readily available, statistical indicators can be used to find characteristics shared by those entities. Cat2Ax extracts matching patterns from the hierarchical structure of Wikipedia categories, selects the axiom with the highest score, which combines statistical indicators with a lexical score, and then generates new triples to complete the knowledge base. Pasca et al. treat a complex noun phrase as the combination of a head type and modifiers: they first identify the head of the phrase, then divide the rest of the phrase into several modifiers, and, with the interpretation of the head known, select the interpretations of the modifiers by statistical indicators. In general, existing methods regard a noun phrase as a combination of modifiers and simply concatenate their separate interpretations, so they cannot process complex noun phrases that contain nested relations.
Since noun phrases can be complex, a semantic representation with stronger expressive power is needed to describe them. Description logic mainly describes the concepts and attributes of an ontology, provides a convenient form of expression for building knowledge graphs, and is widely used in ontology reasoning. The description logic language EL++ has polynomial-time reasoning complexity, so it is lightweight while retaining good expressive power. The EL++ logical form can be defined recursively as:
$$C ::= \top \mid A \mid \{o\} \mid C_1 \sqcap C_2 \mid \exists r.C$$
where ⊤ is the top concept; A denotes an atomic concept, i.e. a concept name, such as Film; r denotes an atomic role, i.e. a role name, such as basedOn; o is the name of an individual, such as Alice Munro; C_1 and C_2 are general concepts. That is, in EL++ a concept C is generated from atomic concepts A and atomic roles r using conjunction (⊓) and existential restriction (∃) as constructors. For ease of understanding, concepts in the description logic EL++ are referred to below as description logic concepts.
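To make the recursive definition concrete, the following is a minimal Python sketch of the concept constructors named above (top concept, atomic concept, individual, conjunction, existential restriction). The class names and the `author` role are illustrative assumptions, not part of the patent text; `Film`, `Work`, `basedOn` and `Alice Munro` are names the document itself uses.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Top:
    """The top concept."""
    def __str__(self) -> str:
        return "⊤"


@dataclass(frozen=True)
class Atomic:
    """An atomic concept (concept name), e.g. Film."""
    name: str
    def __str__(self) -> str:
        return self.name


@dataclass(frozen=True)
class Nominal:
    """A concept built from one individual name, written {o}."""
    individual: str
    def __str__(self) -> str:
        return "{" + self.individual + "}"


@dataclass(frozen=True)
class And:
    """Conjunction of two concepts."""
    left: "Concept"
    right: "Concept"
    def __str__(self) -> str:
        return f"({self.left} ⊓ {self.right})"


@dataclass(frozen=True)
class Exists:
    """Existential restriction over an atomic role."""
    role: str
    filler: "Concept"
    def __str__(self) -> str:
        return f"∃{self.role}.{self.filler}"


Concept = Union[Top, Atomic, Nominal, And, Exists]

# Hypothetical concept for a phrase like "Films based on works by Alice Munro".
# The role name "author" is an assumption made for illustration only.
example = And(
    Atomic("Film"),
    Exists("basedOn", And(Atomic("Work"), Exists("author", Nominal("Alice_Munro")))),
)
print(example)
```

Frozen dataclasses are used so that concepts are hashable and can be compared or stored in sets, which is convenient when many candidate concepts are generated and de-duplicated.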
In summary, an efficient and effective method that uses extension to map phrases to logical forms over a specific knowledge graph is of great significance.
Disclosure of Invention
The problem to be solved by the invention is as follows: existing semantic parsing work needs a large amount of training data, and because extensions and data sets are lacking, it parses noun phrases poorly at prediction time; in addition, existing methods that map noun phrases to a knowledge graph using extension cannot process complex noun phrases with nested relations. The invention aims to provide a method for understanding noun phrases quickly and comprehensively through extension, specifically a method for mapping noun phrases to EL++ description logic concepts.
The technical scheme of the invention is as follows: a method for mapping noun phrases to description logic concepts based on extension, in which a noun phrase is mapped, by means of its extension, to a concept expressed in the description logic language EL++, generating an understanding of the noun phrase on a given knowledge base, comprising the following steps:
Step 1: perform word segmentation and lemmatization on the noun phrase; enumerate all text segments T over the segmented word sequence, i.e. all n-grams of the noun phrase, together with the corresponding lemmatized text segments T_lemma; index the text segments T and T_lemma to the resources of the knowledge base, and generate a mapping table from text segments to resources in the knowledge base;
Step 2: perform part-of-speech tagging on the segmented words of the noun phrase and generate a syntax tree; recursively traverse the whole tree from the root, and take the order in which the leaf nodes, i.e. the individual words, are visited as the parsing order;
Step 3: following the parsing order and starting from the EL++ top concept ⊤, repeatedly refine the current concept with the basic concepts generated from the indexed resources, parsing each parsable word in turn, until all words have been parsed and the description logic concept to which the noun phrase is mapped is obtained:
Step 3.1: for the current parsable word, list all candidate text segments that contain it;
Step 3.2: according to the mapping table obtained in step 1, index the candidate text segments to their resources, and generate candidate refinement operations from those resources;
Step 3.3: perform consistency screening on the newly generated candidate refinement operations, filtering out those inconsistent with the syntax;
Step 3.4: use the refinement operations obtained in step 3.3 to generate refined description logic concepts for the current parsable word, score the resulting concepts and keep the top k by score; then check whether parsing is complete, i.e. whether the current parsable word is the last one in the parsing order; if not, return to step 3.1 to parse the next parsable word; if so, go to step 3.5;
The scoring function of a description logic concept is:
$$S_{score}(NP, C) = w_{sup} \cdot S_{sup}(NP, C) + w_{match} \cdot S_{match}(NP, C) + w_{sim} \cdot S_{sim}(NP, C)$$
where S_sup is the support score, S_match is the matching degree score, S_sim is the simplicity score, and w_sup, w_match, w_sim are the corresponding weights.
The support score S_sup of a description logic concept is defined as the mean of the smoothed support degrees of the support sets of the refinement operations performed while generating the concept. Given a noun phrase NP and a refinement operation d: NP^I is the set of entities described by the noun phrase, i.e. the extension of the phrase; for a concept C, C^I is the set of entities described by C; for a basic concept B, B^I is the set of entities described by B; and the refinement operation d modifies, for the concept C, a part A of C with the basic concept B. The support set Set_sup(NP, d) is computed as:
[Formula image not recovered: definition of Set_sup(NP, d), with two cases]
where the first case applies when the part A modified by B describes the extension NP^I itself, and the second case applies when the part A modified by B describes a set of entities that are related to the extension.
S_sup is computed by the following formula, where d ranges over the refinement operations that generate the concept C and the support degree is that of the support set Set_sup(NP, d):
[Formula image not recovered: S_sup(NP, C)]
S_match is defined as the proportion of words in the noun phrase NP that can be matched by the concept C:
$$S_{match}(NP, C) = \frac{|\{w \in NP \mid w \text{ is matched by } C\}|}{|NP|}$$
S_sim is defined as the negative of the number of refinement operations in the concept:
$$S_{sim}(C) = -|\{d \mid d \in C\}|$$
Step 3.5: among the description logic concepts obtained after parsing all words in the parsing order, keep the one with the highest score as the output C_best, i.e. the description logic concept to which the noun phrase is mapped, which is used for semantic understanding of the noun phrase over the knowledge base.
Compared with the prior art, the invention has the beneficial effects that:
(1) Few existing semantic parsing methods use a syntax tree. Some jointly train on the syntax tree and the semantic parsing result, or constrain the decoding process of a generative model with syntactic information; these are mainly supervised and semi-supervised learning methods that depend on a training data set, and their parsing results are unsatisfactory. In the absence of a training data set for noun phrases, the invention uses the extension of a noun phrase to automatically process complex noun phrases containing implicit relations, realizing an unsupervised, lightweight algorithm;
(2) Existing methods that use extension do not consider relation analysis for complex noun phrases. The core reason lies in their purpose: existing unsupervised methods that use extension basically aim to extract triples to supplement a knowledge base, so they do not consider complex phrases. The invention aims to understand noun phrases, in particular complex noun phrases with nested relations, using the resources of a given knowledge base. It uses the grammatical information of the noun phrase, improves the quality and efficiency of the generated description logic concepts by constraining grammatical consistency and the parsing order, and is able to process noun phrases with nested relations;
(3) The invention selects high-quality description logic concepts by multidimensional scoring that combines extension-based statistical indicators with index matching indicators, which improves the accuracy of mapping noun phrases to logical language concepts. Using randomly sampled, manually annotated and interpreted high-quality Wikipedia categories as the data set, divided five-fold into a validation set and a test set, the results are as follows: the generated EL++ description logic concepts have a complete match rate of 0.53 and a partial match rate of 0.71.
Drawings
FIG. 1 is a schematic flow diagram of the method of the present invention.
Detailed Description
The invention relates to a method for mapping noun phrases to EL++ description logic concepts through extension: a noun phrase is mapped, by means of its extension, to a concept expressed in the description logic language EL++, generating an understanding of the noun phrase on the given knowledge base DBpedia, so that a computer can better understand noun phrases.
Step 1: use a natural language processing tool to perform word segmentation and lemmatization on the noun phrase, and enumerate all text segments T over the segmented word sequence. A text segment is a run of consecutive words in the noun phrase, i.e. an n-gram for any n. The lemmatized text segment T_lemma is needed because the keys used when the index dictionary was built, i.e. the text aliases, may be the base forms of words. For example, "French" does not index to the entity dbr:France, whereas the base-form segment "France", i.e. the text segment after lemmatization, does. The text segments T and the corresponding lemmatized text segments T_lemma are indexed to the resources of the given knowledge base, and a mapping table from text segments to resources in the knowledge base is generated. Resources include entities, literals, attributes and types.
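Step 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the helper names `text_segments` and `build_mapping`, the lower-casing of lookup keys, and the three-entry toy index are all hypothetical; the patent's actual index dictionary is built offline from DBpedia and is not reproduced here.

```python
def text_segments(words):
    """Enumerate every run of consecutive words as (start, end, text)."""
    n = len(words)
    return [(i, j, " ".join(words[i:j]))
            for i in range(n)
            for j in range(i + 1, n + 1)]


def build_mapping(words, lemmas, index):
    """Map each text segment to the resources found for it or for its lemmatized form."""
    mapping = {}
    for i, j, text in text_segments(words):
        lemma_text = " ".join(lemmas[i:j])
        # Assumption: index keys are lower-cased natural language aliases.
        resources = set(index.get(text.lower(), ())) | set(index.get(lemma_text.lower(), ()))
        if resources:
            mapping[text] = resources
    return mapping


# Toy data for illustration only; the resource assignments are made up.
words = ["Films", "based", "on", "works", "by", "Alice", "Munro"]
lemmas = ["film", "base", "on", "work", "by", "alice", "munro"]
toy_index = {
    "film": {"dbo:Film"},
    "work": {"dbo:Work"},
    "alice munro": {"dbr:Alice_Munro"},
}

mapping = build_mapping(words, lemmas, toy_index)
for segment, resources in mapping.items():
    print(segment, "->", sorted(resources))
```

Note that "Films" and "works" only hit the toy index through their lemmatized forms, which is the situation the lemmatized segments T_lemma are meant to cover.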
Step 2: perform part-of-speech tagging on the segmented words of the noun phrase and generate a syntax tree; recursively traverse the whole tree from the root, and take the order in which the leaf nodes, i.e. the individual words, are visited as the parsing order, as follows.
Step 2.1: generate the syntax tree of the noun phrase with a natural language processing tool;
Step 2.2: recursively traverse the whole tree from the root and take the traversal order of the leaf nodes as the parsing order. In the syntax tree every leaf node is a word, so the generated parsing order is a permutation of the words of the phrase. The current parsable word is the word that is to be parsed next; words are parsed one by one in the parsing order.
Furthermore, when the parsing order is generated, the head of a noun phrase is defined as the last word of its first noun group. A noun group is a longer nominal phrase formed by nouns modifying other nouns, and all noun groups of a noun phrase are obtained through part-of-speech analysis. For a noun phrase node in the syntax tree, first the child node containing the head of the current noun phrase is parsed as a new noun phrase node, then the child nodes to the left of the head are parsed from right to left, and finally the child nodes to the right of the head are parsed from left to right; that is, parsing proceeds outward from the head, from near to far. For a node that begins with a verb or adverb, the verb or adverb is parsed first, and the remaining parts are then parsed in the original order, i.e. the left-to-right or right-to-left order in force at the parent node. For a node that begins with an adjective, since the adjective necessarily modifies the head, the parts of the phrase other than the adjective are parsed first in the original order, and the adjective is parsed last.
Step 3: for a parsable word, resources are indexed from all text segments that contain the word, refinement operations are generated from those resources, and the corresponding description logic concepts are obtained by applying the refinement operations. Following the parsing order and starting from the EL++ top concept ⊤, the current concept is refined with the basic concepts generated from the indexed resources, each parsable word being processed in turn, until all words have been parsed and the description logic concept to which the noun phrase is mapped is obtained, as follows.
Step 3.1: for the current parsable word, list all candidate text segments, i.e. all text segments that contain the word.
Step 3.2: index the text segments to their resources according to the resource mapping table obtained in step 1, and generate all candidate refinement operations from those resources.
Basic concepts in the description logic language EL++ take 5 forms: the three forms corresponding to an individual {o}, an atomic concept A and a role r in EL++ description logic concepts, and two further forms that contain a hidden role name [formula images giving the exact notation not recovered]. The resources comprise entities, literals, attributes and types. For an indexed entity or literal, the corresponding forms are generated, namely {o} and its hidden-role form; for an indexed type, the corresponding forms are generated, namely A and its hidden-role form; for an indexed attribute, the corresponding role form is generated.
A refinement operation d is defined as follows: for a concept C, modify a part A of C with a basic concept B. For a known basic concept B and a known concept C, all possible refinement operations are generated by enumerating the part A of C to be refined. Specifically:
for an indexed entity or literal o, generate the corresponding basic concept {o} and all basic concepts containing hidden roles, and generate all refinement operations whose support degree is not 0;
for an indexed type A, generate the corresponding basic concept A and all basic concepts containing hidden roles, and generate all refinement operations whose support degree is not 0;
for an indexed attribute p, corresponding to a role r, generate the corresponding role-form basic concept, and generate all refinement operations whose support degree is not 0.
The support degree here is the support degree of the support set of the refinement operation: for a noun phrase NP with extension NP^I and a refinement operation d, it is the support degree of the support set Set_sup(NP, d) [formula image giving its notation not recovered].
Step 3.3: perform consistency screening on the newly generated candidate refinement operations and filter out those inconsistent with the syntax:
if the word currently being parsed is the head, refinement operations of one specific form, in which B_atomic is an atomic concept, are preferred; if the word currently being parsed is not the phrase head, refinement operations that are not of that form are preferred [formula images giving the form not recovered].
Step 3.4: use the refinement operations obtained in step 3.3 to generate refined description logic concepts for the current parsable word, score the resulting concepts and keep the top k by score; then check whether parsing is complete, i.e. whether the word just parsed is the last word in the parsing order; if not, return to step 3.1 to parse the next word; if so, go to step 3.5.
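The loop formed by steps 3.1 to 3.4 behaves like a beam search over concepts. The sketch below shows only that control flow, under stated assumptions: `beam_parse`, `refine` and `score` are hypothetical names, the candidate generation and consistency screening of steps 3.1 to 3.3 are folded into the caller-supplied `refine` callable, and leaving the beam unchanged when a word yields no refinement is an assumption, not something the text specifies.

```python
from typing import Callable, Iterable, List, Sequence, TypeVar

C = TypeVar("C")  # type of a concept; any representation works for this sketch


def beam_parse(order: Sequence[str],
               start: C,
               refine: Callable[[C, str], Iterable[C]],
               score: Callable[[C], float],
               k: int = 5) -> C:
    """Refine concepts word by word, keeping the k best after each word."""
    beam: List[C] = [start]
    for word in order:
        # Steps 3.1-3.3 (candidate segments, refinement operations, screening)
        # are assumed to happen inside `refine`.
        candidates = [c2 for c in beam for c2 in refine(c, word)]
        if not candidates:
            continue  # assumption: a word with no usable refinement leaves the beam as is
        # Step 3.4: score the refined concepts and keep the top k.
        beam = sorted(candidates, key=score, reverse=True)[:k]
    # Step 3.5: output the highest-scoring concept.
    return max(beam, key=score)


# Toy run: concepts are tuples of words, a refinement either absorbs the word
# or skips it, and the score simply rewards absorbing more words.
best = beam_parse(["a", "b"], (), lambda c, w: [c + (w,), c], len, k=2)
print(best)
```

Keeping k candidates rather than one lets an early, locally weaker refinement survive long enough to be combined with later words.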
The scoring function of a description logic concept is:
$$S_{score}(NP, C) = w_{sup} \cdot S_{sup}(NP, C) + w_{match} \cdot S_{match}(NP, C) + w_{sim} \cdot S_{sim}(NP, C)$$
where S_sup is the support score, S_match is the matching degree score, S_sim is the simplicity score, and w_sup, w_match, w_sim are the corresponding weights.
The support score S_sup of a description logic concept is defined as the mean of the smoothed support degrees of the support sets of the refinement operations performed while generating the concept. Given a noun phrase NP and a refinement operation d: NP^I is the set of entities described by the noun phrase, i.e. the extension of the phrase; for a concept C, C^I is the set of entities described by the concept; for a basic concept B, B^I is the set of entities described by the basic concept; and the refinement operation d modifies, for the concept C, a part A of C with the basic concept B. The support set is computed as:
[Formula image not recovered: definition of Set_sup(NP, d), with two cases]
where the first case applies when the part A modified by B describes the extension NP^I itself, and the second case applies when the part A modified by B describes a set of entities that are related to the extension. For example, in a refinement operation where the refined part is Work under a basedOn role [formula image not recovered], Work does not describe the extension NP^I but a set of entities with which the extension has a "basedOn" relationship.
S_sup is computed by the following formula, where d ranges over the refinement operations that generate C, the support degree is that of the support set Set_sup(NP, d), and ε is an empirical smoothing parameter, generally set to 1:
[Formula image not recovered: S_sup(NP, C)]
S_match is defined as the proportion of words in the phrase that can be matched by the concept:
$$S_{match}(NP, C) = \frac{|\{w \in NP \mid w \text{ is matched by } C\}|}{|NP|}$$
S_sim is defined as the negative of the number of refinement operations in the concept:
$$S_{sim}(C) = -|\{d \mid d \in C\}|$$
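The weighted sum and the two simpler components can be sketched directly in Python. This is an illustration under stated assumptions: the function names are hypothetical, the support score is passed in as a number because its formula is not available in this text, the choice of which word positions count as matched is made up, and the weight w_sim = 0.2 is an assumed value. The counts 6 of 7 words, 4 refinement operations and the support score 0.83 are the figures the embodiment reports for one candidate concept.

```python
def s_match(phrase_words, matched_positions):
    """Proportion of the phrase's words that the concept matches."""
    return len(set(matched_positions)) / len(phrase_words)


def s_sim(refinements):
    """Simplicity: the negative of the number of refinement operations."""
    return -len(refinements)


def s_score(s_sup_value, s_match_value, s_sim_value, w_sup, w_match, w_sim):
    """Weighted sum of the support, matching degree and simplicity scores."""
    return w_sup * s_sup_value + w_match * s_match_value + w_sim * s_sim_value


words = ["Films", "based", "on", "works", "by", "Alice", "Munro"]
matched = {0, 1, 2, 3, 5, 6}          # assumption: which 6 of the 7 words are matched
refinements = ["d1", "d2", "d3", "d4"]  # placeholders for 4 refinement operations

match_value = s_match(words, matched)
sim_value = s_sim(refinements)
# w_sup and w_match follow the embodiment; w_sim = 0.2 is an assumed value.
score = s_score(0.83, match_value, sim_value, w_sup=0.3, w_match=0.5, w_sim=0.2)
print(round(match_value, 3), sim_value, round(score, 4))
```

Because S_sim is negative and grows in magnitude with every refinement operation, the score trades coverage of the phrase against the size of the concept.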
Step 3.5: among the description logic concepts obtained after parsing all words in the parsing order, keep the one with the highest score S_score as the output C_best, i.e. the description logic concept to which the noun phrase is mapped, which is used for semantic understanding of the noun phrase over the knowledge base.
The invention is described in further detail below with reference to the figure and a specific embodiment. In the embodiment, the weight parameters are set to w_sup = 0.3 and w_match = 0.5 (the value of w_sim is missing from the extracted text), k is 5, and the experiments use the 2016-10 version of DBpedia as the knowledge base.
Examples
The input noun phrase is "Films based on works by Alice Munro". The entity set described by the phrase includes dbr:Away_from_Her and dbr:Edge_of_Madness [the remainder of the entity list is in formula images that were not recovered].
The present invention is further described in detail with reference to this example, so that those skilled in the art can implement it by following the description.
With reference to FIG. 1, the present invention specifically comprises the following steps:
Step 1: enumerate all text segments of the noun phrase and generate a mapping table from the text segments to resources in the knowledge base, as follows:
Use a natural language processing tool to perform word segmentation and lemmatization on the noun phrase, obtaining the word sequence "[Films, based, on, works, by, Alice, Munro]" and the lemma sequence "[film, base, on, work, by, alice, munro]". Enumerate all text segments of the segmented word sequence together with the corresponding lemmatized text segments. The text segments T and the corresponding lemmatized text segments T_lemma are indexed to the corresponding knowledge base resources, which comprise entities, literals, attributes and types. The index dictionary used for indexing is constructed offline from anchor text, label attribute values and redirects in DBpedia, and is stored as <natural language text, resource> pairs for fast lookup during indexing.
Some of the text segments obtained by indexing, and the resources they index to, are shown in Table 1.
TABLE 1
[Table image not recovered]
Step 2: generate the parsing order from the word segmentation, part-of-speech tags and syntax tree of the noun phrase, as follows:
Step 2.1: generate the syntax tree of the noun phrase with a natural language processing tool. The syntax tree of "Films based on works by Alice Munro" is "(TOP (NP (_ Films)) (VP (_ based) (PP (_ on) (NP (_ works)) (PP (_ by) (NP (_ Alice) (_ Munro)))))))))".
Step 2.2: recursively traverse the whole tree from the root, and take the traversal order of the leaf nodes (i.e. the individual words) as the parsing order.
The head of a noun phrase is defined as the last word of its first noun group. For a noun phrase node, first the child node containing the head of the current noun phrase is parsed as a new noun phrase node, then the child nodes to the left of the head are parsed from right to left, and finally the child nodes to the right of the head are parsed from left to right. That is, parsing proceeds outward from the head. For a node that begins with a verb or adverb, the verb or adverb is parsed first, and the remaining parts are then parsed in the original order. For a node that begins with an adjective, since the adjective necessarily modifies the head, the other parts of the phrase are parsed first in the original order, and the adjective is parsed last.
For "Films based on works by Alice Munro", the head "Films" is processed first, and then the part to the right of the head is processed from left to right. Since the first word of that node, "based", is a verb, "based" is processed first, and then "on" is processed in the original order. A new noun phrase, "works by Alice Munro", now appears. For this part, its head "works" is processed first, then "by" in order, and finally the new noun phrase "Alice Munro". The parsing order is therefore "Films", "based", "on", "works", "by", "Munro", "Alice".
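The head-outward traversal can be sketched in Python as follows. This is a simplified illustration, not the patent's implementation: the tuple encoding of the tree, the Penn Treebank style tags, the extra NP wrapper around the whole phrase and the function names are assumptions. Only the noun phrase rule is implemented explicitly; all other nodes are read left to right, which covers the verb-first case whenever the verb is the node's first child, and the special rule for nodes that begin with an adjective is omitted.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}  # assumed Penn Treebank noun tags


def is_leaf(node):
    # A leaf is (pos_tag, word); an inner node is (label, [children]).
    return isinstance(node[1], str)


def leaves(node):
    if is_leaf(node):
        return [node]
    return [leaf for child in node[1] for leaf in leaves(child)]


def head_position(node):
    """Index, among the node's leaves, of the last word of the first noun group."""
    tags = [tag for tag, _ in leaves(node)]
    for i, tag in enumerate(tags):
        if tag in NOUN_TAGS:
            while i + 1 < len(tags) and tags[i + 1] in NOUN_TAGS:
                i += 1
            return i
    return None


def parse_order(node):
    if is_leaf(node):
        return [node[1]]
    label, children = node
    head = head_position(node) if label == "NP" else None
    if head is None:
        # Simplification: non-NP nodes are traversed left to right.
        return [word for child in children for word in parse_order(child)]
    # Find the child that contains the head word.
    offset = 0
    for k, child in enumerate(children):
        size = len(leaves(child))
        if head < offset + size:
            break
        offset += size
    order = parse_order(children[k])            # head child first
    for child in reversed(children[:k]):        # then left siblings, right to left
        order += parse_order(child)
    for child in children[k + 1:]:              # then right siblings, left to right
        order += parse_order(child)
    return order


tree = ("NP", [
    ("NP", [("NNS", "Films")]),
    ("VP", [("VBN", "based"),
            ("PP", [("IN", "on"),
                    ("NP", [("NP", [("NNS", "works")]),
                            ("PP", [("IN", "by"),
                                    ("NP", [("NNP", "Alice"), ("NNP", "Munro")])])])])]),
])
print(parse_order(tree))
```

On this hand-built tree the sketch yields the same order as the walkthrough above; a simple phrase such as "American songwriters" comes out head first, then the modifier.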
Step 3: following the parsing order and starting from the EL++ top concept ⊤, repeatedly refine the current concept with the basic concepts generated from the indexed resources until all words have been parsed, obtaining the description logic concept to which the noun phrase is mapped, as follows:
Step 3.1, for the word currently being parsed, generate all candidate text segments that contain it. For example, for the word "Films", all candidate text segments are generated, including "Films", "Films based on", and so on.
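A minimal sketch of this enumeration, assuming that a text segment is a contiguous run of words (an n-gram) and that the word being parsed is identified by its position in the phrase. The function name `segments_containing` is hypothetical.

```python
def segments_containing(tokens, index):
    """All contiguous text segments (n-grams) of `tokens` that include position `index`."""
    result = []
    for start in range(index + 1):
        for end in range(index + 1, len(tokens) + 1):
            result.append(" ".join(tokens[start:end]))
    return result


tokens = "Films based on works by Alice Munro".split()
films_segments = segments_containing(tokens, 0)
print(films_segments[:3])
```

For a word at position i in a phrase of n words there are (i + 1) * (n - i) such segments, so the candidate set stays small for typical noun phrases.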
Step 3.2, look up the resources corresponding to each text segment in the resource index obtained in step 1, and generate all candidate refinement operations from those resources. For the text segment "Films", the indexed resources include the type "dbo:Film", the entity "dbr:Film", the attribute "dbo:openingFilm", and so on. These generate basic concepts such as [formula image BDA0003646159500000082], Film, and {Film}, which are used to refine the known concept, yielding refinement operations such as [formula image BDA0003646159500000083], and so on.
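The lookup from a text segment to its indexed resources might look like the sketch below. The table contents, the toy lemma table, and the entry for "alice munro" are assumptions made for illustration; only the three resources for "Films" come from the example, written here in prefixed form. Generating the basic concepts and refinement operations from the returned resources is not shown.

```python
# Hypothetical mapping table from lemmatized text segments to (kind, resource) pairs.
RESOURCE_INDEX = {
    "film": [("type", "dbo:Film"), ("entity", "dbr:Film"),
             ("attribute", "dbo:openingFilm")],
    "alice munro": [("entity", "dbr:Alice_Munro")],  # illustrative entry
}

# Toy lemma table; a real system would use a lemmatizer.
LEMMAS = {"films": "film", "works": "work"}


def lemmatize(segment):
    return " ".join(LEMMAS.get(word, word) for word in segment.lower().split())


def lookup(segment):
    """Resources indexed under the segment's surface form or its lemmatized form."""
    found = []
    for key in (segment.lower(), lemmatize(segment)):
        for resource in RESOURCE_INDEX.get(key, []):
            if resource not in found:
                found.append(resource)
    return found


print(lookup("Films"))
```

Segments with no entry, such as "based on" in this toy table, simply return an empty list and contribute no candidate refinement operations.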
Step 3.3, perform consistency screening on the newly generated candidate refinement operations and filter out those that are inconsistent with the syntax. For the current head "Films", operations such as [formula image BDA0003646159500000084] are filtered out, and [formula image BDA0003646159500000085] is retained.
Step 3.4, rank the refined concepts and keep the top k by score. For each concept, check whether parsing is complete; if not, return to step 3.1; if so, go to step 3.5.
At this point only one concept, Film, exists. Since parsing is not yet complete for it, the process returns directly to step 3.1 to search for further refinement operations. After several iterations, the candidate refined concepts include [formula images BDA0003646159500000091 to BDA0003646159500000093], and so on. Each candidate is scored. For example, [formula image BDA0003646159500000094] has a matching score of 6/7 = 0.857, a support score of 0.83, and a simplicity score of -4, giving a calculated score of -0.1225. The top 2 concepts by score are retained, namely [formula image BDA0003646159500000095] and [formula images BDA0003646159500000096 and BDA0003646159500000097]. Since parsing is found to be complete, the process proceeds to step 3.5.
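The component scores in this example can be reproduced with a short sketch that follows the weighted-sum scoring function stated in claim 1. The function names are hypothetical, and the weights passed at the end are placeholders: the weights used in the embodiment are not given, so the sketch does not attempt to reproduce the final score of -0.1225.

```python
def match_score(matched_words, total_words):
    """Proportion of words in the noun phrase matched by the concept."""
    return matched_words / total_words


def simplicity_score(num_refinements):
    """Negated number of refinement operations in the concept."""
    return -num_refinements


def total_score(s_sup, s_match, s_sim, w_sup, w_match, w_sim):
    """Weighted sum of the support, matching, and simplicity scores."""
    return w_sup * s_sup + w_match * s_match + w_sim * s_sim


s_match = match_score(6, 7)   # 6 of the 7 words are matched
s_sim = simplicity_score(4)   # the concept uses 4 refinement operations
print(round(s_match, 3), s_sim)  # 0.857 -4

# Placeholder weights, not the values used in the embodiment.
print(total_score(0.83, s_match, s_sim, w_sup=1.0, w_match=1.0, w_sim=0.1))
```

Because the simplicity term is negative, a concept that needs more refinement operations must earn its place through higher support or a better match.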
Step 3.5, among all concepts obtained so far, the one with the highest score is retained and output as $C_{best}$. In this embodiment, the highest-scoring concept output is [formula image BDA0003646159500000098].
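Taken together, steps 3.1 to 3.5 resemble a beam search over refined concepts. The skeleton below is a sketch under assumptions, not the patented procedure: `expand`, `is_consistent`, and `score` are hypothetical callbacks standing in for candidate generation (steps 3.1 and 3.2), consistency screening (step 3.3), and the scoring function, and concepts are represented by toy tuples of words rather than EL++ expressions.

```python
import heapq


def refine_search(parsing_order, expand, is_consistent, score, k=2, top="TOP"):
    """Refine the kept concepts word by word, keeping the k best after each word."""
    beam = [top]
    for word in parsing_order:
        candidates = [
            refined
            for concept in beam
            for refined in expand(concept, word)
            if is_consistent(refined, word)
        ]
        # Assumption: if a word yields no consistent refinement, keep the beam unchanged.
        if candidates:
            beam = heapq.nlargest(k, candidates, key=score)
    return max(beam, key=score)  # the highest-scoring concept is the output


def toy_expand(concept, word):
    # Either attach the word to the concept or leave the concept as it is.
    return [concept + (word,), concept]


best = refine_search(
    ["Films", "based", "on"],
    expand=toy_expand,
    is_consistent=lambda concept, word: True,
    score=len,
    k=2,
    top=(),
)
print(best)
```

Keeping only k candidates per word bounds the work at each step, at the cost of possibly discarding a concept that would have scored best after later refinements.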
Compared with Cat2Ax and the work of Pasca et al. (i.e., H-M decompose), the invention achieves higher accuracy and produces more complete and better description logic concept mappings, as shown in Table 2.
TABLE 2
Method | Partial match rate | Complete match rate
The invention | 0.71 | 0.53
Cat2Ax | 0.42 | 0.21
H-M decompose* | 0.36 | 0.29

Claims (5)

1. A method for mapping noun phrases to description logic concepts based on extensions, characterized in that a noun phrase is mapped, through its extension, to a concept expressed in the description logic language EL++, thereby producing an understanding of the noun phrase over a given knowledge base, the method comprising the following steps:
step 1, performing word segmentation and lemmatization on the noun phrase; enumerating all text segments T over the segmented word sequence, i.e., all n-grams in the noun phrase, together with the corresponding lemmatized text segments $T_{lemma}$; indexing the text segments to resources of the knowledge base; and generating a mapping table from text segments to resources in the knowledge base;
step 2, performing part-of-speech tagging on the segmented words of the noun phrase to generate a syntax tree, recursively traversing the whole tree from the root, and taking the order in which the leaf nodes, i.e., the individual words, are visited as the parsing order;
step 3, following the parsing order, starting from the EL++ concept $\top$ and repeatedly refining it with basic concepts generated from the indexed resources, parsing each parsable word in turn, until all words have been parsed, to obtain the description logic concept to which the noun phrase is mapped:
step 3.1, for the current parsable word, enumerating all candidate text segments that contain it;
step 3.2, indexing the candidate text segments to the corresponding resources according to the mapping table obtained in step 1, and generating candidate refinement operations from the corresponding resources;
step 3.3, performing consistency screening on the newly generated candidate refinement operations, and filtering out those that are inconsistent with the syntax;
step 3.4, generating refined description logic concepts for the current parsable word using the refinement operations obtained in step 3.3, scoring the resulting description logic concepts, and retaining the top k by score; then checking whether parsing is complete, i.e., whether the word just parsed is the last parsable word in the parsing order; if not, returning to step 3.1 to parse the next parsable word; if so, proceeding to step 3.5;
the scoring function of a description logic concept is:
$S_{score}(NP, C) = w_{sup} \cdot S_{sup}(NP, C) + w_{match} \cdot S_{match}(NP, C) + w_{sim} \cdot S_{sim}(NP, C)$
where $S_{sup}$ is the support score, $S_{match}$ is the matching score, $S_{sim}$ is the simplicity score, and $w_{sup}$, $w_{match}$, $w_{sim}$ are the corresponding weights;
the support score $S_{sup}$ of a description logic concept is defined as a smoothed mean of the supports of the support sets of the refinement operations applied while generating the concept. Given a noun phrase NP and a refinement operation [formula image FDA0003646159490000012], $NP^I$ is the set of entities described by the noun phrase, i.e., the extension of the phrase; for a concept C, $C^I$ is the set of entities described by C; and for a basic concept B, $B^I$ is the set of entities described by B. The refinement operation [formula image FDA0003646159490000013] means that, for a concept C, a part A of C is modified by a basic concept B. The support set $Set_{sup}$ is calculated as follows:
[formula image FDA0003646159490000011]
where [formula image FDA0003646159490000027] refers to the case in which the part A modified by B describes the extension $NP^I$ itself, and [formula image FDA0003646159490000028] refers to the case in which the part A modified by B describes a set of entities related to the extension;
$S_{sup}$ is calculated by the following formula, where d denotes a refinement operation on the concept C and [formula image FDA0003646159490000021] is the support of the support set [formula image FDA0003646159490000022]:
[formula image FDA0003646159490000023]
$S_{match}$ is defined as the proportion of words in the noun phrase NP that are matched by the concept C, calculated as follows:
[formula image FDA0003646159490000024]
$S_{sim}$ is defined as the negative of the number of refinement operations in the concept, calculated as follows:
$S_{sim}(C) = -\left|\{d \mid d \in C\}\right|$
step 3.5, among the description logic concepts obtained for all words following the parsing order, retaining the one with the highest score as the output $C_{best}$, i.e., the description logic concept to which the noun phrase is mapped, which is used for semantic understanding of the noun phrase over the knowledge base.
2. The method for mapping noun phrases to description logic concepts based on extensions according to claim 1, characterized in that the resources include entities, literals, attributes, and types.
3. The method for mapping noun phrases to description logic concepts based on extensions according to claim 1, characterized in that, when the parsing order is generated, all noun groups of the noun phrase are obtained through part-of-speech analysis, and the head of a noun phrase is defined as the last word of its first noun group; for a noun phrase node in the syntax tree, the child node containing the head of the current noun phrase is parsed first, as a new noun phrase node, the child nodes to the left of the head are then parsed from right to left, and finally the child nodes to the right of the head are parsed from left to right, i.e., the parsing order runs from near to far relative to the head; for a node that begins with a verb or adverb, the verb or adverb is parsed first, and the remaining parts are then parsed in the left-to-right or right-to-left order that applies at the parent node; for a node that begins with an adjective, the parts other than the adjective are first parsed in the order of the parent node's phrase, and the adjective is parsed last.
4. The method for mapping noun phrases to description logic concepts based on extensions according to claim 1, characterized in that step 3.2 generates all candidate refinement operations from the corresponding resources as follows:
basic concepts in the description logic are defined to take 5 forms: the forms corresponding to an individual {o}, an atomic concept A, and a role in an EL++ description logic concept, the latter being [formula image FDA0003646159490000029], and the forms with hidden role names, [formula image FDA00036461594900000210] and [formula image FDA00036461594900000211]; for indexed entities and literals, the corresponding forms are generated, including {o} and [formula image FDA0003646159490000025]; for indexed types, the corresponding forms are generated, including A and [formula image FDA00036461594900000212]; for indexed attributes, the corresponding form [formula image FDA00036461594900000213] is generated;
the refinement operation [formula image FDA0003646159490000026] is defined as: for a concept C, modifying a part A of C with a basic concept B; for a known basic concept B and a known concept C, all possible refinement operations are generated by enumerating the parts A of C to be refined, wherein for an indexed entity or literal o, the corresponding basic concept {o} and all basic concepts containing hidden roles, [formula image FDA0003646159490000033], are generated, and all refinement operations whose support is not 0 are generated; for an indexed type A, the corresponding basic concept A and all basic concepts containing hidden roles, [formula image FDA0003646159490000034], are generated, and all refinement operations whose support is not 0 are generated; for an indexed attribute p with corresponding role r, the corresponding basic concept [formula image FDA0003646159490000035] is generated, and all refinement operations whose support is not 0 are generated.
5. The method for mapping noun phrases to description logic concepts based on extensions according to claim 1, characterized in that, in step 3.3, if the word currently being parsed is the phrase head, refinement operations of the form [formula image FDA0003646159490000031] are preferred, where $B_{atomic}$ is an atomic concept; conversely, if the word currently being parsed is not the phrase head, refinement operations not of the form [formula image FDA0003646159490000032] are preferred.
CN202210530158.4A 2022-05-16 2022-05-16 Method for mapping noun phrases to descriptive logic concepts based on extension Pending CN115186671A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210530158.4A CN115186671A (en) 2022-05-16 2022-05-16 Method for mapping noun phrases to descriptive logic concepts based on extension

Publications (1)

Publication Number Publication Date
CN115186671A true CN115186671A (en) 2022-10-14

Family

ID=83513914


Country Status (1)

Country Link
CN (1) CN115186671A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117556024A (en) * 2024-01-10 2024-02-13 腾讯科技(深圳)有限公司 Knowledge question-answering method and related equipment
CN117556024B (en) * 2024-01-10 2024-04-30 腾讯科技(深圳)有限公司 Knowledge question-answering method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination