CN113312922A - Improved chapter-level triple information extraction method - Google Patents


Info

Publication number
CN113312922A
CN113312922A
Authority
CN
China
Prior art keywords
entity
node
verb
semantic
text
Prior art date
Legal status
Granted
Application number
CN202110399643.8A
Other languages
Chinese (zh)
Other versions
CN113312922B (en)
Inventor
李少锋
王妍妍
王玉坤
高菁
陈文颖
张春晖
Current Assignee
CETC 28 Research Institute
Original Assignee
CETC 28 Research Institute
Priority date
Filing date
Publication date
Application filed by CETC 28 Research Institute filed Critical CETC 28 Research Institute
Priority to CN202110399643.8A priority Critical patent/CN113312922B/en
Publication of CN113312922A publication Critical patent/CN113312922A/en
Application granted granted Critical
Publication of CN113312922B publication Critical patent/CN113312922B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G06F16/00: Information retrieval; database structures therefor; file system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367: Ontology
    • G06F40/10: Text processing
    • G06F40/12: Use of codes for handling textual entities
    • G06F40/151: Transformation
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides an improved chapter-level (discourse-level) triple information extraction method comprising the following steps: first, preprocessing the text data; second, performing chapter-level semantic analysis on the text data, comprising hierarchical semantic analysis, entity alignment, and dependency verb extraction; third, performing heuristic learning in a multi-round iterative manner to construct an event semantic model; fourth, extracting triples based on end-to-end samples and on chapter-level understanding; and fifth, applying the triple knowledge extracted in the third and fourth steps to intelligent retrieval, intelligent question answering, knowledge mining, decision support, and the like. The method builds a triple information extraction model from small samples, has chapter-level triple extraction capability, is easy to generalize and extend, and serves as an important basic link for large-scale text information extraction, knowledge system establishment, and construction of vertical-domain knowledge graphs.

Description

Improved chapter-level triple information extraction method
Technical Field
The invention relates to an improved chapter-level triple information extraction method.
Background
Research in natural language processing began with words and dictionaries, and the sentence has long been the core object of study; in recent years many linguists have sought to extend semantic analysis to the chapter (discourse) level in theory. Because the discourse level lacks formal markup, computational linguistics has made no particularly significant progress there. However, many semantic problems must be solved fundamentally at the chapter level, such as coreference resolution, recognition of chapter structure and semantic relations, and event fusion and relation recognition. At the same time, solving these chapter-level semantic problems in turn guides analysis at the word and sentence levels. On the other hand, recent progress in Chinese word- and sentence-level natural language processing, especially the staged results of research on word sense disambiguation, syntactic analysis, and semantic role labeling, has created the technical conditions for research on chapter semantic analysis.
Generally, Chinese sentences are long, and a single sentence often contains multiple entities, so the number of entity pairs formed from it is large and the distribution of entity types is uneven. Compared with relation detection and relation extraction over simple sentences, the complex patterns of long sentences make these tasks harder: long sentences often contain multiple entities, and sentences whose entity pairs span long distances often contain multiple verbs. Therefore, selecting the verbs that effectively indicate whether a semantic relation exists between an entity pair, and of which specific type, becomes the key to relation detection and relation extraction. The biggest challenge in current extraction is that training data are insufficient and relation instances are extremely unevenly distributed across categories. Current means of entity relation extraction are mainly based on templates, dependency syntactic parsing, and deep learning. Template-based entity relation extraction suffers from low precision and recall. Dependency-syntax-based extraction faces the problem of semantic loss. Deep-learning-based extraction has obtained good experimental results in some fields, with no obvious performance differences among models, but at the cost of labeling a large number of training and test samples for predefined relation types, where the samples are relatively simple short sentences and the samples of each relation are relatively evenly distributed. However, accurately labeling sentence-level data by hand is very expensive and requires much time and labor.
In practical scenarios, relying on manual labeling to produce training data is an almost impossible task when facing thousands of relations, tens of millions of entity pairs, and hundreds of millions of sentences. Meanwhile, in practice the frequency of relations and entity pairs follows a long-tail distribution, with a large number of relations or entity pairs for which few samples exist. The effect of neural network models must be guaranteed by large-scale annotated data; they suffer from the problem of needing ten examples to learn one case. How to improve the learning ability of deep models so that they can instead infer three cases from one example is a problem that relation extraction must solve. Furthermore, existing models extract relations between entities mainly from a single sentence, requiring that the sentence contain both entities. In fact, a large number of inter-entity relations are expressed across multiple sentences of a document, or even across multiple documents. How to extract relations in such a more complex context is another open problem. Existing task settings generally assume a predefined, closed relation set and convert the task into a relation classification problem; in this way, new relations between entities contained in the text cannot be obtained. The above approaches achieve a certain effect on test sets of relatively simple short sentences with relatively even sample distributions per relation, but in practical applications, especially chapter-level triple extraction from text, many problems remain, such as data scale, learning ability, complex context, and open relations.
If a theory and method system of chapter semantic analysis with both theoretical depth and practical feasibility can be established, it will be of great significance to the development of natural language processing research and applications.
In the information era, how to mine and establish a comprehensive and accurate knowledge system from massive text data and related reports, construct vertical-domain knowledge graphs, and support subsequent applications such as intelligent search, intelligent question answering, knowledge mining, and decision support has become a technical problem. Chapter-level triple information extraction is an effective means to this end; for knowledge extracted from chapters to be applied in industry at scale, a method is needed that can accurately extract high-quality entity associations from a small number of labeled samples.
Disclosure of Invention
The purpose of the invention is as follows: a chapter-level triple information extraction method is provided for mining and establishing a comprehensive and accurate knowledge system and knowledge graph from massive text data and related annual reports. Using natural language processing technology and machine learning algorithms, the method extracts high-quality entity associations under limited-sample conditions, builds vertical-domain knowledge graphs, strongly supports the establishment of domain knowledge systems, and assists in mining and assessing information relations.
In order to solve the above technical problem, the present invention provides an improved chapter-level triple information extraction method, including the following steps:
step 1, preprocessing text data;
step 2, performing chapter-level semantic analysis on the text data;
step 3, carrying out heuristic learning by adopting a multi-round iteration mode, and constructing an event semantic model;
and 4, extracting the triples based on the end-to-end samples.
The step 1 comprises the following steps:
step 1-1, converting the format of the text data: the acquired text data is converted, using existing natural language processing technology, into a format on which natural language processing can be performed directly, for example by extracting text from pdf and doc files;
step 1-2, preprocessing and cleaning the text data after format conversion by using a natural language processing technology;
step 1-3, text data chapter structure processing: splitting a long document into text blocks by paragraphs and periods;
and 1-4, splitting text data sentence blocks, and further splitting the text blocks into physical sentence blocks with punctuation intervals.
The step 1-2 comprises: sequentially performing the following processing on the format-converted text data: full-width to half-width conversion, conversion of uppercase numerals to lowercase numerals, conversion of uppercase letters to lowercase letters, removal of emoticons, removal of all non-Chinese characters (retaining only Chinese), Chinese word segmentation, conversion of traditional to simplified Chinese, and filtering of Chinese stop words.
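The cleaning pipeline of step 1-2 can be sketched in Python. This is a minimal illustration, not the patent's implementation: the stop-word list is a tiny illustrative stand-in, and per-character tokenization stands in for a real Chinese segmenter such as jieba.

```python
import re

STOPWORDS = {"的", "了", "和", "是"}  # illustrative stop-word list, not from the patent


def fullwidth_to_halfwidth(text: str) -> str:
    # Full-width ASCII variants occupy U+FF01..U+FF5E; full-width space is U+3000.
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


def clean(text: str) -> list:
    text = fullwidth_to_halfwidth(text)       # full-width to half-width
    text = text.lower()                       # uppercase letters to lowercase
    text = re.sub(r"[^\u4e00-\u9fff]", "", text)  # keep Chinese characters only
    tokens = list(text)                       # stand-in for a real word segmenter
    return [t for t in tokens if t not in STOPWORDS]  # stop-word filtering
```

The remaining steps of the pipeline (emoticon removal, traditional-to-simplified conversion) would slot in as further passes of the same shape.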
The steps 1 to 4 comprise:
step 1-4-1, for parentheses in a text block: if the content inside the parentheses is semantically close to the component adjacent on its left (semantic components within the same semantic fragment are close, while those of different fragments are not; for example, the subject and object of fragment 1 are close, while the subject of fragment 1 and the object of fragment 2 are not), the parenthesised content and the adjacent text component on the left are merged into one semantic component; otherwise the parentheses are not processed;
step 1-4-2, for quotation marks in a sentence block: if the quoted body belongs to a named entity (a named entity is an entity with specific significance in the text, mainly including person names, place names, organization names, proper nouns, and the like; a named-entity library can be established), the quoted body is merged with the named entity; otherwise no processing is performed;
and 1-4-3, for other symbols in a sentence block: if the symbol is part of a named entity (such as the separator dot in foreign names, or the title marks added to some book titles), the symbol and its related context are merged into one semantic entity; otherwise the symbol is used as a mark dividing physical sentence blocks.
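A minimal sketch of the physical sentence-block splitting in steps 1-4-1 to 1-4-3, handling only the parenthesis rule: parenthesised content stays attached to the component on its left rather than being split off. The punctuation set and the placeholder-protection trick are illustrative assumptions, not from the patent.

```python
import re

CLAUSE_PUNCT = "，。；！？"  # assumed Chinese clause-level punctuation


def split_sentence_blocks(block: str) -> list:
    """Split a text block into punctuation-separated physical sentence blocks,
    keeping full-width parenthesised spans attached to their left neighbour."""
    # Protect commas inside （…） so they do not trigger a split.
    protected = re.sub(
        r"（[^）]*）", lambda m: m.group(0).replace("，", "\x00"), block
    )
    parts = re.split(f"[{CLAUSE_PUNCT}]", protected)
    return [p.replace("\x00", "，") for p in parts if p]
```

Quotation marks and other named-entity-internal symbols (step 1-4-2/1-4-3) would be protected the same way before splitting.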
The step 2 comprises the following steps:
step 2-1, performing semantic analysis on continuous text in the chapter using known lexical and syntactic knowledge from linguistics, generating for each continuous text block a list composed of parse trees;
step 2-2, decomposing complex semantics into a hierarchical semantic structure by combining the information structure of the text data, the category of the terms playing a specific role and the category of the text data;
step 2-3, entity alignment is carried out;
and 2-4, extracting for each entity its syntactically nearest dependency verb.
In step 2-2, each level in the hierarchical semantic structure comprises N semantic blocks concerning facts or concepts, where N is a natural number. Following post-order traversal, query operations are performed first on the semantic blocks of nested layers (a nested layer is a semantic block whose semantics contain further nested semantics; step 2-2 decomposes complex semantics into a hierarchical structure in which multiple layers of semantics may be nested) to determine their extensions; after the nested layers are processed, query operations are performed on the remaining fact or concept semantic blocks to determine the extension of each.
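The post-order policy above, nested sub-layers resolved before the blocks that contain them, can be sketched as follows. The `SemanticBlock` class is illustrative, and the example mirrors the Edison sentence used in fig. 4.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative sketch: resolve nested semantic blocks before their parents,
# so each block's extension is fixed only after its sub-layers are known.


@dataclass
class SemanticBlock:
    label: str                                  # the fact/concept this block denotes
    children: List["SemanticBlock"] = field(default_factory=list)


def resolve_extensions(block: SemanticBlock, order: list) -> None:
    for child in block.children:                # nested layers first (post-order)
        resolve_extensions(child, order)
    order.append(block.label)                   # then the enclosing block


# Example from fig. 4: "Edison invented the incandescent lamp that illuminates
# the night" nests a sub-layer about "the incandescent lamp".
lamp = SemanticBlock("incandescent lamp", [SemanticBlock("illuminates the night")])
top = SemanticBlock("Edison invented ...", [SemanticBlock("Edison"), lamp])
order = []
resolve_extensions(top, order)
```

After the call, `order` lists every nested block before the block that contains it.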
The step 2-3 comprises the following steps:
judging, by entity name, whether an entity of the same name exists in a pre-established entity library. If not, a new entity pair is generated and added to the library. Otherwise, all same-name entity pairs are obtained, the similarity between the target entity pair and each of them is computed, and the candidates are comprehensively scored and ranked according to the respective similarities of category labels, attribute labels, and unstructured-text keywords. If the best score is below a threshold (the threshold cannot be fixed in advance and must be tuned to the specific situation), the target entity is added to the library; otherwise the highest-scoring result is selected as the alignment result of the target entity. Entity alignment determines whether two or more entities from different sources refer to the same real-world object. If multiple entities represent the same object, an alignment relation is constructed among them, and the information they contain is fused and aggregated. The target entity is an entity extracted from the text; the purpose here is to determine whether an entity in the text and an entity in the library have a co-reference relationship.
The steps 2-4 comprise:
step 2-4-1, let the two distinct entities be e_i and e_j. Their dependency-associated nodes e_i' and e_j', which stand in a coordinate structure or attributive (modifier-head) structure with e_i and e_j respectively, are extracted as follows: set the current node to the parent of e; while the dependency relation of the current node is a coordinate-structure or attributive-structure relation, continue traversing to its parent; when it no longer is, return the current node;
step 2-4-2, the verb V_j nearest in dependency relation to the dependency-associated node e_j' of the 2nd entity e_j is extracted as follows: initialize the return value to null and set the current node to the parent of e. While the current node is not the root node, judge: if it is a verb node, it is the verb nearest to entity e in the dependency relation; end the loop and return that verb node as the sought nearest verb. Otherwise set the parent of the current node as the current node and continue judging;
step 2-4-3, the verb V_i nearest to the dependency-associated node e_i' of the 1st entity e_i in a subject-predicate relation or fronted-object relation is obtained as follows: initialize the return value to null and set the current node to the parent of e. While the current node is not the root node, judge: if it is a verb node and it stands in a subject-predicate or fronted-object relation with the entity, it is the verb nearest to entity e in the dependency relation; end the loop and return that verb node. Otherwise set the parent of the current node as the current node and continue judging;
step 2-4-4, by judging whether the verbs V_i and V_j are the same verb or stand in a coordinate-structure relation, the dependency verb DV of the entity pair <e_i, e_j> is determined, and thereby a triple is determined.
The step 3 comprises the following steps:
step 3-1, performing hierarchical semantic analysis on the text data, and generating mapping knowledge, identification knowledge and association knowledge according to a hierarchical semantic structure;
step 3-2, generating extraction knowledge according to the parsing tree generated by the training corpus and the parameter mapping, specifically comprising: step 3-2-1, independently constructing a mapping rule for each semantic level with parameter mapping; the mapping rule refers to a rule from a specific semantic level to a target structure segment;
step 3-2-2, if parameter mappings at different levels exist in the same parse tree, constructing an identification rule containing the levels according to the nested points (preferentially utilizing a target structure to construct the identification rule, and utilizing a semantic structure instead when the target structure cannot be utilized); a nesting point refers to a sentence of text containing a plurality of semantic phrases; the identification rule refers to that for the same target structure, the parse trees with the default components and the reference components can complete the components by contrasting with the complete parse tree;
3-2-3, if parameter mapping related to the same target structure exists in different parse trees, constructing a cross sentence block identification rule according to the association points; the association point is a connection point formed by default and reference relations among different sentence blocks, namely a precedent and a reference in the reference, and a precedent and a default in the default;
3-2-4, if two or more sentence blocks appear in the end-to-end sample, the sentence blocks contain parameter mappings, and the sample provides no associated marking information about them, the user is actively prompted to supplement the corresponding association marks;
and 3-2-5, if the center component modified and limited by the modifier in one layer is extracted, and other components in the layer are not extracted, the layer is not processed.
The step 4 comprises the following steps:
step 4-1, obtaining a primary first-order logic formula according to the hierarchical semantic structure of the input text;
step 4-2, performing association reasoning by using a first-order logic formula (the first-order logic formula is obtained by text semantic analysis, can be a rule or a fact), and realizing the variable unification of the first-order logic formula by using the default, reference and unification relations among contexts to obtain a unified first-order logic formula after default recovery, reference resolution and entity unification;
4-3, mapping reasoning is performed using the unified first-order logic formulas; each independent first-order logic formula may generate a primary target structure fragment;
4-4, identification reasoning is performed using the integrated first-order logic formulas or the primary target structure fragments to obtain coupled target structure fragments;
step 4-5, if two position-adjacent or overlapping coupled target structure fragments have the same predicate, but the subjects and objects the predicate takes from the text phrases are completely different, or the values of any shared parameters are the same, the two fragments are directly merged into a larger target structure as the final output; otherwise, step 4-6 is executed;
step 4-6, regarding two coupled target structure segments with adjacent or overlapped positions as different target structure examples of the same predicate, and taking the different target structure examples as final output;
and 4-7, repeating the steps 4-5 and 4-6 until no new and larger coupling target segments are generated, and obtaining all target structure examples which are final outputs.
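One loose reading of the merge rule in steps 4-5 and 4-6 can be sketched as follows: fragments with the same predicate merge when their argument slots are complementary or agree, and stay separate instances when a shared slot conflicts. The fragment representation (a predicate plus an argument dictionary) is an assumption for illustration.

```python
def merge_fragments(f1: dict, f2: dict):
    """Merge two coupled target structure fragments with the same predicate
    if their argument slots do not conflict; return None when they must stay
    separate target-structure instances (step 4-6). Representation assumed."""
    if f1["predicate"] != f2["predicate"]:
        return None
    merged = dict(f1["args"])
    for slot, value in f2["args"].items():
        if slot in merged and merged[slot] != value:
            return None                      # conflicting values: two instances
        merged[slot] = value
    return {"predicate": f1["predicate"], "args": merged}
```

Step 4-7 would apply this repeatedly until no larger fragment is produced.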
The invention further comprises a step 5: applying the triple knowledge extracted in steps 3 and 4, for example to intelligent retrieval, intelligent question answering, knowledge mining, and decision support.
Compared with the prior art, the invention has the following remarkable advantages:
(1) the method adopts a hierarchical semantic analysis technology based on semantic patterns and uses it to realize heuristic learning on end-to-end samples, achieving the effect of inferring three cases from one example; it extracts triple information on the basis of chapter-level understanding, ensuring that the extracted triples are complete and usable;
(2) small-sample training is realized through heuristic learning. Because the knowledge used in the event semantic model is based on semantic patterns, and semantic patterns are highly reused in natural language expression, one end-to-end sample can contribute highly reusable extraction knowledge; training can therefore be completed without a huge number of samples, effectively alleviating the lack of effective samples.
(3) The method is based on chapter-level semantic analysis and is extensible: it can extract not only binary relations (triples) but also multivariate relations;
(4) the method has high accuracy and recall rate, and is an effective means for forming a high-quality knowledge map in the vertical field and realizing intelligent analysis of the field knowledge.
Drawings
The foregoing and/or other advantages of the invention will become further apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.
FIG. 1 is a block flow diagram of the present invention.
FIG. 2 is a flow chart of the text data preprocessing of the present invention.
Fig. 3 is a flow chart of entity alignment of the present invention.
FIG. 4 is an exemplary diagram of a hierarchical semantic structure of the present invention.
Detailed Description
Aiming at problems common in existing triple extraction, such as inaccurately extracted information, large training-sample requirements, and high cost, the method adopts a hierarchical semantic analysis technology based on semantic patterns to establish an event semantic model, effectively capturing the entity relations and information structures contained in text. Heuristic learning reduces the number of required samples and realizes chapter-level triple information extraction, which can effectively solve or alleviate the problems of data scale, learning ability, complex context, and open relations, and can form high-quality vertical-domain knowledge graphs. The invention provides an improved chapter-level triple information extraction method, as shown in fig. 1, including:
step 1, preprocessing text data;
step 1-1, converting the format of text data, and extracting effective text content from documents in pdf, docx and other formats;
and 1-2, preprocessing and cleaning the format-converted text data using natural language processing technology. The converted text may contain useless information such as advertisements and special characters without practical significance; it is preprocessed with natural language processing technology, including: full-width to half-width conversion, conversion of uppercase numerals to lowercase numerals, conversion of uppercase letters to lowercase letters, removal of emoticons, removal of all non-Chinese characters (retaining only Chinese), Chinese word segmentation, conversion of traditional to simplified Chinese, and filtering of Chinese stop words; the preprocessing flow is shown in figure 2;
step 1-3, text data chapter structure processing, splitting a longer document into a plurality of text blocks (knowledge points);
step 1-4, sentence-block splitting of the text data: the text blocks are further split into physical sentence blocks separated by punctuation, specifically comprising:
step 1-4-1, for parentheses in a sentence block, if the content inside the parentheses is in a close coupling relation with the adjacent component on the left, the two are merged into one semantic component; otherwise the parentheses are not processed;
step 1-4-2, for quotation marks in the sentence block, if the quotation mark body belongs to a part of a certain named entity, merging the quotation mark body with the named entity, otherwise, not processing;
step 1-4-3, for other symbols in the sentence block, if the symbol is a part of the named entity, combining the punctuation symbol and the related context into a semantic entity, otherwise, taking the punctuation symbol as a mark for dividing the physical sentence block;
step 2, performing chapter-level semantic analysis on the text data;
step 2-1, carrying out semantic analysis on continuous texts in the chapters by using known linguistic knowledge, and respectively generating a list consisting of an analytic tree for each continuous text block;
step 2-2, combining the information structure of the text data, the categories of terms playing specific roles, and the category of the text data, complex semantics are decomposed into a hierarchical semantic structure, as in steps 2-3 and 2-4 below. An example of the hierarchical semantic structure is shown in fig. 4: the text in the figure, "Edison invented the incandescent lamp that illuminates the night", actually nests "Edison invented the incandescent lamp" and "the incandescent lamp illuminates the night". Specifically, "(fact) Edison, invented, the incandescent lamp that illuminates the night" constitutes the first layer of meaning, in which "Edison" is a fact and "the incandescent lamp" is a fact, while "that illuminates the night" constitutes a nested sub-layer about "the incandescent lamp"; in other words, "the incandescent lamp that illuminates the night" is a phrase with "the incandescent lamp" as its head word. The point at which the two layers of semantics are coupled together is a nested point;
Step 2-3, obtaining the hierarchical semantic structure, wherein each hierarchy comprises a plurality of semantic blocks related to facts or concepts;
2-4, following post-order traversal, query and similar operations are performed first on the semantic blocks of the nested layers and their extensions are determined, and so on for the remaining semantic blocks;
2-5, as shown in fig. 3, entity alignment is performed: first judge, by entity name, whether an entity of the same name exists in the entity library; if not, generate a new entity pair and add it to the library; otherwise obtain all same-name entity pairs, compute the similarity between the target entity pair and each obtained pair, comprehensively score and rank the candidates according to the respective similarities of category labels, attribute labels, and unstructured-text keywords, and, if the best score is below a threshold, add the target entity to the library; otherwise select the highest-scoring result as the alignment result of the target entity;
step 2-6, extracting for each entity its syntactically nearest dependency verb; the specific steps are steps 2-7 to 2-10 below;
step 2-7, extracting the dependency-associated nodes e_i' and e_j' that stand in a coordinate structure or attributive structure with entities e_i and e_j respectively, as in algorithm 2-1;
step 2-8, extracting the verb V_j nearest in dependency relation to the dependency-associated node e_j' of the 2nd entity e_j, as in algorithm 2-2;
step 2-9, obtaining the verb V_i nearest to the dependency-associated node e_i' of the 1st entity e_i in a subject-predicate relation or fronted-object relation, as in algorithm 2-3;
step 2-10, by judging whether the verbs V_i and V_j are the same verb or stand in a coordinate-structure relation, the dependency verb DV of the entity pair <e_i, e_j> is determined, from which a triple can be determined;
algorithm 2-1, extracting an entity's dependency-associated node (presented as an image in the original patent: Figure BDA0003019635140000091)
algorithm 2-2, extracting the verb closest in dependency relationship to the 2nd entity (image: Figure BDA0003019635140000092)
algorithm 2-3, extracting the verb nearest to the 1st entity through a subject-predicate or preposed-object relation (image: Figure BDA0003019635140000093)
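Since Algorithms 2-1 to 2-3 appear only as images in the published text, the following is a speculative reconstruction from the prose of steps 2-7 to 2-10; the relation labels "COO" (coordination), "ATT" (attributive/centered), "SBV" (subject-predicate) and "FOB" (preposed object) follow a common Chinese dependency scheme and are assumptions:

```python
class Node:
    """A dependency-tree node with a pointer to its parent (head)."""
    def __init__(self, word, pos, parent=None, relation=None):
        self.word, self.pos = word, pos        # pos == "v" marks a verb
        self.parent, self.relation = parent, relation

def assoc_node(e):
    """Algorithm 2-1: climb while e sits in a parallel/attributive structure."""
    node = e
    while node.relation in ("COO", "ATT"):
        node = node.parent
    return node

def nearest_verb(e_assoc):
    """Algorithm 2-2: nearest ancestor verb of the 2nd entity's node."""
    node = e_assoc.parent
    while node is not None:
        if node.pos == "v":
            return node
        node = node.parent
    return None

def nearest_sbv_verb(e_assoc):
    """Algorithm 2-3: nearest ancestor verb linked by SBV/FOB to entity 1."""
    node = e_assoc
    while node.parent is not None:
        if node.parent.pos == "v" and node.relation in ("SBV", "FOB"):
            return node.parent
        node = node.parent
    return None

def dependent_verb(vi, vj):
    """Step 2-10: same verb, or one coordinated (COO) with the other."""
    if vi is vj:
        return vj
    if vj.relation == "COO" and vj.parent is vi:
        return vj
    if vi.relation == "COO" and vi.parent is vj:
        return vi
    return None
```

For the sentence "Edison invented the lamp", both entities resolve to the verb "invented", which then serves as the dependent verb DV of the pair.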
Step 3, carrying out heuristic learning by adopting a multi-round iteration mode, and constructing an event semantic model;
step 3-1, performing hierarchical semantic analysis on the text data, and generating mapping knowledge, identification knowledge and association knowledge according to a hierarchical semantic structure;
step 3-2, generating extraction knowledge according to the parse trees generated from the end-to-end samples and the parameter mappings, specifically comprising the following steps:
step 3-2-1, independently constructing a mapping rule for each semantic level with parameter mapping;
step 3-2-2, if parameter mappings at different levels exist in the same parse tree, constructing an identification rule containing these levels according to the nesting points;
step 3-2-3, if parameter mappings related to the same target structure exist in different parse trees, trying to construct a cross-sentence-block recognition rule according to the association points;
step 3-2-4, if a plurality of sentence blocks (i.e., correspondingly, a plurality of parse trees) appear in the end-to-end sample, the sentence blocks contain parameter mappings, and the sample does not provide association-marking information about them, actively prompting the user to supplement the corresponding association marks;
step 3-2-5, if the central word of a certain level is extracted while the other components in that level are not extracted, the level can be ignored;
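The rule-construction decisions of steps 3-2-1 to 3-2-4 can be sketched as a small dispatcher; representing a parse tree as a list of (level, parameter) mappings is invented purely for illustration and is not the patent's data structure:

```python
def build_rules(parse_trees, has_association_marks=True):
    """parse_trees: one list of (level, parameter) mappings per parse tree."""
    rules, prompts = [], []
    for tree in parse_trees:
        for level, param in tree:            # 3-2-1: one mapping rule per
            rules.append(("mapping", level, param))  # mapped semantic level
        levels = {level for level, _ in tree}
        if len(levels) > 1:                  # 3-2-2: mappings at different
            rules.append(("identification",  # levels of the same tree
                          tuple(sorted(levels))))
    mapped_trees = [t for t in parse_trees if t]
    if len(mapped_trees) > 1:                # mappings span sentence blocks
        if has_association_marks:            # 3-2-3: cross-block rule
            rules.append(("cross-block identification",))
        else:                                # 3-2-4: ask for the marks
            prompts.append("please supplement association marks")
    return rules, prompts
```

A tree with mappings at levels 1 and 2 yields two mapping rules plus one nested identification rule; two mapped trees without association marks trigger the user prompt of step 3-2-4.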
step 4, extracting the triples based on the end-to-end samples;
step 4-1, obtaining primary first-order logic formulas according to the hierarchical semantic structure of the input text;
step 4-2, performing correlation reasoning by using the first-order logic formulas, and realizing variable unification of the first-order logic formulas by using the default, reference and co-reference relations among contexts, obtaining unified first-order logic formulas after default recovery, reference resolution and entity unification;
step 4-3, performing mapping reasoning by using the unified first-order logic formulas, wherein each independent first-order logic formula may generate a plurality of native target structure fragments;
step 4-4, performing identification reasoning by using the integrated first-order logic formulas or the native target structure fragments to obtain a plurality of coupled target structure fragments;
step 4-5, if two coupled target structure fragments that are adjacent or overlapping in position have the same predicate, and either their parameters are completely different or the values of their shared parameters are identical, directly combining the two into a larger target structure as the final output; otherwise, executing step 4-6;
step 4-6, regarding the two as different target structure instances of the same predicate and taking them as final outputs;
step 4-7, repeating steps 4-5 and 4-6 until no new, larger coupled target fragments are generated; all target structure instances obtained are the final outputs.
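The fragment-combination loop of steps 4-5 to 4-7 can be sketched as follows; modeling a coupled target structure fragment as a (predicate, parameters, text span) tuple is an assumption about the data shape, not the patent's representation:

```python
def mergeable(a, b):
    """Step 4-5: same predicate, adjacent/overlapping spans, and either
    disjoint parameters or identical values on every shared parameter."""
    (pa, ka, sa), (pb, kb, sb) = a, b
    if pa != pb:
        return False
    if sa[1] < sb[0] - 1 or sb[1] < sa[0] - 1:   # not adjacent/overlapping
        return False
    shared = ka.keys() & kb.keys()
    return all(ka[k] == kb[k] for k in shared)

def combine_all(fragments):
    """Steps 4-5 to 4-7: merge until no new, larger fragment appears."""
    frags = list(fragments)
    changed = True
    while changed:
        changed = False
        for i in range(len(frags)):
            for j in range(i + 1, len(frags)):
                if mergeable(frags[i], frags[j]):
                    p, ka, sa = frags[i]
                    _, kb, sb = frags[j]
                    frags[i] = (p, {**ka, **kb},
                                (min(sa[0], sb[0]), max(sa[1], sb[1])))
                    del frags[j]
                    changed = True
                    break
            if changed:
                break
    return frags  # unmergeable fragments stay as separate instances (4-6)
```

Two "invent" fragments carrying a subject and an object over adjacent spans combine into one larger structure, while fragments with conflicting parameter values remain distinct target structure instances of the same predicate.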
Step 5, some applications of the triple knowledge extracted in steps 3 and 4 are as follows. Intelligent search: for example, a Baidu search for the current president of the United States mainly displays a certain president A and a certain president B, which indicates that the retrieval technology still needs further improvement. Intelligent question answering can be regarded as an extension of semantic search; applied in a chat robot, it can provide not only scene conversation but also knowledge of various industries; the knowledge graph it relies on is an open-domain knowledge graph whose knowledge is very broad, so it can both provide daily knowledge to the user and conduct chat-style conversation. A personalized recommendation system analyzes the social relations among users and the association relations between users and products by collecting users' interests, preferences and attributes as well as product classifications, attributes and contents, and infers the users' preferences and demands with a personalization algorithm, thereby recommending products or contents of interest to the users. Assisted decision-making analyzes and processes the knowledge in the knowledge graph and, through logical reasoning under certain rules, reaches conclusions that provide support for the user's decisions.
The present invention provides an improved chapter-level triple information extraction method, and there are many specific methods and approaches for implementing this technical solution; the above description is only a preferred embodiment of the present invention. It should be noted that, for those skilled in the art, a number of improvements and refinements can be made without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention. All components not specified in the embodiments can be realized by the prior art.

Claims (10)

1. An improved chapter-level triple information extraction method is characterized by comprising the following steps:
step 1, preprocessing text data;
step 2, performing chapter-level semantic analysis on the text data;
step 3, carrying out heuristic learning by adopting a multi-round iteration mode, and constructing an event semantic model;
and 4, extracting the triples based on the end-to-end samples.
2. The method of claim 1, wherein step 1 comprises the steps of:
step 1-1, converting a text data format;
step 1-2, preprocessing and cleaning the text data after format conversion by using a natural language processing technology;
step 1-3, text data chapter structure processing: splitting a long document into text blocks;
and 1-4, splitting text data sentence blocks, and further splitting the text blocks into physical sentence blocks with punctuation intervals.
3. The method of claim 2, wherein steps 1-2 comprise: sequentially performing the following processing on the text data after format conversion: full-width to half-width conversion, conversion of uppercase numerals to lowercase numerals, conversion of uppercase letters to lowercase letters, removal of emoticons, removal of all non-Chinese characters in the text so that only Chinese is retained, Chinese word segmentation, traditional-to-simplified Chinese conversion, and filtering of Chinese stop words.
4. The method of claim 3, wherein steps 1-4 comprise:
step 1-4-1, for the parentheses in the text block, if the content in the parentheses is in close semantic relation with the left adjacent component, merging the content in the parentheses and the left adjacent component into a semantic component, otherwise, not processing the parentheses;
step 1-4-2, for quotation marks in the sentence block, if the quotation mark body belongs to one part of a named entity, merging the quotation mark body with the named entity, otherwise, not processing;
and 1-4-3, for other symbols in the sentence block, if the symbols are part of the named entity, combining the other symbols in the sentence block and the related context into a semantic entity, and otherwise, taking the other symbols in the sentence block as marks for dividing the physical sentence block.
5. The method of claim 4, wherein step 2 comprises the steps of:
step 2-1, performing semantic analysis on continuous texts in chapters by using known syntactic and syntactic knowledge of linguistics, and respectively generating a list consisting of parse trees for each continuous text block;
step 2-2, combining the information structure of the text data, the category of the terms playing a specific role and the category of the text data, and decomposing the complex semantics into a hierarchical semantic structure;
step 2-3, entity alignment is carried out;
and 2-4, extracting the latest syntactic dependency verb by the entity.
6. The method according to claim 5, wherein in step 2-2, each level in the hierarchical semantic structure contains N semantic blocks related to facts or concepts, and N is a natural number; and according to the sequence of subsequent traversal, preferentially executing query operation on the semantic blocks of the nested layer to determine the extension of the nested layer, and after the processing of the nested layer is finished, executing query operation on the semantic blocks of other facts or concepts to determine the extension of each semantic block.
7. The method of claim 6, wherein steps 2-3 comprise:
judging, according to the entity name, whether an entity with the same name exists in a pre-established entity library; if not, generating a new entity pair and adding it to the entity library; otherwise, obtaining all same-name entity pairs, calculating the similarity between the target entity pair and each obtained entity pair, and comprehensively scoring and ranking the candidates according to the similarities of the category labels, attribute labels and unstructured-text keywords; if the highest score is less than a threshold value, adding the target entity to the entity library; otherwise, selecting the highest-scoring result as the alignment result of the target entity.
8. The method of claim 7, wherein steps 2-4 comprise:
step 2-4-1, letting the two different entities be ei and ej, and respectively extracting, by the following method, the dependency-associated nodes e'i and e'j of ei and ej reached through parallel (coordinate) or attributive (centered) structures: setting the current node to the parent node of the entity e; while the dependency relationship of the current node is a parallel-structure or attributive-structure relationship, continuing to traverse upward to its parent node; when the dependency relationship is no longer a parallel-structure or attributive-structure relationship, returning the current node;
step 2-4-2, extracting, by the following method, the verb Vj whose dependency relationship is closest to the dependency-associated node e'j of the 2nd entity ej: initializing the return value to null and setting the current node to the parent node of e'j; while the current node is not the root node, performing the judgment: if the current node is a verb node, it is the verb closest in dependency relationship to the entity e; ending the loop and returning this verb node as the sought nearest-dependency verb; otherwise, setting the parent node of the current node as the current node and continuing the judgment;
step 2-4-3, obtaining, by the following method, the verb Vi that is closest to the dependency-associated node e'i of the 1st entity ei through a subject-predicate relation or a preposed-object relation: initializing the return value to null and setting the current node to the parent node of e'i; while the current node is not the root node, performing the judgment: if the current node is a verb node and has a subject-predicate or preposed-object relation with the entity, it is the verb closest in dependency relationship to the entity e; ending the loop and returning this verb node as the sought nearest verb; otherwise, setting the parent node of the current node as the current node and continuing the judgment;
step 2-4-4, by judging whether the verbs Vi and Vj are the same verb or stand in a parallel-structure relationship, determining the dependent verb DV of the entity pair <ei, ej>, thereby determining a triple.
9. The method of claim 8, wherein step 3 comprises the steps of:
step 3-1, performing hierarchical semantic analysis on the text data, and generating mapping knowledge, identification knowledge and association knowledge according to a hierarchical semantic structure;
step 3-2, generating extraction knowledge according to the parsing tree generated by the training corpus and the parameter mapping, specifically comprising:
step 3-2-1, independently constructing a mapping rule for each semantic level with parameter mapping;
3-2-2, if parameter mappings at different levels exist in the same parse tree, constructing an identification rule containing the levels according to the nested points;
3-2-3, if parameter mapping related to the same target structure exists in different parse trees, constructing a cross sentence block identification rule according to the association points;
step 3-2-4, if two or more sentence blocks appear in the end-to-end sample, the sentence blocks contain parameter mappings, and the sample does not provide association-marking information about the sentence blocks, actively prompting the user to supplement the corresponding association marks;
and 3-2-5, if the center component modified and limited by the modifier in one layer is extracted, and other components in the layer are not extracted, the layer is not processed.
10. The method of claim 9, wherein step 4 comprises the steps of:
step 4-1, obtaining a primary first-order logic formula according to the hierarchical semantic structure of the input text;
step 4-2, performing correlation reasoning by using a first-order logic formula, and realizing variable unification of the first-order logic formula by using default, reference and unification relations among contexts to obtain a unified first-order logic formula after default recovery, reference resolution and entity unification;
step 4-3, performing mapping reasoning by using the unified first-order logic formulas, wherein each independent first-order logic formula may generate a native target structure fragment;
step 4-4, performing identification reasoning by using the integrated first-order logic formulas or the native target structure fragments to obtain coupled target structure fragments;
step 4-5, if two coupled target structure fragments that are adjacent or overlapping in position have the same predicate, and either the subjects and objects corresponding to the predicates in the text phrases are completely different or the values of their shared parameters are identical, directly combining the two into a larger target structure as the final output; otherwise, executing step 4-6;
step 4-6, regarding two coupled target structure segments with adjacent or overlapped positions as different target structure examples of the same predicate, and taking the different target structure examples as final output;
and 4-7, repeating the steps 4-5 and 4-6 until no new and larger coupling target segments are generated, and obtaining all target structure examples which are final outputs.
CN202110399643.8A 2021-04-14 2021-04-14 Improved chapter-level triple information extraction method Active CN113312922B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110399643.8A CN113312922B (en) 2021-04-14 2021-04-14 Improved chapter-level triple information extraction method


Publications (2)

Publication Number Publication Date
CN113312922A true CN113312922A (en) 2021-08-27
CN113312922B CN113312922B (en) 2023-10-24

Family

ID=77372136

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110399643.8A Active CN113312922B (en) 2021-04-14 2021-04-14 Improved chapter-level triple information extraction method

Country Status (1)

Country Link
CN (1) CN113312922B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015080561A1 (en) * 2013-11-27 2015-06-04 Mimos Berhad A method and system for automated relation discovery from texts
CN109446338A (en) * 2018-09-20 2019-03-08 大连交通大学 Drug disease relationship classification method neural network based
CA3060811A1 (en) * 2018-10-31 2020-04-30 Royal Bank Of Canada System and method for cross-domain transferable neural coherence model
CN111274790A (en) * 2020-02-13 2020-06-12 东南大学 Chapter-level event embedding method and device based on syntactic dependency graph
CN111597351A (en) * 2020-05-14 2020-08-28 上海德拓信息技术股份有限公司 Visual document map construction method

Non-Patent Citations (2)

Title
LIU Yitong: "Discourse-level event representation and relevance computation", China Master's Theses Full-text Database (Information Science and Technology), no. 02, pages 138-2371 *
HUANG Peixin et al.: "End-to-end joint knowledge triple extraction incorporating adversarial training", Journal of Computer Research and Development, vol. 56, no. 12, pages 2536-2548 *

Cited By (5)

Publication number Priority date Publication date Assignee Title
CN114707520A (en) * 2022-06-06 2022-07-05 天津大学 Session-oriented semantic dependency analysis method and device
CN114707520B (en) * 2022-06-06 2022-09-13 天津大学 Session-oriented semantic dependency analysis method and device
CN115081437A (en) * 2022-07-20 2022-09-20 中国电子科技集团公司第三十研究所 Machine-generated text detection method and system based on linguistic feature contrast learning
CN117094396A (en) * 2023-10-19 2023-11-21 北京英视睿达科技股份有限公司 Knowledge extraction method, knowledge extraction device, computer equipment and storage medium
CN117094396B (en) * 2023-10-19 2024-01-23 北京英视睿达科技股份有限公司 Knowledge extraction method, knowledge extraction device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN113312922B (en) 2023-10-24

Similar Documents

Publication Publication Date Title
CN110110054B (en) Method for acquiring question-answer pairs from unstructured text based on deep learning
CN109271529B (en) Method for constructing bilingual knowledge graph of Xilier Mongolian and traditional Mongolian
CN109684448B (en) Intelligent question and answer method
CN106649260B (en) Product characteristic structure tree construction method based on comment text mining
CN111209412B (en) Periodical literature knowledge graph construction method for cyclic updating iteration
CN110287494A (en) A method of the short text Similarity matching based on deep learning BERT algorithm
CN110502642B (en) Entity relation extraction method based on dependency syntactic analysis and rules
CN113312922B (en) Improved chapter-level triple information extraction method
CN101079025B (en) File correlation computing system and method
CN110609983B (en) Structured decomposition method for policy file
CN108984661A (en) Entity alignment schemes and device in a kind of knowledge mapping
CN111061882A (en) Knowledge graph construction method
CN113221559B (en) Method and system for extracting Chinese key phrase in scientific and technological innovation field by utilizing semantic features
CN114254653A (en) Scientific and technological project text semantic extraction and representation analysis method
CN112541337B (en) Document template automatic generation method and system based on recurrent neural network language model
CN113168499A (en) Method for searching patent document
CN113196277A (en) System for retrieving natural language documents
CN112926345A (en) Multi-feature fusion neural machine translation error detection method based on data enhancement training
Bounhas et al. A hybrid possibilistic approach for Arabic full morphological disambiguation
CN112733547A (en) Chinese question semantic understanding method by utilizing semantic dependency analysis
CN114997288A (en) Design resource association method
CN113343706A (en) Text depression tendency detection system based on multi-modal features and semantic rules
CN113361252B (en) Text depression tendency detection system based on multi-modal features and emotion dictionary
CN111178080A (en) Named entity identification method and system based on structured information
Sun A natural language interface for querying graph databases

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant