CN113312922A - Improved chapter-level triple information extraction method - Google Patents


Info

Publication number
CN113312922A
CN113312922A
Authority
CN
China
Prior art keywords
entity
node
verb
semantic
text
Prior art date
Legal status
Granted
Application number
CN202110399643.8A
Other languages
Chinese (zh)
Other versions
CN113312922B (en)
Inventor
李少锋
王妍妍
王玉坤
高菁
陈文颖
张春晖
Current Assignee
CETC 28 Research Institute
Original Assignee
CETC 28 Research Institute
Priority date
Filing date
Publication date
Application filed by CETC 28 Research Institute filed Critical CETC 28 Research Institute
Priority to CN202110399643.8A priority Critical patent/CN113312922B/en
Publication of CN113312922A publication Critical patent/CN113312922A/en
Application granted granted Critical
Publication of CN113312922B publication Critical patent/CN113312922B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G06F16/00: Information retrieval; database structures therefor; file system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367: Ontology
    • G06F40/10: Text processing
    • G06F40/12: Use of codes for handling textual entities
    • G06F40/151: Transformation
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides an improved chapter-level (discourse-level) triple information extraction method comprising the following steps: first, preprocessing the text data; second, performing chapter-level semantic analysis on the text data, comprising hierarchical semantic analysis, entity alignment, and dependency verb extraction; third, performing heuristic learning in a multi-round iterative manner to construct an event semantic model; fourth, extracting triples based on end-to-end samples and on chapter-level understanding; and fifth, applying the triple knowledge extracted in the third and fourth steps to intelligent retrieval, intelligent question answering, knowledge mining, decision support, and the like. The method builds a triple information extraction model from small samples, has chapter-level triple extraction capability, is easy to generalize and extend, and serves as an important basic link for large-scale text information extraction, knowledge system establishment, and construction of vertical-domain knowledge graphs.

Description

Improved chapter-level triple information extraction method
Technical Field
The invention relates to an improved chapter-level triple information extraction method.
Background
Research in natural language processing began with words and dictionaries, and the sentence has long been the core object of study; in recent years many linguists have sought to extend semantic analysis to the chapter (discourse) level in theory. Because the discourse level lacks formal markup, computational linguistics has made no particularly significant progress there. However, many semantic problems must be solved fundamentally at the chapter level, such as coreference resolution, recognition of chapter structure and semantic relations, and event fusion and relation recognition. At the same time, solving these chapter-level semantic problems in turn guides analysis at the word and sentence levels. On the other hand, recent progress in Chinese word- and sentence-level natural language processing, especially the staged results of research on word sense disambiguation, syntactic analysis, and semantic role labeling, has created the technical conditions for research on chapter semantic analysis.
Generally, Chinese sentences are long, and a single sentence often contains multiple entities, so the number of entity pairs formed from it is large and the distribution of entity types is uneven. Compared with relation detection and relation extraction over simple sentences, the complex patterns of long sentences make these tasks harder: long sentences often contain multiple entities, and sentences whose entity pairs span long distances often contain multiple verbs. Therefore, selecting the verbs that effectively indicate whether a semantic relation exists between an entity pair, and of which specific type, becomes the key to relation detection and relation extraction. The biggest challenge in current extraction is that training data are insufficient and relation instances are extremely unevenly distributed across categories. Current means of entity relation extraction are mainly based on templates, dependency syntactic parsing, and deep learning. Template-based entity relation extraction suffers from low precision and recall. Dependency-syntax-based extraction faces the problem of semantic loss. Deep-learning-based extraction has obtained good experimental results in some fields, with no obvious performance differences among models, but at the cost of labeling a large number of training and test samples for predefined relation types, where the samples are relatively simple short sentences and the samples of each relation are relatively evenly distributed. However, accurately labeling sentence-level data by hand is very expensive and requires much time and labor.
In practical scenarios, relying on manual labeling to produce training data is an almost impossible task when facing thousands of relations, tens of millions of entity pairs, and hundreds of millions of sentences. Meanwhile, in practice the frequency of relations and entity pairs follows a long-tail distribution, with a large number of relations or entity pairs for which few samples exist. The effect of neural network models must be guaranteed by large-scale annotated data; they suffer from the problem of needing ten examples to learn one case. How to improve the learning ability of deep models so that they can instead infer three cases from one example is a problem that relation extraction must solve. Furthermore, existing models extract relations between entities mainly from a single sentence, requiring that the sentence contain both entities. In fact, a large number of inter-entity relations are expressed across multiple sentences of a document, or even across multiple documents. How to extract relations in such a more complex context is another open problem. Existing task settings generally assume a predefined, closed relation set and convert the task into a relation classification problem; in this way, new relations between entities contained in the text cannot be obtained. The above approaches achieve a certain effect on test sets of relatively simple short sentences with relatively even sample distributions per relation, but in practical applications, especially chapter-level triple extraction from text, many problems remain, such as data scale, learning ability, complex context, and open relations.
If a theory and method system of chapter semantic analysis with both theoretical depth and practical feasibility can be established, it will be of great significance to the development of natural language processing research and applications.
In the information era, how to mine and establish a comprehensive and accurate knowledge system from massive text data and related reports, construct vertical-domain knowledge graphs, and support subsequent applications such as intelligent search, intelligent question answering, knowledge mining, and decision support has become a technical problem. Chapter-level triple information extraction is an effective means to this end; for knowledge extracted from chapters to be applied in industry at scale, a method is needed that can accurately extract high-quality entity associations from a small number of labeled samples.
Disclosure of Invention
The purpose of the invention is as follows: a chapter-level triple information extraction method is provided for mining and establishing a comprehensive and accurate knowledge system and knowledge graph from massive text data and related annual reports. Using natural language processing technology and machine learning algorithms, the method extracts high-quality entity associations under limited-sample conditions, builds vertical-domain knowledge graphs, strongly supports the establishment of domain knowledge systems, and assists in mining and assessing information relations.
In order to solve the above technical problem, the present invention provides an improved chapter-level triple information extraction method, including the following steps:
step 1, preprocessing text data;
step 2, performing chapter-level semantic analysis on the text data;
step 3, carrying out heuristic learning by adopting a multi-round iteration mode, and constructing an event semantic model;
and 4, extracting the triples based on the end-to-end samples.
The step 1 comprises the following steps:
step 1-1, converting the format of the text data: the acquired text data is converted, using existing natural language processing technology, into a format on which natural language processing can be performed directly, for example by extracting text from pdf and doc files;
step 1-2, preprocessing and cleaning the text data after format conversion by using a natural language processing technology;
step 1-3, text data chapter structure processing: splitting a long document into text blocks by paragraphs and periods;
and 1-4, splitting text data sentence blocks, and further splitting the text blocks into physical sentence blocks with punctuation intervals.
The step 1-2 comprises: sequentially performing the following processing on the format-converted text data: full-width to half-width conversion, conversion of uppercase numerals to lowercase numerals, conversion of uppercase letters to lowercase letters, removal of emoticons, removal of all non-Chinese characters (retaining only Chinese), Chinese word segmentation, conversion of traditional to simplified Chinese, and filtering of Chinese stop words.
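The cleaning pipeline of step 1-2 can be sketched in Python. This is a minimal illustration, not the patent's implementation: the stop-word list is a tiny illustrative stand-in, and per-character tokenization stands in for a real Chinese segmenter such as jieba.

```python
import re

STOPWORDS = {"的", "了", "和", "是"}  # illustrative stop-word list, not from the patent


def fullwidth_to_halfwidth(text: str) -> str:
    # Full-width ASCII variants occupy U+FF01..U+FF5E; full-width space is U+3000.
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


def clean(text: str) -> list:
    text = fullwidth_to_halfwidth(text)       # full-width to half-width
    text = text.lower()                       # uppercase letters to lowercase
    text = re.sub(r"[^\u4e00-\u9fff]", "", text)  # keep Chinese characters only
    tokens = list(text)                       # stand-in for a real word segmenter
    return [t for t in tokens if t not in STOPWORDS]  # stop-word filtering
```

The remaining steps of the pipeline (emoticon removal, traditional-to-simplified conversion) would slot in as further passes of the same shape.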
The steps 1 to 4 comprise:
step 1-4-1, for parentheses in a text block: if the content inside the parentheses is semantically close to the component adjacent on its left (semantic components within the same semantic fragment are close, while those of different fragments are not; for example, the subject and object of fragment 1 are close, while the subject of fragment 1 and the object of fragment 2 are not), the parenthesised content and the adjacent text component on the left are merged into one semantic component; otherwise the parentheses are not processed;
step 1-4-2, for quotation marks in a sentence block: if the quoted body belongs to a named entity (a named entity is an entity with specific significance in the text, mainly including person names, place names, organization names, proper nouns, and the like; a named-entity library can be established), the quoted body is merged with the named entity; otherwise no processing is performed;
and 1-4-3, for other symbols in a sentence block: if the symbol is part of a named entity (such as the separator dot in foreign names, or the title marks added to some book titles), the symbol and its related context are merged into one semantic entity; otherwise the symbol is used as a mark dividing physical sentence blocks.
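A minimal sketch of the physical sentence-block splitting in steps 1-4-1 to 1-4-3, handling only the parenthesis rule: parenthesised content stays attached to the component on its left rather than being split off. The punctuation set and the placeholder-protection trick are illustrative assumptions, not from the patent.

```python
import re

CLAUSE_PUNCT = "，。；！？"  # assumed Chinese clause-level punctuation


def split_sentence_blocks(block: str) -> list:
    """Split a text block into punctuation-separated physical sentence blocks,
    keeping full-width parenthesised spans attached to their left neighbour."""
    # Protect commas inside （…） so they do not trigger a split.
    protected = re.sub(
        r"（[^）]*）", lambda m: m.group(0).replace("，", "\x00"), block
    )
    parts = re.split(f"[{CLAUSE_PUNCT}]", protected)
    return [p.replace("\x00", "，") for p in parts if p]
```

Quotation marks and other named-entity-internal symbols (step 1-4-2/1-4-3) would be protected the same way before splitting.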
The step 2 comprises the following steps:
step 2-1, performing semantic analysis on continuous text in the chapter using known lexical and syntactic knowledge from linguistics, generating for each continuous text block a list composed of parse trees;
step 2-2, decomposing complex semantics into a hierarchical semantic structure by combining the information structure of the text data, the category of the terms playing a specific role and the category of the text data;
step 2-3, entity alignment is carried out;
and 2-4, extracting for each entity its syntactically nearest dependency verb.
In step 2-2, each level in the hierarchical semantic structure comprises N semantic blocks concerning facts or concepts, where N is a natural number. Following post-order traversal, query operations are performed first on the semantic blocks of nested layers (a nested layer is a semantic block whose semantics contain further nested semantics; step 2-2 decomposes complex semantics into a hierarchical structure in which multiple layers of semantics may be nested) to determine their extensions; after the nested layers are processed, query operations are performed on the remaining fact or concept semantic blocks to determine the extension of each.
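The post-order policy above, nested sub-layers resolved before the blocks that contain them, can be sketched as follows. The `SemanticBlock` class is illustrative, and the example mirrors the Edison sentence used in fig. 4.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative sketch: resolve nested semantic blocks before their parents,
# so each block's extension is fixed only after its sub-layers are known.


@dataclass
class SemanticBlock:
    label: str                                  # the fact/concept this block denotes
    children: List["SemanticBlock"] = field(default_factory=list)


def resolve_extensions(block: SemanticBlock, order: list) -> None:
    for child in block.children:                # nested layers first (post-order)
        resolve_extensions(child, order)
    order.append(block.label)                   # then the enclosing block


# Example from fig. 4: "Edison invented the incandescent lamp that illuminates
# the night" nests a sub-layer about "the incandescent lamp".
lamp = SemanticBlock("incandescent lamp", [SemanticBlock("illuminates the night")])
top = SemanticBlock("Edison invented ...", [SemanticBlock("Edison"), lamp])
order = []
resolve_extensions(top, order)
```

After the call, `order` lists every nested block before the block that contains it.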
The step 2-3 comprises the following steps:
judging, by entity name, whether an entity of the same name exists in a pre-established entity library. If not, a new entity pair is generated and added to the library. Otherwise, all same-name entity pairs are obtained, the similarity between the target entity pair and each of them is computed, and the candidates are comprehensively scored and ranked according to the respective similarities of category labels, attribute labels, and unstructured-text keywords. If the best score is below a threshold (the threshold cannot be fixed in advance and must be tuned to the specific situation), the target entity is added to the library; otherwise the highest-scoring result is selected as the alignment result of the target entity. Entity alignment determines whether two or more entities from different sources refer to the same real-world object. If multiple entities represent the same object, an alignment relation is constructed among them, and the information they contain is fused and aggregated. The target entity is an entity extracted from the text; the purpose here is to determine whether an entity in the text and an entity in the library have a co-reference relationship.
The steps 2-4 comprise:
step 2-4-1, let the two distinct entities be e_i and e_j. Their dependency-associated nodes e_i' and e_j', which stand in a coordinate structure or attributive (modifier-head) structure with e_i and e_j respectively, are extracted as follows: set the current node to the parent of e; while the dependency relation of the current node is a coordinate-structure or attributive-structure relation, continue traversing to its parent; when it no longer is, return the current node;
step 2-4-2, the verb V_j nearest in dependency relation to the dependency-associated node e_j' of the 2nd entity e_j is extracted as follows: initialize the return value to null and set the current node to the parent of e. While the current node is not the root node, judge: if it is a verb node, it is the verb nearest to entity e in the dependency relation; end the loop and return that verb node as the sought nearest verb. Otherwise set the parent of the current node as the current node and continue judging;
step 2-4-3, the verb V_i nearest to the dependency-associated node e_i' of the 1st entity e_i in a subject-predicate relation or fronted-object relation is obtained as follows: initialize the return value to null and set the current node to the parent of e. While the current node is not the root node, judge: if it is a verb node and it stands in a subject-predicate or fronted-object relation with the entity, it is the verb nearest to entity e in the dependency relation; end the loop and return that verb node. Otherwise set the parent of the current node as the current node and continue judging;
step 2-4-4, by judging whether the verbs V_i and V_j are the same verb or stand in a coordinate-structure relation, the dependency verb DV of the entity pair <e_i, e_j> is determined, and thereby a triple is determined.
The step 3 comprises the following steps:
step 3-1, performing hierarchical semantic analysis on the text data, and generating mapping knowledge, identification knowledge and association knowledge according to a hierarchical semantic structure;
step 3-2, generating extraction knowledge according to the parsing tree generated by the training corpus and the parameter mapping, specifically comprising: step 3-2-1, independently constructing a mapping rule for each semantic level with parameter mapping; the mapping rule refers to a rule from a specific semantic level to a target structure segment;
step 3-2-2, if parameter mappings at different levels exist in the same parse tree, constructing an identification rule containing the levels according to the nested points (preferentially utilizing a target structure to construct the identification rule, and utilizing a semantic structure instead when the target structure cannot be utilized); a nesting point refers to a sentence of text containing a plurality of semantic phrases; the identification rule refers to that for the same target structure, the parse trees with the default components and the reference components can complete the components by contrasting with the complete parse tree;
3-2-3, if parameter mapping related to the same target structure exists in different parse trees, constructing a cross sentence block identification rule according to the association points; the association point is a connection point formed by default and reference relations among different sentence blocks, namely a precedent and a reference in the reference, and a precedent and a default in the default;
3-2-4, if two or more sentence blocks appear in the end-to-end sample, the sentence blocks contain parameter mappings, and the sample provides no associated marking information about them, the user is actively prompted to supplement the corresponding association marks;
and 3-2-5, if the center component modified and limited by the modifier in one layer is extracted, and other components in the layer are not extracted, the layer is not processed.
The step 4 comprises the following steps:
step 4-1, obtaining a primary first-order logic formula according to the hierarchical semantic structure of the input text;
step 4-2, performing association reasoning by using a first-order logic formula (the first-order logic formula is obtained by text semantic analysis, can be a rule or a fact), and realizing the variable unification of the first-order logic formula by using the default, reference and unification relations among contexts to obtain a unified first-order logic formula after default recovery, reference resolution and entity unification;
4-3, mapping reasoning is performed using the unified first-order logic formulas; each independent first-order logic formula may generate a primary target structure fragment;
4-4, identification reasoning is performed using the integrated first-order logic formulas or the primary target structure fragments to obtain coupled target structure fragments;
step 4-5, if two position-adjacent or overlapping coupled target structure fragments have the same predicate, but the subjects and objects the predicate takes from the text phrases are completely different, or the values of any shared parameters are the same, the two fragments are directly merged into a larger target structure as the final output; otherwise, step 4-6 is executed;
step 4-6, regarding two coupled target structure segments with adjacent or overlapped positions as different target structure examples of the same predicate, and taking the different target structure examples as final output;
and 4-7, repeating the steps 4-5 and 4-6 until no new and larger coupling target segments are generated, and obtaining all target structure examples which are final outputs.
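One loose reading of the merge rule in steps 4-5 and 4-6 can be sketched as follows: fragments with the same predicate merge when their argument slots are complementary or agree, and stay separate instances when a shared slot conflicts. The fragment representation (a predicate plus an argument dictionary) is an assumption for illustration.

```python
def merge_fragments(f1: dict, f2: dict):
    """Merge two coupled target structure fragments with the same predicate
    if their argument slots do not conflict; return None when they must stay
    separate target-structure instances (step 4-6). Representation assumed."""
    if f1["predicate"] != f2["predicate"]:
        return None
    merged = dict(f1["args"])
    for slot, value in f2["args"].items():
        if slot in merged and merged[slot] != value:
            return None                      # conflicting values: two instances
        merged[slot] = value
    return {"predicate": f1["predicate"], "args": merged}
```

Step 4-7 would apply this repeatedly until no larger fragment is produced.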
The invention further comprises a step 5: applying the triple knowledge extracted in steps 3 and 4, for example to intelligent retrieval, intelligent question answering, knowledge mining, and decision support.
Compared with the prior art, the invention has the following remarkable advantages:
(1) the method adopts a hierarchical semantic analysis technology based on semantic patterns and uses it to realize heuristic learning on end-to-end samples, achieving the effect of inferring three cases from one example; it extracts triple information on the basis of chapter-level understanding, ensuring that the extracted triples are complete and usable;
(2) small-sample training is realized through heuristic learning. Because the knowledge used in the event semantic model is based on semantic patterns, and semantic patterns are highly reused in natural language expression, one end-to-end sample can contribute highly reusable extraction knowledge; training can therefore be completed without a huge number of samples, effectively alleviating the lack of effective samples.
(3) The method is based on chapter-level semantic analysis and is extensible: it can extract not only binary relations (triples) but also multivariate relations;
(4) the method has high accuracy and recall rate, and is an effective means for forming a high-quality knowledge map in the vertical field and realizing intelligent analysis of the field knowledge.
Drawings
The foregoing and/or other advantages of the invention will become further apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.
FIG. 1 is a block flow diagram of the present invention.
FIG. 2 is a flow chart of the text data preprocessing of the present invention.
Fig. 3 is a flow chart of entity alignment of the present invention.
FIG. 4 is an exemplary diagram of a hierarchical semantic structure of the present invention.
Detailed Description
Aiming at problems common in existing triple extraction, such as inaccurately extracted information, large training-sample requirements, and high cost, the method adopts a hierarchical semantic analysis technology based on semantic patterns to establish an event semantic model, effectively capturing the entity relations and information structures contained in text. Heuristic learning reduces the number of required samples and realizes chapter-level triple information extraction, which can effectively solve or alleviate the problems of data scale, learning ability, complex context, and open relations, and can form high-quality vertical-domain knowledge graphs. The invention provides an improved chapter-level triple information extraction method, as shown in fig. 1, including:
step 1, preprocessing text data;
step 1-1, converting the format of text data, and extracting effective text content from documents in pdf, docx and other formats;
and 1-2, preprocessing and cleaning the format-converted text data using natural language processing technology. The converted text may contain useless information such as advertisements and special characters without practical significance; it is preprocessed with natural language processing technology, including: full-width to half-width conversion, conversion of uppercase numerals to lowercase numerals, conversion of uppercase letters to lowercase letters, removal of emoticons, removal of all non-Chinese characters (retaining only Chinese), Chinese word segmentation, conversion of traditional to simplified Chinese, and filtering of Chinese stop words; the preprocessing flow is shown in figure 2;
step 1-3, text data chapter structure processing, splitting a longer document into a plurality of text blocks (knowledge points);
step 1-4, sentence-block splitting of the text data: the text blocks are further split into physical sentence blocks separated by punctuation, specifically comprising:
step 1-4-1, for parentheses in a sentence block, if the content inside the parentheses is in a close coupling relation with the adjacent component on the left, the two are merged into one semantic component; otherwise the parentheses are not processed;
step 1-4-2, for quotation marks in the sentence block, if the quotation mark body belongs to a part of a certain named entity, merging the quotation mark body with the named entity, otherwise, not processing;
step 1-4-3, for other symbols in the sentence block, if the symbol is a part of the named entity, combining the punctuation symbol and the related context into a semantic entity, otherwise, taking the punctuation symbol as a mark for dividing the physical sentence block;
step 2, performing chapter-level semantic analysis on the text data;
step 2-1, carrying out semantic analysis on continuous texts in the chapters by using known linguistic knowledge, and respectively generating a list consisting of an analytic tree for each continuous text block;
step 2-2, combining the information structure of the text data, the categories of terms playing specific roles, and the category of the text data, complex semantics are decomposed into a hierarchical semantic structure, as in steps 2-3 and 2-4 below. An example of the hierarchical semantic structure is shown in fig. 4: the text in the figure, "Edison invented the incandescent lamp that illuminates the night", actually nests "Edison invented the incandescent lamp" and "the incandescent lamp illuminates the night". Specifically, "(fact) Edison, invented, the incandescent lamp that illuminates the night" constitutes the first layer of meaning, in which "Edison" is a fact and "the incandescent lamp" is a fact, while "that illuminates the night" constitutes a nested sub-layer about "the incandescent lamp"; in other words, "the incandescent lamp that illuminates the night" is a phrase with "the incandescent lamp" as its head word. The point at which the two layers of semantics are coupled together is a nested point;
Step 2-3, obtaining the hierarchical semantic structure, wherein each hierarchy comprises a plurality of semantic blocks related to facts or concepts;
2-4, following post-order traversal, query and similar operations are performed first on the semantic blocks of the nested layers and their extensions are determined, and so on for the remaining semantic blocks;
2-5, as shown in fig. 3, entity alignment is performed: first judge, by entity name, whether an entity of the same name exists in the entity library; if not, generate a new entity pair and add it to the library; otherwise obtain all same-name entity pairs, compute the similarity between the target entity pair and each obtained pair, comprehensively score and rank the candidates according to the respective similarities of category labels, attribute labels, and unstructured-text keywords, and, if the best score is below a threshold, add the target entity to the library; otherwise select the highest-scoring result as the alignment result of the target entity;
step 2-6, extracting for each entity its syntactically nearest dependency verb; the specific steps are steps 2-7 to 2-10 below;
step 2-7, extracting the dependency-associated nodes e_i' and e_j' that stand in a coordinate structure or attributive structure with entities e_i and e_j respectively, as in algorithm 2-1;
step 2-8, extracting the verb V_j nearest in dependency relation to the dependency-associated node e_j' of the 2nd entity e_j, as in algorithm 2-2;
step 2-9, obtaining the verb V_i nearest to the dependency-associated node e_i' of the 1st entity e_i in a subject-predicate relation or fronted-object relation, as in algorithm 2-3;
step 2-10, by judging whether the verbs V_i and V_j are the same verb or stand in a coordinate-structure relation, the dependency verb DV of the entity pair <e_i, e_j> is determined, from which a triple can be determined;
algorithm 2-1, extracting an entity's dependency-associated node (presented as an image in the original patent: Figure BDA0003019635140000091)
algorithm 2-2, extracting the verb closest in dependency relationship to the 2nd entity (image: Figure BDA0003019635140000092)
algorithm 2-3, extracting the verb nearest to the 1st entity through a subject-predicate or preposed-object relation (image: Figure BDA0003019635140000093)
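Since Algorithms 2-1 to 2-3 appear only as images in the published text, the following is a speculative reconstruction from the prose of steps 2-7 to 2-10; the relation labels "COO" (coordination), "ATT" (attributive/centered), "SBV" (subject-predicate) and "FOB" (preposed object) follow a common Chinese dependency scheme and are assumptions:

```python
class Node:
    """A dependency-tree node with a pointer to its parent (head)."""
    def __init__(self, word, pos, parent=None, relation=None):
        self.word, self.pos = word, pos        # pos == "v" marks a verb
        self.parent, self.relation = parent, relation

def assoc_node(e):
    """Algorithm 2-1: climb while e sits in a parallel/attributive structure."""
    node = e
    while node.relation in ("COO", "ATT"):
        node = node.parent
    return node

def nearest_verb(e_assoc):
    """Algorithm 2-2: nearest ancestor verb of the 2nd entity's node."""
    node = e_assoc.parent
    while node is not None:
        if node.pos == "v":
            return node
        node = node.parent
    return None

def nearest_sbv_verb(e_assoc):
    """Algorithm 2-3: nearest ancestor verb linked by SBV/FOB to entity 1."""
    node = e_assoc
    while node.parent is not None:
        if node.parent.pos == "v" and node.relation in ("SBV", "FOB"):
            return node.parent
        node = node.parent
    return None

def dependent_verb(vi, vj):
    """Step 2-10: same verb, or one coordinated (COO) with the other."""
    if vi is vj:
        return vj
    if vj.relation == "COO" and vj.parent is vi:
        return vj
    if vi.relation == "COO" and vi.parent is vj:
        return vi
    return None
```

For the sentence "Edison invented the lamp", both entities resolve to the verb "invented", which then serves as the dependent verb DV of the pair.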
Step 3, carrying out heuristic learning by adopting a multi-round iteration mode, and constructing an event semantic model;
step 3-1, performing hierarchical semantic analysis on the text data, and generating mapping knowledge, identification knowledge and association knowledge according to a hierarchical semantic structure;
step 3-2, generating extraction knowledge according to the parse trees generated from the end-to-end samples and the parameter mappings, specifically comprising the following steps:
step 3-2-1, independently constructing a mapping rule for each semantic level with parameter mapping;
step 3-2-2, if parameter mappings at different levels exist in the same parse tree, constructing an identification rule containing these levels according to the nesting points;
step 3-2-3, if parameter mappings related to the same target structure exist in different parse trees, trying to construct a cross-sentence-block recognition rule according to the association points;
step 3-2-4, if a plurality of sentence blocks (i.e., correspondingly, a plurality of parse trees) appear in the end-to-end sample, the sentence blocks contain parameter mappings, and the sample does not provide association-marking information about them, actively prompting the user to supplement the corresponding association marks;
step 3-2-5, if the central word of a certain level is extracted while the other components in that level are not extracted, the level can be ignored;
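The rule-construction decisions of steps 3-2-1 to 3-2-4 can be sketched as a small dispatcher; representing a parse tree as a list of (level, parameter) mappings is invented purely for illustration and is not the patent's data structure:

```python
def build_rules(parse_trees, has_association_marks=True):
    """parse_trees: one list of (level, parameter) mappings per parse tree."""
    rules, prompts = [], []
    for tree in parse_trees:
        for level, param in tree:            # 3-2-1: one mapping rule per
            rules.append(("mapping", level, param))  # mapped semantic level
        levels = {level for level, _ in tree}
        if len(levels) > 1:                  # 3-2-2: mappings at different
            rules.append(("identification",  # levels of the same tree
                          tuple(sorted(levels))))
    mapped_trees = [t for t in parse_trees if t]
    if len(mapped_trees) > 1:                # mappings span sentence blocks
        if has_association_marks:            # 3-2-3: cross-block rule
            rules.append(("cross-block identification",))
        else:                                # 3-2-4: ask for the marks
            prompts.append("please supplement association marks")
    return rules, prompts
```

A tree with mappings at levels 1 and 2 yields two mapping rules plus one nested identification rule; two mapped trees without association marks trigger the user prompt of step 3-2-4.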
step 4, extracting the triples based on the end-to-end samples;
step 4-1, obtaining primary first-order logic formulas according to the hierarchical semantic structure of the input text;
step 4-2, performing correlation reasoning by using the first-order logic formulas, and realizing variable unification of the first-order logic formulas by using the default, reference and co-reference relations among contexts, obtaining unified first-order logic formulas after default recovery, reference resolution and entity unification;
step 4-3, performing mapping reasoning by using the unified first-order logic formulas, wherein each independent first-order logic formula may generate a plurality of native target structure fragments;
step 4-4, performing identification reasoning by using the integrated first-order logic formulas or the native target structure fragments to obtain a plurality of coupled target structure fragments;
step 4-5, if two coupled target structure fragments that are adjacent or overlapping in position have the same predicate, and either their parameters are completely different or the values of their shared parameters are identical, directly combining the two into a larger target structure as the final output; otherwise, executing step 4-6;
step 4-6, regarding the two as different target structure instances of the same predicate and taking them as final outputs;
step 4-7, repeating steps 4-5 and 4-6 until no new, larger coupled target fragments are generated; all target structure instances obtained are the final outputs.
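The fragment-combination loop of steps 4-5 to 4-7 can be sketched as follows; modeling a coupled target structure fragment as a (predicate, parameters, text span) tuple is an assumption about the data shape, not the patent's representation:

```python
def mergeable(a, b):
    """Step 4-5: same predicate, adjacent/overlapping spans, and either
    disjoint parameters or identical values on every shared parameter."""
    (pa, ka, sa), (pb, kb, sb) = a, b
    if pa != pb:
        return False
    if sa[1] < sb[0] - 1 or sb[1] < sa[0] - 1:   # not adjacent/overlapping
        return False
    shared = ka.keys() & kb.keys()
    return all(ka[k] == kb[k] for k in shared)

def combine_all(fragments):
    """Steps 4-5 to 4-7: merge until no new, larger fragment appears."""
    frags = list(fragments)
    changed = True
    while changed:
        changed = False
        for i in range(len(frags)):
            for j in range(i + 1, len(frags)):
                if mergeable(frags[i], frags[j]):
                    p, ka, sa = frags[i]
                    _, kb, sb = frags[j]
                    frags[i] = (p, {**ka, **kb},
                                (min(sa[0], sb[0]), max(sa[1], sb[1])))
                    del frags[j]
                    changed = True
                    break
            if changed:
                break
    return frags  # unmergeable fragments stay as separate instances (4-6)
```

Two "invent" fragments carrying a subject and an object over adjacent spans combine into one larger structure, while fragments with conflicting parameter values remain distinct target structure instances of the same predicate.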
Step 5, some applications of the triple knowledge extracted in steps 3 and 4 are as follows. Intelligent search: for example, a Baidu search for the current president of the United States mainly displays a certain president A and a certain president B, which indicates that the retrieval technology still needs further improvement. Intelligent question answering can be regarded as an extension of semantic search; applied in a chat robot, it can provide not only scene conversation but also knowledge of various industries; the knowledge graph it relies on is an open-domain knowledge graph whose knowledge is very broad, so it can both provide daily knowledge to the user and conduct chat-style conversation. A personalized recommendation system analyzes the social relations among users and the association relations between users and products by collecting users' interests, preferences and attributes as well as product classifications, attributes and contents, and infers the users' preferences and demands with a personalization algorithm, thereby recommending products or contents of interest to the users. Assisted decision-making analyzes and processes the knowledge in the knowledge graph and, through logical reasoning under certain rules, reaches conclusions that provide support for the user's decisions.
The present invention provides an improved chapter-level triple information extraction method, and there are many specific methods and approaches for implementing this technical solution; the above description is only a preferred embodiment of the present invention. It should be noted that, for those skilled in the art, a number of improvements and refinements can be made without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention. All components not specified in the embodiments can be realized by the prior art.

Claims (10)

1. An improved chapter-level triple information extraction method is characterized by comprising the following steps:
step 1, preprocessing text data;
step 2, performing chapter-level semantic analysis on the text data;
step 3, carrying out heuristic learning by adopting a multi-round iteration mode, and constructing an event semantic model;
and 4, extracting the triples based on the end-to-end samples.
2. The method of claim 1, wherein step 1 comprises the steps of:
step 1-1, converting a text data format;
step 1-2, preprocessing and cleaning the text data after format conversion by using a natural language processing technology;
step 1-3, text data chapter structure processing: splitting a long document into text blocks;
and 1-4, splitting text data sentence blocks, and further splitting the text blocks into physical sentence blocks with punctuation intervals.
3. The method of claim 2, wherein steps 1-2 comprise: sequentially performing the following processing on the text data after format conversion: full-width to half-width conversion, conversion of uppercase numerals to lowercase numerals, conversion of uppercase letters to lowercase letters, removal of emoticons, removal of all non-Chinese characters in the text so that only Chinese is retained, Chinese word segmentation, traditional-to-simplified Chinese conversion, and filtering of Chinese stop words.
4. The method of claim 3, wherein steps 1-4 comprise:
step 1-4-1, for the parentheses in the text block, if the content in the parentheses is in close semantic relation with the left adjacent component, merging the content in the parentheses and the left adjacent component into a semantic component, otherwise, not processing the parentheses;
step 1-4-2, for quotation marks in the sentence block, if the quotation mark body belongs to one part of a named entity, merging the quotation mark body with the named entity, otherwise, not processing;
and 1-4-3, for other symbols in the sentence block, if the symbols are part of the named entity, combining the other symbols in the sentence block and the related context into a semantic entity, and otherwise, taking the other symbols in the sentence block as marks for dividing the physical sentence block.
5. The method of claim 4, wherein step 2 comprises the steps of:
step 2-1, performing semantic analysis on continuous texts in chapters by using known syntactic and syntactic knowledge of linguistics, and respectively generating a list consisting of parse trees for each continuous text block;
step 2-2, combining the information structure of the text data, the category of the terms playing a specific role and the category of the text data, and decomposing the complex semantics into a hierarchical semantic structure;
step 2-3, entity alignment is carried out;
and 2-4, extracting the latest syntactic dependency verb by the entity.
6. The method according to claim 5, wherein in step 2-2, each level in the hierarchical semantic structure contains N semantic blocks related to facts or concepts, and N is a natural number; and according to the sequence of subsequent traversal, preferentially executing query operation on the semantic blocks of the nested layer to determine the extension of the nested layer, and after the processing of the nested layer is finished, executing query operation on the semantic blocks of other facts or concepts to determine the extension of each semantic block.
7. The method of claim 6, wherein steps 2-3 comprise:
judging, according to the entity name, whether an entity with the same name exists in a pre-established entity library; if not, generating a new entity pair and adding it to the entity library; otherwise, obtaining all same-name entity pairs, calculating the similarity between the target entity pair and each obtained entity pair, and comprehensively scoring and ranking the candidates according to the similarities of the category labels, attribute labels and unstructured-text keywords; if the highest score is less than a threshold value, adding the target entity to the entity library; otherwise, selecting the highest-scoring result as the alignment result of the target entity.
8. The method of claim 7, wherein steps 2-4 comprise:
step 2-4-1, letting the two different entities be ei and ej, and respectively extracting, by the following method, the dependency-associated nodes e'i and e'j of ei and ej reached through parallel (coordinate) or attributive (centered) structures: setting the current node to the parent node of the entity e; while the dependency relationship of the current node is a parallel-structure or attributive-structure relationship, continuing to traverse upward to its parent node; when the dependency relationship is no longer a parallel-structure or attributive-structure relationship, returning the current node;
step 2-4-2, extracting, by the following method, the verb Vj whose dependency relationship is closest to the dependency-associated node e'j of the 2nd entity ej: initializing the return value to null and setting the current node to the parent node of e'j; while the current node is not the root node, performing the judgment: if the current node is a verb node, it is the verb closest in dependency relationship to the entity e; ending the loop and returning this verb node as the sought nearest-dependency verb; otherwise, setting the parent node of the current node as the current node and continuing the judgment;
step 2-4-3, obtaining, by the following method, the verb Vi that is closest to the dependency-associated node e'i of the 1st entity ei through a subject-predicate relation or a preposed-object relation: initializing the return value to null and setting the current node to the parent node of e'i; while the current node is not the root node, performing the judgment: if the current node is a verb node and has a subject-predicate or preposed-object relation with the entity, it is the verb closest in dependency relationship to the entity e; ending the loop and returning this verb node as the sought nearest verb; otherwise, setting the parent node of the current node as the current node and continuing the judgment;
step 2-4-4, by judging whether the verbs Vi and Vj are the same verb or stand in a parallel-structure relationship, determining the dependent verb DV of the entity pair <ei, ej>, thereby determining a triple.
9. The method of claim 8, wherein step 3 comprises the steps of:
step 3-1, performing hierarchical semantic analysis on the text data, and generating mapping knowledge, identification knowledge and association knowledge according to a hierarchical semantic structure;
step 3-2, generating extraction knowledge according to the parsing tree generated by the training corpus and the parameter mapping, specifically comprising:
step 3-2-1, independently constructing a mapping rule for each semantic level with parameter mapping;
3-2-2, if parameter mappings at different levels exist in the same parse tree, constructing an identification rule containing the levels according to the nested points;
3-2-3, if parameter mapping related to the same target structure exists in different parse trees, constructing a cross sentence block identification rule according to the association points;
step 3-2-4, if two or more sentence blocks appear in the end-to-end sample, the sentence blocks contain parameter mappings, and the sample does not provide association-marking information about the sentence blocks, actively prompting the user to supplement the corresponding association marks;
and 3-2-5, if the center component modified and limited by the modifier in one layer is extracted, and other components in the layer are not extracted, the layer is not processed.
10. The method of claim 9, wherein step 4 comprises the steps of:
step 4-1, obtaining a primary first-order logic formula according to the hierarchical semantic structure of the input text;
step 4-2, performing correlation reasoning by using a first-order logic formula, and realizing variable unification of the first-order logic formula by using default, reference and unification relations among contexts to obtain a unified first-order logic formula after default recovery, reference resolution and entity unification;
step 4-3, performing mapping reasoning by using the unified first-order logic formulas, wherein each independent first-order logic formula may generate a native target structure fragment;
step 4-4, performing identification reasoning by using the integrated first-order logic formulas or the native target structure fragments to obtain coupled target structure fragments;
step 4-5, if two coupled target structure fragments that are adjacent or overlapping in position have the same predicate, and either the subjects and objects corresponding to the predicates in the text phrases are completely different or the values of their shared parameters are identical, directly combining the two into a larger target structure as the final output; otherwise, executing step 4-6;
step 4-6, regarding two coupled target structure segments with adjacent or overlapped positions as different target structure examples of the same predicate, and taking the different target structure examples as final output;
and 4-7, repeating the steps 4-5 and 4-6 until no new and larger coupling target segments are generated, and obtaining all target structure examples which are final outputs.
CN202110399643.8A 2021-04-14 2021-04-14 Improved chapter-level triple information extraction method Active CN113312922B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110399643.8A CN113312922B (en) 2021-04-14 2021-04-14 Improved chapter-level triple information extraction method


Publications (2)

Publication Number Publication Date
CN113312922A true CN113312922A (en) 2021-08-27
CN113312922B CN113312922B (en) 2023-10-24

Family

ID=77372136

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110399643.8A Active CN113312922B (en) 2021-04-14 2021-04-14 Improved chapter-level triple information extraction method

Country Status (1)

Country Link
CN (1) CN113312922B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015080561A1 (en) * 2013-11-27 2015-06-04 Mimos Berhad A method and system for automated relation discovery from texts
CN109446338A (en) * 2018-09-20 2019-03-08 大连交通大学 Drug disease relationship classification method neural network based
CA3060811A1 (en) * 2018-10-31 2020-04-30 Royal Bank Of Canada System and method for cross-domain transferable neural coherence model
CN111274790A (en) * 2020-02-13 2020-06-12 东南大学 Chapter-level event embedding method and device based on syntactic dependency graph
CN111597351A (en) * 2020-05-14 2020-08-28 上海德拓信息技术股份有限公司 Visual document map construction method

Non-Patent Citations (2)

Title
LIU Yitong: "Discourse-level event representation and relevance computation", China Master's Theses Full-text Database (Information Science and Technology), no. 02, pages 138-2371 *
HUANG Peixin et al.: "End-to-end joint knowledge triple extraction incorporating adversarial training", Journal of Computer Research and Development, vol. 56, no. 12, pages 2536-2548 *

Cited By (5)

Publication number Priority date Publication date Assignee Title
CN114707520A (en) * 2022-06-06 2022-07-05 天津大学 Session-oriented semantic dependency analysis method and device
CN114707520B (en) * 2022-06-06 2022-09-13 天津大学 Session-oriented semantic dependency analysis method and device
CN115081437A (en) * 2022-07-20 2022-09-20 中国电子科技集团公司第三十研究所 Machine-generated text detection method and system based on linguistic feature contrast learning
CN117094396A (en) * 2023-10-19 2023-11-21 北京英视睿达科技股份有限公司 Knowledge extraction method, knowledge extraction device, computer equipment and storage medium
CN117094396B (en) * 2023-10-19 2024-01-23 北京英视睿达科技股份有限公司 Knowledge extraction method, knowledge extraction device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN113312922B (en) 2023-10-24

Similar Documents

Publication Publication Date Title
CN110110054B (en) Method for acquiring question-answer pairs from unstructured text based on deep learning
CN109271529B (en) Method for constructing bilingual knowledge graph of Xilier Mongolian and traditional Mongolian
CN109684448B (en) Intelligent question and answer method
CN106649260B (en) Product characteristic structure tree construction method based on comment text mining
CN111209412B (en) Periodical literature knowledge graph construction method for cyclic updating iteration
CN110287494A (en) A method of the short text Similarity matching based on deep learning BERT algorithm
CN110502642B (en) Entity relation extraction method based on dependency syntactic analysis and rules
CN113312922B (en) Improved chapter-level triple information extraction method
CN101079025B (en) File correlation computing system and method
CN110609983B (en) Structured decomposition method for policy file
CN108984661A (en) Entity alignment schemes and device in a kind of knowledge mapping
CN111061882A (en) Knowledge graph construction method
CN113221559B (en) Method and system for extracting Chinese key phrase in scientific and technological innovation field by utilizing semantic features
CN114254653A (en) Scientific and technological project text semantic extraction and representation analysis method
CN112541337B (en) Document template automatic generation method and system based on recurrent neural network language model
CN113168499A (en) Method for searching patent document
CN113196277A (en) System for retrieving natural language documents
CN112926345A (en) Multi-feature fusion neural machine translation error detection method based on data enhancement training
Bounhas et al. A hybrid possibilistic approach for Arabic full morphological disambiguation
CN112733547A (en) Chinese question semantic understanding method by utilizing semantic dependency analysis
CN114997288A (en) Design resource association method
CN113343706A (en) Text depression tendency detection system based on multi-modal features and semantic rules
CN113361252B (en) Text depression tendency detection system based on multi-modal features and emotion dictionary
CN111178080A (en) Named entity identification method and system based on structured information
Sun A natural language interface for querying graph databases

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant