CN110390099B - Object relation extraction system and method based on template library

Info

Publication number: CN110390099B
Application number: CN201910583405.5A
Authority: CN
Inventors: 冯钧; 柳菁铧
Original assignee: Hohai University HHU
Current assignee: Hohai University HHU
Priority date: 2019-06-28
Filing date: 2019-06-28
Publication date: 2023-01-31
Anticipated expiration: 2039-06-28
Also published as: CN110390099A

Abstract

The invention discloses an object relation extraction system and an object relation extraction method based on a template library. The information frame extraction module extracts relation triples from the information frame of each corpus entry; the attribute name merging module then merges similar attribute names using a synonym table, which resolves the problem of one meaning being expressed by several different attribute names; finally, the object relation extraction module builds a template library from the triples extracted from the information frames and uses it to extract object relation triples from text. The method uses the processed information frame triples as relation seeds, generalizes the templates through feature clustering and edit distance, and finally constructs a relation template library, thereby improving relation extraction.

Description

Object relation extraction system and method based on template library

Technical Field

The invention relates to information processing technology, in particular to an object relation extraction system and an object relation extraction method based on a template library.

Background

In recent years, China's water conservancy industry has developed vigorously. The application of various monitoring tools and communication technologies has produced a large amount of water conservancy data, and this massive data has become an important basis for promoting water conservancy informatization. At the same time, the rapid development of the Internet has accumulated a large amount of information that contains valuable water conservancy knowledge; however, this knowledge comes from many sources and has a complex structure, so it is difficult to apply directly and effectively in practice. Such data can only be organized and then used by means of scientific and effective methods. Existing water conservancy domain knowledge graphs are constructed by mapping from existing water conservancy databases, which causes the following problems: (1) the mapping is limited by the design of the database tables, so the relations obtained between entities are of a single kind; (2) the knowledge has depth but lacks breadth; (3) the knowledge is updated relatively slowly. It is therefore desirable to extract knowledge from the Internet to enrich the local knowledge base.

Further analysis of the content of each corpus and of the structure of the local knowledge base reveals the following problems in relation extraction and entity linking. First, if the information frames of each corpus are extracted with a conventional semi-structured method, the problems of synonymous attribute names (one meaning, multiple words) and irregular attribute values occur. Synonymous attribute names arise because different writers express themselves differently when writing an entry page, so the same attribute may appear under different attribute names; for example, the same attribute "location" may appear under the attribute names "place", "location", and the like. Irregular attribute values are attribute values that consist of free text or of several values. Triples extracted in this preliminary way are not of high quality and cannot be added to the local knowledge base. Second, the information in the information frames of each corpus is unevenly distributed: the information frames of some entry pages contain a large amount of information, those of other entry pages contain very little, and some entry pages have no information frame at all. If only the semi-structured extraction method is used to extract the information in the information frames, the knowledge in each corpus cannot be obtained to the greatest possible extent.

Disclosure of Invention

The purpose of the invention is as follows: aiming at the problems and the defects in the prior art, the invention provides an object relation extraction system and an object relation extraction method based on a template library. The extraction system obtains a higher-quality triple through an information frame extraction module and an attribute name merging module; constructing a relation template library through an object relation extraction module to realize the extraction of a new relation instance; the extraction method can establish the relation template, so that the relation example can be extracted more accurately.

The technical scheme is as follows: the invention discloses an object relation extraction system based on a template library, which comprises an information frame extraction module, an attribute name merging module and an object relation extraction module; the information frame extraction module is used for extracting the relation triple of the information frame of the entry corpus; the attribute name merging module is used for merging similar attribute names in the information frame extraction module relation triples to obtain seed relation triples; and the object relation extraction module builds a template library according to the seed relation triple to realize the extraction of the object relation triple in the text.

Specifically, the information frame extraction module extracts the relation triple of the entry from the information frame of the corpus; the corpus information frame is a summary description of the item, and the relationship triple of the item can be extracted from the corpus information frame.

Specifically, the attribute name merging module merges similar attribute names in the relationship triples obtained by the information frame extraction module to obtain seed relationship triples; the attribute name merging module firstly obtains core words of attribute names through syntactic analysis, and then calculates the similarity between the attribute names by using a synonym table, so as to merge similar attribute names.

Specifically, the object relationship extraction module is used for constructing a template library and extracting object relationship triples; the object relation extraction module preprocesses the text corresponding to the entry to obtain a training corpus and a test corpus, extracts a sentence example in the training corpus through the seed relation triple, and constructs a feature vector; finally, generalizing all sentence examples through feature clustering and editing distance to construct a relational template library; and extracting a new relation instance from the test corpus through the relation template library.

The invention also discloses an object relation extraction method using the object relation extraction system based on the template library, which comprises the following steps:

step 1) extracting a relation triple of a required item from a corpus information frame by an information frame extracting module;

step 11) if the relation triple attribute value is a phrase formed by multiple words and is not a numerical value or an identifiable named entity, trimming the attribute value and extracting the identifiable named entity as the attribute value;

step 12) if the attribute value consists of several parallel entities of the same kind joined by special symbols such as "-", the attribute value is split at those special symbols, and each resulting part forms a relation triple with the entry.

Step 2) the attribute name merging module merges similar attribute names in the relation triples obtained by the information frame extraction module to obtain seed relation triples;

step 21) obtaining a description part of the attribute name by syntactic analysis, and deleting the description part;

step 22) calculating the similarity between the attribute names by utilizing the synonym table; if the eight-bit codes of the two attribute names are completely the same, the two attribute names are indicated to be synonyms, and can be combined into the same attribute name;

if the eight-bit codes of the two attribute names are not completely the same, the degree of synonymy between the two attribute names is calculated from their eight-bit codes; for two attribute names word1 and word2, their eight-bit codes code1 and code2 are looked up in the synonym table; the first seven bits of each eight-bit code are taken and layered according to a five-layer structure to obtain the five-layer codes t1 and t2 of word1 and word2; the common string t of t1 and t2 is obtained, and the calculation is given by formula (1).

wherein level is the maximum number of layers of the common string t; if level is 0, the two codes are completely different and the similarity is 0; likewise, if the two eight-bit codes end with "@", the word is independent in the synonym table and the similarity between the corresponding words is 0; if level is 5, the five-layer structures of the two words are completely the same, the weights of the five levels are accumulated, and f(t) is added; f(t) is determined by the last bit of the eight-bit code and is calculated as shown in formula (2): if the last bit is "=", the two words are exactly equivalent and the value is 1; if the last bit is "#", the two words are similar and the value is 0.5; if level is 1 to 4, the weights of the matching layers are accumulated from top to bottom until the layers differ, and f(t) = 0 in this case; the five layers of the five-layer structure are weighted from top to bottom as 0.65, 0.8, 0.9, 0.96, 1;

the total similarity value between the two attribute names is then calculated to decide whether the two attribute names are merged; after merging is finished, jump to step 3).

Step 3) the object relation extraction module extracts the body texts of all the entries handled by the information frame extraction module; first, noise reduction is performed on the extracted text to remove redundant hyperlinks and tags; then the text is split into sentences; finally, word segmentation, part-of-speech tagging and named entity recognition are performed on each sentence;

step 4) extracting, from the sentences obtained in step 3), the sentences in which the two entities of a seed relation triple co-occur, as sentence examples of that relation type;

step 5), extracting n-gram word characteristics, n-gram part-of-speech characteristics and distance characteristics of the sentence example to construct a characteristic vector;

step 6) replacing entity names in the sentence examples to obtain a relation template;

step 7) clustering the relation templates through the characteristics in the step 5), and generalizing the intra-cluster templates according to the editing distance;

step 71) carrying out k-means clustering on the relation templates of the required relation, wherein the features come from step 5); clustering yields template clusters with similar syntactic structures

$P = \{cluster_1, cluster_2, \ldots, cluster_m\}$;

Step 72) selecting a cluster $cluster_i$ from $P$ and calculating the edit distance between every pair of relation templates in the cluster;

step 73) initializing the edit distance matrix Edit of the relation templates $p_n$ and $p_m$ according to formula (3):

wherein $i$ ranges from 1 to $|p_n|$, and $|p_n|$ denotes the length of $p_n$, i.e. the total number of words in $p_n$; $j$ ranges from 1 to $|p_m|$, and $|p_m|$ denotes the length of $p_m$, i.e. the total number of words in $p_m$;

step 74) populate the Edit matrix according to equation (4):

$\mathrm{Edit}(i,j) = \min\bigl(1 + \mathrm{Edit}(i-1,j),\ 1 + \mathrm{Edit}(i,j-1),\ \mathrm{Edit}(i-1,j-1) + d(i,j)\bigr) \qquad (4)$

wherein $d(i,j)$ indicates whether $p_n[i]$ and $p_m[j]$ are identical; $p_n[i]$ and $p_m[j]$ denote the i-th word of template $p_n$ and the j-th word of template $p_m$, respectively; $d(i,j)$ is calculated as shown in formula (5):

$d(i,j) = \begin{cases} 0, & p_n[i] = p_m[j] \\ 1, & \text{otherwise} \end{cases} \qquad (5)$

formula (4) indicates that there are three options for converting $p_n[i]$ into $p_m[j]$:

(1) performing a replacement operation: $p_n[i]$ is replaced by $p_m[j]$; the minimum in formula (4) is then $\mathrm{Edit}[i,j] = \mathrm{Edit}[i-1,j-1] + d(i,j)$, where $d(i,j)$ is 0 when $p_n[i]$ and $p_m[j]$ are the same and 1 otherwise;

(2) performing a deletion operation: $p_n[i]$ is deleted; the minimum in formula (4) is then $\mathrm{Edit}[i,j] = \mathrm{Edit}[i-1,j] + 1$;

(3) performing a deletion operation: $p_m[j]$ is deleted, which, seen from $p_n$, corresponds to an insertion; the minimum in formula (4) is then $\mathrm{Edit}[i,j] = \mathrm{Edit}[i,j-1] + 1$;

Step 75) while the edit matrix Edit[i, j] is calculated, a matrix D records the operation that minimizes the current edit distance Edit[i, j]; depending on which term of formula (4) yields the minimum, D records the corresponding operation; the values in D are: I, indicating an insert operation; R, indicating a delete operation; E, indicating that the words are equal and no operation is needed; U, indicating a replacement operation;

step 76) selecting from $cluster_i$ the two templates with the minimum edit distance; if the edit distance between the two templates is larger than the threshold value, processing of the templates in $cluster_i$ stops and the method returns to step 72) to process the next cluster; otherwise jump to step 77);

step 77) letting $P_g$ be empty; starting from the bottom right corner of matrix D and proceeding until D[0, 0], the relation template is generalized according to the operation matrix D obtained from the two relation templates with the minimum edit distance, yielding the generalized template $P_g$;

Step 78) deleting the two selected templates with the minimum edit distance from $cluster_i$, adding the generalized template $P_g$, and jumping to step 73).

Step 8) end.
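Steps 73) to 78) can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the patented implementation: the boundary values used for formula (3) are the standard edit-distance initialization (the formula itself is not reproduced in this text), ties between operations are broken in favour of the diagonal move, and the generalization rule (keep equal words, replace every differing position by a wildcard `*`) is a hypothetical choice, since the text does not spell out how $P_g$ is built from D. The function names are likewise illustrative.

```python
def edit_matrices(p_n, p_m):
    """Fill Edit with the recurrence of formula (4) and record in D
    one operation (I, R, E or U) that minimizes each cell."""
    rows, cols = len(p_n) + 1, len(p_m) + 1
    edit = [[0] * cols for _ in range(rows)]
    ops = [[""] * cols for _ in range(rows)]
    for i in range(1, rows):          # assumed boundary: delete p_n[:i]
        edit[i][0], ops[i][0] = i, "R"
    for j in range(1, cols):          # assumed boundary: insert p_m[:j]
        edit[0][j], ops[0][j] = j, "I"
    for i in range(1, rows):
        for j in range(1, cols):
            same = p_n[i - 1] == p_m[j - 1]
            candidates = [
                (edit[i - 1][j - 1] + (0 if same else 1), "E" if same else "U"),
                (edit[i - 1][j] + 1, "R"),
                (edit[i][j - 1] + 1, "I"),
            ]
            edit[i][j], ops[i][j] = min(candidates, key=lambda c: c[0])
    return edit, ops


def generalize(p_n, p_m, wildcard="*"):
    """Walk D from the bottom right corner back to D[0][0] and build a
    generalized template (hypothetical rule: differing words become '*')."""
    edit, ops = edit_matrices(p_n, p_m)
    i, j = len(p_n), len(p_m)
    reversed_tokens = []
    while i > 0 or j > 0:
        op = ops[i][j]
        reversed_tokens.append(p_n[i - 1] if op == "E" else wildcard)
        if op in ("E", "U"):
            i, j = i - 1, j - 1
        elif op == "R":
            i -= 1
        else:                          # "I"
            j -= 1
    template = []
    for token in reversed(reversed_tokens):
        # collapse runs of wildcards into a single wildcard
        if token == wildcard and template and template[-1] == wildcard:
            continue
        template.append(token)
    return edit[len(p_n)][len(p_m)], template


def generalize_cluster(templates, threshold):
    """Steps 76) to 78): repeatedly merge the closest pair of templates
    until the smallest edit distance exceeds the threshold."""
    templates = [list(t) for t in templates]
    while len(templates) > 1:
        (dist, merged), x, y = min(
            ((generalize(a, b), x, y)
             for x, a in enumerate(templates)
             for y, b in enumerate(templates) if x < y),
            key=lambda item: item[0][0],
        )
        if dist > threshold:
            break
        templates = [t for k, t in enumerate(templates) if k not in (x, y)]
        templates.append(merged)
    return templates
```

For example, `generalize(["<E1>", "is", "located", "in", "<E2>"], ["<E1>", "is", "situated", "in", "<E2>"])` returns the distance 1 together with the template `["<E1>", "is", "*", "in", "<E2>"]`. The slot markers `<E1>` and `<E2>` stand for the replaced entity names of step 6) and are an illustrative notation.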

Beneficial effects: the invention discloses an object relation extraction system and an object relation extraction method based on a template library. The extraction system obtains higher-quality triples through the information frame extraction module and the attribute name merging module, and constructs a relation template library through the object relation extraction module to extract new relation instances. Because the extraction method builds a relation template library, object relations are extracted more accurately and more efficiently; knowledge can be extracted from corpus information on the Internet to update the local knowledge base, so that knowledge is updated faster and covers a greater breadth.

Drawings

FIG. 1 is a schematic diagram of a relationship between modules of an object relationship extraction system based on a template library according to the present invention;

FIG. 2 is a flowchart of an object relationship extraction method based on a template library.

Detailed Description

The invention discloses an object relation extraction system based on a template library, which comprises an information frame extraction module, an attribute name merging module and an object relation extraction module; the information frame extraction module is used for extracting the relation triple of the information frame of the entry corpus; the attribute name merging module is used for merging similar attribute names in the information frame extraction module relation triples to obtain seed relation triples; and the object relation extraction module builds a template library according to the seed relation triple to realize the extraction of the object relation triple in the text.

Specifically, the information frame extraction module extracts the relation triple of the items from the information frame of the corpus; the corpus information box is a summary description of the items, and the relationship triples of the items can be extracted from the corpus information box.

Analysis of the content of Wikipedia and of the structure of the local knowledge base reveals the following problems in relation extraction and entity linking. First, if the information frames of Wikipedia are extracted with a conventional semi-structured method, the problems of synonymous attribute names (one meaning, multiple words) and irregular attribute values may occur. Synonymous attribute names arise because different writers express themselves differently when writing an entry page, so the same attribute may appear under different attribute names; for example, the same attribute "location" may appear under the attribute names "place", "location", and the like. Irregular attribute values are attribute values that consist of free text or of multiple values. Triples extracted in this preliminary way are not of high quality and cannot be added to the local knowledge base. Second, the information in the Wikipedia information frames is unevenly distributed: the information frames of some entry pages contain a large amount of information, those of other entry pages contain very little, and some entry pages have no information frame at all. If only the semi-structured extraction method is used to extract the information in the information frames, the knowledge in Wikipedia cannot be obtained to the greatest possible extent.

Therefore, taking Wikipedia as the corpus, the object relation extraction method using the object relation extraction system based on the template library comprises the following steps:

for an entry to be processed, such as the Three Gorges Dam, a relevant link exists in the Wikipedia category directory information; the link is followed to the page of the Three Gorges Dam entry, and relation triples are extracted from the information frame in the page content;

the method comprises the following specific steps:

step 11) if the attribute value of the relation triple is a phrase formed by multiple words and is not a numerical value or an identifiable named entity, trimming the attribute value and extracting the identifiable named entity as the attribute value;

step 12) if the attribute value consists of several parallel entities of the same kind joined by special symbols such as "-", the attribute value is split at those special symbols, and each resulting part forms a relation triple with the entry.

The attribute value here refers to a value in the information frame. For example, the information frame of the Three Gorges Dam contains the attribute key-value pair (address: located 15 km southeast of Wuhan city). The relation triple obtained by direct extraction is (Three Gorges Dam, address, located 15 km southeast of Wuhan city), which is irregular: the attribute value "located 15 km southeast of Wuhan city" needs to be pruned and simplified to "Wuhan city". The relation triple finally obtained is (Three Gorges Dam, address, Wuhan city), in which Wuhan city is a named entity.
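Steps 11) and 12) can be illustrated with a minimal Python sketch. The separator set, the function names and the use of a fixed entity list in place of a real named entity recognizer are assumptions made for illustration; they are not part of the original disclosure.

```python
import re

# Hypothetical separator set; the text only speaks of "special symbols".
SEPARATORS = r"[、,，;；]"


def split_parallel_values(entry, attribute, value):
    """Step 12 sketch: one relation triple per parallel entity in the value."""
    parts = [part.strip() for part in re.split(SEPARATORS, value)]
    return [(entry, attribute, part) for part in parts if part]


def trim_value(value, known_entities):
    """Step 11 sketch: reduce a free-text value to a named entity.

    A lookup list stands in for a named entity recognizer (assumption).
    Numeric values, optionally followed by a unit, are left unchanged.
    """
    if re.fullmatch(r"[\d.]+\s*[A-Za-z%]*", value):
        return value
    hits = [(value.find(entity), entity) for entity in known_entities if entity in value]
    # keep the entity that appears first in the value, if any
    return min(hits)[1] if hits else value
```

With the example above, `trim_value("located 15 km southeast of Wuhan city", ["Wuhan city"])` yields `"Wuhan city"`, so the stored triple becomes (Three Gorges Dam, address, Wuhan city).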

if the eight-bit codes of the two attribute names are not completely the same, the degree of synonymy between the two attribute names is calculated from their eight-bit codes; for two attribute names word1 and word2, their eight-bit codes code1 and code2 are looked up in the synonym table; the first seven bits of each eight-bit code are taken and layered according to a five-layer structure to obtain the five-layer codes t1 and t2 of word1 and word2; the common string t of t1 and t2 is obtained, and the calculation is given by formula (1).

wherein level is the maximum number of layers of the common string t; if level is 0, the two codes are completely different and the similarity is 0; likewise, if the two eight-bit codes end with "@", the word is independent in the synonym table and the similarity between the corresponding words is 0; if level is 5, the five-layer structures of the two words are completely the same, the weights of the five levels are accumulated, and f(t) is added; f(t) is determined by the last bit of the eight-bit code and is calculated as shown in formula (2): if the last bit is "=", the two words are exactly equivalent and the value is 1; if the last bit is "#", the two words are similar and the value is 0.5; if level is 1 to 4, the weights of the matching layers are accumulated from top to bottom until the layers differ, and f(t) = 0 in this case; the five layers of the five-layer structure are weighted from top to bottom as 0.65, 0.8, 0.9, 0.96, 1;

An example of a similarity calculation is as follows:

if the code of "position" is "Cb01B01" and the code of "azimuth" is "Cb01a01", the corresponding five-layer codes are "C b 01 B 01" and "C b 01 a 01" and their level is 3, so the value is 0.65+0.8+0.9=2.35; if level = 4, the value is 0.65+0.8+0.9+0.96=3.31, and the value of formula (2) is 0. If level = 5, the value is 0.65+0.8+0.9+0.96=3.31 and formula (2) is taken into account by looking at the last bit of the two codes (this last bit is one of "#", "=" and "@" and is not part of the five-layer code); "@" has already been handled in formula (1); for "#" and "=" the final results are 3.31+0.5=3.81 and 3.31+1=4.31, respectively.
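The worked example can be reproduced with a short Python sketch. Several points are assumptions, because formulas (1) and (2) are not reproduced in this text: the layer widths of 1, 1, 2, 1 and 2 characters are those of a Cilin-style synonym code; the rule that a trailing "@" on either code gives similarity 0 is one reading of the text; and for level 5 the sketch follows the arithmetic of the worked example (3.31 plus the flag score), so whether the fifth weight should additionally be accumulated remains open. The trailing flags appended to the example codes are hypothetical.

```python
# Weights of the first four layers, top to bottom, as listed in the text.
# The fifth listed weight (1) coincides with the "=" flag score used in
# the worked example, so it is not added separately here (assumption).
LAYER_WEIGHTS = [0.65, 0.8, 0.9, 0.96]
# Assumed widths of the five layers inside the first seven characters.
LAYER_WIDTHS = [1, 1, 2, 1, 2]
# f(t): score of the eighth character when all five layers match.
FLAG_SCORE = {"=": 1.0, "#": 0.5}


def layers(code):
    """Split the first seven characters of a code into its five layers."""
    result, pos = [], 0
    for width in LAYER_WIDTHS:
        result.append(code[pos:pos + width])
        pos += width
    return result


def common_level(code1, code2):
    """Number of leading layers shared by the two codes (0 to 5)."""
    level = 0
    for a, b in zip(layers(code1), layers(code2)):
        if a != b:
            break
        level += 1
    return level


def similarity(code1, code2):
    """Accumulated layer weights plus f(t), following the worked example."""
    if code1.endswith("@") or code2.endswith("@"):
        return 0.0                       # word is independent in the table
    level = common_level(code1, code2)
    score = sum(LAYER_WEIGHTS[:min(level, 4)])
    if level == 5:
        score += FLAG_SCORE.get(code1[7:8], 0.0)
    return round(score, 2)
```

Calling `similarity("Cb01B01=", "Cb01a01=")` gives 2.35, matching the level 3 case of the example. A merging step would then compare this value with a threshold; the text does not state that threshold.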

step 4) extracting, from the sentences obtained in step 3), the sentences in which the two entities of a seed relation triple co-occur, as sentence examples of that relation type;

step 5) extracting n-gram word characteristics, n-gram part-of-speech characteristics and distance characteristics of the sentence example to construct a characteristic vector;

step 6), replacing the entity name in the sentence example to obtain a relation template;

step 71) carrying out k-means clustering on the relation template of the required relation, wherein the characteristics come from step 5); obtaining template clusters with similar syntactic structures through clustering

$P = \{cluster_1, cluster_2, \ldots, cluster_m\}$;

step 74) populate the Edit matrix according to equation (4):

$\mathrm{Edit}(i,j) = \min\bigl(1 + \mathrm{Edit}(i-1,j),\ 1 + \mathrm{Edit}(i,j-1),\ \mathrm{Edit}(i-1,j-1) + d(i,j)\bigr) \qquad (4)$

wherein $d(i,j)$ indicates whether $p_n[i]$ and $p_m[j]$ are identical; $p_n[i]$ and $p_m[j]$ denote the i-th word of template $p_n$ and the j-th word of template $p_m$, respectively; $d(i,j)$ is calculated as shown in formula (5);

(1) performing a replacement operation: $p_n[i]$ is replaced by $p_m[j]$; the minimum in formula (4) is then $\mathrm{Edit}[i,j] = \mathrm{Edit}[i-1,j-1] + d(i,j)$, where $d(i,j)$ is 0 when $p_n[i]$ and $p_m[j]$ are the same and 1 otherwise;

(2) performing a deletion operation: $p_n[i]$ is deleted; the minimum in formula (4) is then $\mathrm{Edit}[i,j] = \mathrm{Edit}[i-1,j] + 1$;

Step 75) while the edit matrix Edit[i, j] is calculated, a matrix D records the operation that minimizes the current edit distance Edit[i, j]; depending on which term of formula (4) yields the minimum, D records the corresponding operation; the values in D are: I, indicating an insert operation; R, indicating a delete operation; E, indicating that the words are equal and no operation is needed; U, indicating a replacement operation;

step 77) letting $P_g$ be empty; starting from the bottom right corner of matrix D and proceeding until D[0, 0], the relation template is generalized according to the operation matrix D obtained from the two relation templates with the minimum edit distance, yielding the generalized template $P_g$;

Step 8) end.
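Steps 4) to 6) can be sketched in Python as follows. The sketch assumes plain string matching for entity co-occurrence, the slot markers `<E1>` and `<E2>`, whitespace tokenization, and a distance feature defined as the number of tokens between the two slots; none of these details is fixed by the text, and the part-of-speech tags are supplied by the caller rather than computed.

```python
def sentence_examples(sentences, triple):
    """Step 4 sketch: keep sentences in which both entities of the seed
    triple (head, attribute, tail) occur."""
    head, _, tail = triple
    return [s for s in sentences if head in s and tail in s]


def to_template(sentence, triple, slots=("<E1>", "<E2>")):
    """Step 6 sketch: replace the two entity names by slot markers."""
    head, _, tail = triple
    return sentence.replace(head, slots[0]).replace(tail, slots[1])


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def features(template_tokens, pos_tags, n=2):
    """Step 5 sketch: word n-grams, part-of-speech n-grams and the
    distance between the two entity slots (assumed definition)."""
    first = template_tokens.index("<E1>")
    second = template_tokens.index("<E2>")
    return {
        "word_ngrams": ngrams(template_tokens, n),
        "pos_ngrams": ngrams(pos_tags, n),
        "distance": abs(second - first) - 1,
    }
```

Before clustering, the n-gram lists would still have to be turned into numeric vectors, for instance by counting n-grams over a fixed vocabulary; that encoding is left open here because the text does not describe it.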

Claims

1. An object relation extraction system based on a template library comprises an information frame extraction module, an attribute name merging module and an object relation extraction module; the method is characterized in that: the information frame extraction module is used for extracting the relation triple of the information frame of the entry corpus; the attribute name merging module is used for merging similar attribute names in the information frame extraction module relation triples to obtain seed relation triples; the object relation extraction module builds a template library according to the seed relation triples to realize the extraction of the object relation triples in the text;

the information frame extraction module extracts the relation triple of the items from the information frame of the corpus; the corpus information frame is used for describing the outline of the item, and the relation triple of the item can be extracted from the corpus information frame;

the attribute name merging module merges similar attribute names in the relationship triples obtained by the information frame extraction module to obtain a seed relationship triplet; the attribute name merging module firstly obtains core words of attribute names through syntactic analysis, and then calculates the similarity between the attribute names by using a synonym table so as to merge similar attribute names;

the object relation extraction module is used for constructing a template library and extracting object relation triples; the object relation extraction module preprocesses the text corresponding to the entry to obtain a training corpus and a test corpus, extracts a sentence example in the training corpus through the seed relation triple, and constructs a feature vector; finally, generalizing all sentence examples through feature clustering and editing distance to construct a relational template library; and extracting a new relation instance from the test corpus through the relation template library.

2. A template library-based object relationship extraction method using the template library-based object relationship extraction system according to claim 1, characterized in that: the method comprises the following steps:

step 2), the attribute name merging module merges similar attribute names in the relation triples obtained by the information frame extraction module to obtain seed relation triples;

step 3), extracting the text of all the entries in the information frame extraction module by an object relation extraction module; firstly, denoising an extracted text, and removing redundant hyperlinks and labels in the text; then, sentence division is carried out on the text; finally, word segmentation, part of speech recognition and named entity recognition are carried out on the single sentence;

step 8) end.

3. The template library-based object relationship extraction method according to claim 2, wherein:

the specific steps of extracting the information frame relation triples in the step 1) are as follows:

step 12) if the attribute value consists of several parallel entities of the same kind joined by special symbols, the attribute value is split at those special symbols, and each resulting part forms a relation triple with the entry.

4. The method for extracting object relationship based on the template library according to claim 2 or 3, wherein the specific step of combining similar attribute names in the step 2) is as follows:

step 22), calculating the similarity between the attribute names by using the synonym table; if the eight-bit codes of the two attribute names are completely the same, the two attribute names are indicated to be synonyms, and can be combined into the same attribute name;

if the eight-bit codes of the two attribute names are not completely the same, calculating the synonymy degree between the two attribute names according to the eight-bit codes of the two attribute names; for two attribute names word1, word2, finding out their eight-bit codes code1, code2 in the synonym table; taking the first seven bits of eight-bit codes, and layering the first seven bits of codes according to a five-layer structure to obtain five-layer codes t1 and t2 of the word1 and the word 2; obtaining a common string t of t1 and t2, wherein the calculation method is shown as formula (1):

wherein level is the maximum number of layers of the common string t; if level is 0, the two codes are completely different and the similarity is 0; likewise, if the two eight-bit codes end with "@", the word is independent in the synonym table and the similarity between the corresponding words is 0; if level is 5, the five-layer structures of the two words are completely the same, the weights of the five levels are accumulated, and f(t) is added; f(t) is determined by the last bit of the eight-bit code and is calculated as shown in formula (2): if the last bit is "=", the two words are exactly equivalent and the value is 1; if the last bit is "#", the two words are similar and the value is 0.5; if level is 1 to 4, the weights of the matching layers are accumulated from top to bottom until the layers differ, and f(t) = 0 in this case;

5. The template library-based object relationship extraction method according to claim 4, wherein: the weights of the five levels of the five-layer structure in step 22), from top to bottom, are:

0.65, 0.8, 0.9, 0.96, 1.

6. The method for extracting object relationship based on template library of claim 4, wherein the concrete steps of template generalization in step 7) are as follows:

step 71) carrying out k-means clustering on the required relation templates, wherein the features come from step 5); template clusters with similar syntactic structures, $P = \{cluster_1, cluster_2, \ldots, cluster_m\}$, are obtained through clustering;

wherein $i$ ranges from 1 to $|p_n|$, and $|p_n|$ denotes the length of $p_n$, i.e. the total number of words in $p_n$; $j$ ranges from 1 to $|p_m|$, and $|p_m|$ denotes the length of $p_m$, i.e. the total number of words in $p_m$;

step 74) populate the Edit matrix according to equation (4):

$\mathrm{Edit}(i,j) = \min\bigl(1 + \mathrm{Edit}(i-1,j),\ 1 + \mathrm{Edit}(i,j-1),\ \mathrm{Edit}(i-1,j-1) + d(i,j)\bigr) \qquad (4)$

formula (4) indicates that there are three options for converting $p_n[i]$ into $p_m[j]$:

(1) performing a replacement operation: $p_n[i]$ is replaced by $p_m[j]$; the minimum in formula (4) is then $\mathrm{Edit}[i,j] = \mathrm{Edit}[i-1,j-1] + d(i,j)$, where $d(i,j)$ is 0 when $p_n[i]$ and $p_m[j]$ are the same and 1 otherwise;

(2) performing a deletion operation: $p_n[i]$ is deleted; the minimum in formula (4) is then $\mathrm{Edit}[i,j] = \mathrm{Edit}[i-1,j] + 1$;

(3) performing a deletion operation: $p_m[j]$ is deleted; the minimum in formula (4) is then $\mathrm{Edit}[i,j] = \mathrm{Edit}[i,j-1] + 1$;

Step 75) while the edit matrix Edit[i, j] is calculated, a matrix D records the operation that minimizes the current edit distance Edit[i, j]; depending on which term of formula (4) yields the minimum, D records the corresponding operation; the values in D are: I, indicating an insert operation; R, indicating a delete operation; E, indicating that the words are equal and no operation is needed; U, indicating a replacement operation;

step 76) selecting from $cluster_i$ the two templates with the minimum edit distance; if the edit distance between the two templates is larger than the threshold value, processing of the templates in $cluster_i$ stops and the method returns to step 72) to process the next cluster; otherwise jump to step 77);

step 77) letting $P_g$ be empty; starting from the bottom right corner of matrix D and proceeding until D[0, 0], template generalization is performed according to the operation matrix D obtained from the two templates with the minimum edit distance;

step 78) deleting the two selected templates with the minimum edit distance from $cluster_i$, adding the generalized template $P_g$, and jumping to step 73).