CN110390099B - Object relation extraction system and method based on template library - Google Patents

Publication number
CN110390099B
Authority
CN
China
Prior art keywords
relation
edit
attribute
information frame
template
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910583405.5A
Other languages
Chinese (zh)
Other versions
CN110390099A (en)
Inventor
冯钧
柳菁铧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU
Priority to CN201910583405.5A
Publication of CN110390099A
Application granted
Publication of CN110390099B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G06F16/288 Entity relationship models
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an object relation extraction system and method based on a template library. The information frame extraction module extracts relation triples from the information frame of each corpus entry; the attribute name merging module then merges similar attribute names by means of a synonym table, resolving the case where one meaning is expressed by multiple attribute names; finally, the object relation extraction module builds a template library from the triples extracted from the information frames and uses it to extract object relation triples from text. The method uses the processed information frame triples as relation seeds, generalizes the templates through feature clustering and edit distance, and constructs a relation template library, thereby improving relation extraction.

Description

Object relation extraction system and method based on template library
Technical Field
The invention relates to information processing technology, and in particular to an object relation extraction system and method based on a template library.
Background
In recent years, China's water conservancy industry has developed rapidly. The application of various monitoring tools and communication technologies has produced a large amount of water conservancy data, and this massive data has become an important basis for water conservancy informatization. Meanwhile, the rapid development of the Internet has accumulated a large amount of information containing valuable water conservancy knowledge, but this knowledge comes from many sources, has a complex structure, and is difficult to apply directly and effectively in practice. The data can be used only after it has been organized by scientific and effective methods. Existing water conservancy domain knowledge graphs are constructed by mapping from existing water conservancy databases, which causes the following problems: (1) constrained by the table design of the database, the relations between entities obtained by mapping are of a single kind; (2) the knowledge has depth but lacks breadth; (3) the knowledge is updated relatively slowly. It is therefore desirable to extract knowledge from the Internet to enrich the local knowledge base.
Further analysis of the content of each corpus and the structure of the local knowledge base reveals the following problems in relation extraction and entity linking. First, if the information frames of each corpus are extracted by a conventional semi-structured method, two problems arise: one meaning expressed by multiple words, and irregular attribute values. "One meaning, multiple words" means that different authors of entry pages express themselves differently, so the same attribute may appear under different attribute names; for example, the attribute "location" may appear under names such as "place" and "location". Irregular attribute values are attribute values composed of text or of multiple values. Such preliminary extraction results are not of high quality and cannot be added to the local knowledge base. Second, information is unevenly distributed across the information frames of each corpus: the information frames of some entry pages are rich in information, those of others are sparse, and some entry pages have no information frame at all. If only the semi-structured method is used to extract the information in the information frames, the knowledge in each corpus cannot be obtained to the fullest extent.
Disclosure of Invention
Purpose of the invention: in view of the problems and shortcomings of the prior art, the invention provides an object relation extraction system and method based on a template library. The extraction system obtains higher-quality triples through the information frame extraction module and the attribute name merging module, and constructs a relation template library through the object relation extraction module to extract new relation instances. By building relation templates, the extraction method extracts relation instances more accurately.
Technical scheme: the invention discloses an object relation extraction system based on a template library, comprising an information frame extraction module, an attribute name merging module and an object relation extraction module. The information frame extraction module extracts relation triples from the information frames of the entry corpus; the attribute name merging module merges similar attribute names in the relation triples produced by the information frame extraction module to obtain seed relation triples; and the object relation extraction module builds a template library from the seed relation triples to extract object relation triples from text.
Specifically, the information frame extraction module extracts the relation triples of an entry from the information frame of the corpus. The information frame is a summary description of the entry, from which the entry's relation triples can be extracted.
Specifically, the attribute name merging module merges similar attribute names in the relation triples obtained by the information frame extraction module to obtain seed relation triples. It first obtains the core word of each attribute name through syntactic analysis and then computes the similarity between attribute names using a synonym table, so that similar attribute names are merged.
Specifically, the object relation extraction module constructs the template library and extracts object relation triples. It preprocesses the text corresponding to each entry to obtain a training corpus and a test corpus, uses the seed relation triples to extract sentence instances from the training corpus, and constructs feature vectors. It then generalizes all sentence instances through feature clustering and edit distance to construct the relation template library, and finally extracts new relation instances from the test corpus using the relation template library.
The invention also discloses an object relation extraction method using the above template-library-based object relation extraction system, comprising the following steps:
Step 1) the information frame extraction module extracts the relation triples of the required entry from the corpus information frame;
Step 11) if the attribute value of a relation triple is a multi-word phrase that is neither a numerical value nor a recognizable named entity, prune the attribute value and extract the recognizable named entity from it as the attribute value;
Step 12) if the attribute value consists of several parallel entities of the same kind joined by special symbols such as "-", split the attribute value at those symbols; each resulting part forms a relation triple with the entry.
Step 2) the attribute name merging module merges similar attribute names in the relation triples obtained by the information frame extraction module to obtain seed relation triples;
Step 21) obtain the descriptive part of each attribute name by syntactic analysis and delete it;
Step 22) compute the similarity between attribute names using the synonym table; if the eight-bit codes of two attribute names are identical, the two names are synonyms and can be merged into the same attribute name;
if the eight-bit codes of the two attribute names are not identical, compute the degree of synonymy between them from their eight-bit codes. For two attribute names word1 and word2, look up their eight-bit codes code1 and code2 in the synonym table; take the first seven bits of each code and split them according to the five-layer structure to obtain the five-layer codes t1 and t2 of word1 and word2; then obtain the common string t of t1 and t2, computed as shown in formula (1):
[Formula (1) is an image in the original publication and is not reproduced here.]
where level is the maximum number of layers of the common string t. If level is 0, the two codes are completely different and the similarity is 0; likewise, if the eight-bit codes end with "@", the word stands alone in the synonym table and the similarity between the corresponding words is 0. If level is 5, the five-layer structures of the two words are identical; the weights of the five levels are accumulated and f(t) is added. f(t) is determined by the last bit of the eight-bit code, as shown in formula (2): if the last bit is "=", the two words are exactly equal and f(t) is 1; if the last bit is "#", the two words are similar and f(t) is 0.5. If level is 1 to 4, the weights of the matching levels are accumulated from top to bottom until the levels differ, and f(t) = 0. From top to bottom, the five levels are weighted 0.65, 0.8, 0.9, 0.96 and 1;
f(t) = \begin{cases} 1, & \text{level} = 5 \text{ and the last bit is "="} \\ 0.5, & \text{level} = 5 \text{ and the last bit is "\#"} \\ 0, & 1 \le \text{level} \le 4 \end{cases} \quad (2)
the total similarity between the two attribute names is computed, it is decided whether to merge them, and once merging is complete the method proceeds to step 3).
Step 3) the object relation extraction module extracts the body text of all entries handled by the information frame extraction module; first, the extracted text is denoised by removing redundant hyperlinks and tags; then the text is split into sentences; finally, word segmentation, part-of-speech tagging and named entity recognition are performed on each sentence;
Step 4) from the sentences of step 3), extract those in which the two entities of a seed relation triple co-occur, as sentence instances of that relation type;
Step 5) extract n-gram word features, n-gram part-of-speech features and distance features from the sentence instances to construct feature vectors;
Step 6) replace the entity names in the sentence instances to obtain relation templates;
Step 7) cluster the relation templates using the features of step 5), and generalize the templates within each cluster according to edit distance;
Step 71) perform k-means clustering on the relation templates of the required relation, using the features from step 5); clustering yields template clusters with similar syntactic structure, P = {cluster_1, cluster_2, ..., cluster_m};
Step 72) select a cluster cluster_i from P and compute the edit distance between every pair of relation templates in the cluster;
Step 73) initialize the edit distance matrix Edit of relation templates p_n and p_m according to formula (3):
Edit(i,0)=i, Edit(0,j)=j (3)
where i ranges over 1, ..., |p_n|, and |p_n| denotes the length of p_n, i.e. its total number of words; j ranges over 1, ..., |p_m|, and |p_m| denotes the length of p_m, i.e. its total number of words;
Step 74) fill the Edit matrix according to formula (4):
Edit(i,j)=min(1+Edit(i-1,j),1+Edit(i,j-1),Edit(i-1,j-1)+d(i,j)) (4)
where d(i,j) indicates whether p_n[i] and p_m[j] are identical; p_n[i] and p_m[j] denote the i-th word of template p_n and the j-th word of template p_m respectively; d(i,j) is computed as shown in formula (5):
d(i,j) = \begin{cases} 0, & p_n[i] = p_m[j] \\ 1, & p_n[i] \ne p_m[j] \end{cases} \quad (5)
Formula (4) expresses that there are three options for converting p_n[i] into p_m[j]:
(1) Replacement: replace p_n[i] with p_m[j]; the minimum in formula (4) is then Edit[i,j] = Edit[i-1,j-1] + d(i,j), where d(i,j) is 0 when p_n[i] and p_m[j] are the same and 1 otherwise;
(2) Deletion: delete p_n[i]; the minimum in formula (4) is then Edit[i,j] = Edit[i-1,j] + 1;
(3) Deletion: delete p_m[j] (equivalently, insert p_m[j] into p_n); the minimum in formula (4) is then Edit[i,j] = Edit[i,j-1] + 1;
Step 75) while computing the matrix Edit[i,j], record in matrix D the operation that minimizes the current edit distance Edit[i,j]; depending on which term of formula (4) gives the minimum, D records the corresponding operation. The values in D are: I, an insert operation; R, a delete operation; E, equal, with no operation performed; U, a replace operation;
Step 76) select from cluster_i the two templates with the smallest edit distance; if the edit distance between the two templates is greater than the threshold, stop processing the templates in cluster_i and return to step 72) to process the next cluster; otherwise go to step 77);
Step 77) let P_g be empty; starting from the bottom-right corner of matrix D and proceeding until D[0,0], generalize the two relation templates with the smallest edit distance according to their operation matrix D to obtain the generalized template P_g;
Step 78) delete the two selected templates with the smallest edit distance from cluster_i, add the generalized template P_g, and go to step 73).
Step 8) End.
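Steps 4) to 6) can be illustrated with a minimal Python sketch. It assumes that step 3) has already produced tokenized sentences in which each entity mention is a single token. The function names, the slot markers "<E1>" and "<E2>", and the sample sentence are hypothetical and do not come from the patent. Part-of-speech n-grams are omitted because they would require a tagger.

```python
def find_sentence_instances(sentences, triple):
    """Step 4: keep tokenized sentences in which both entities co-occur."""
    head, _relation, tail = triple
    return [s for s in sentences if head in s and tail in s]


def word_ngrams(tokens, n):
    """Step 5 (partial): contiguous word n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def entity_distance(tokens, head, tail):
    """Step 5 (partial): number of token positions between the two entities."""
    return abs(tokens.index(head) - tokens.index(tail))


def to_template(tokens, head, tail, head_slot="<E1>", tail_slot="<E2>"):
    """Step 6: replace the entity names with slot markers (assumed markers)."""
    return [head_slot if t == head else tail_slot if t == tail else t
            for t in tokens]


# Illustrative data only.
seed = ("Three Gorges Dam", "address", "Wuhan City")
corpus = [
    ["Three Gorges Dam", "is", "located", "in", "Wuhan City"],
    ["Three Gorges Dam", "generates", "electricity"],
]
for sentence in find_sentence_instances(corpus, seed):
    print(to_template(sentence, seed[0], seed[2]))
    print(word_ngrams(sentence, 2), entity_distance(sentence, seed[0], seed[2]))
```

Representing a template as a token list keeps it directly usable as input to the word-level edit distance of step 7).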
Beneficial effects: the invention discloses an object relation extraction system and method based on a template library. The extraction system obtains higher-quality triples through the information frame extraction module and the attribute name merging module, and constructs a relation template library through the object relation extraction module to extract new relation instances. By building a relation template library, the extraction method extracts object relations more accurately and more efficiently, and the local knowledge base can be extended and updated from corpus information on the Internet, so that knowledge is updated quickly and covers a greater breadth.
Drawings
FIG. 1 is a schematic diagram of a relationship between modules of an object relationship extraction system based on a template library according to the present invention;
FIG. 2 is a flowchart of an object relationship extraction method based on a template library.
Detailed Description
The invention discloses an object relation extraction system based on a template library, comprising an information frame extraction module, an attribute name merging module and an object relation extraction module. The information frame extraction module extracts relation triples from the information frames of the entry corpus; the attribute name merging module merges similar attribute names in the relation triples produced by the information frame extraction module to obtain seed relation triples; and the object relation extraction module builds a template library from the seed relation triples to extract object relation triples from text.
Specifically, the information frame extraction module extracts the relation triples of an entry from the information frame of the corpus. The information frame is a summary description of the entry, from which the entry's relation triples can be extracted.
Specifically, the attribute name merging module merges similar attribute names in the relation triples obtained by the information frame extraction module to obtain seed relation triples. It first obtains the core word of each attribute name through syntactic analysis and then computes the similarity between attribute names using a synonym table, so that similar attribute names are merged.
Specifically, the object relation extraction module constructs the template library and extracts object relation triples. It preprocesses the text corresponding to each entry to obtain a training corpus and a test corpus, uses the seed relation triples to extract sentence instances from the training corpus, and constructs feature vectors. It then generalizes all sentence instances through feature clustering and edit distance to construct the relation template library, and finally extracts new relation instances from the test corpus using the relation template library.
Analysis of the content of Wikipedia and the structure of the local knowledge base reveals the following problems in relation extraction and entity linking. First, if the information frames (infoboxes) of Wikipedia are extracted by a conventional semi-structured method, two problems arise: one meaning expressed by multiple words, and irregular attribute values. "One meaning, multiple words" means that different authors of entry pages express themselves differently, so the same attribute may appear under different attribute names; for example, the attribute "location" may appear under names such as "place" and "location". Irregular attribute values are attribute values composed of text or of multiple values. Such preliminary extraction results are not of high quality and cannot be added to the local knowledge base. Second, information is unevenly distributed across Wikipedia information frames: the information frames of some entry pages are rich in information, those of others are sparse, and some entry pages have no information frame at all. If only the semi-structured method is used to extract the information in the information frames, the knowledge in Wikipedia cannot be obtained to the fullest extent.
Therefore, taking Wikipedia as the corpus, the object relation extraction method using the template-library-based object relation extraction system comprises the following steps:
Step 1) the information frame extraction module extracts the relation triples of the required entry from the corpus information frame;
An entry to be processed, such as the Three Gorges Dam, has a corresponding link in the Wikipedia category directory; following the link leads to the entry's page, and the relation triples are extracted from the information frame in the page content;
The specific steps are as follows:
Step 11) if the attribute value of a relation triple is a multi-word phrase that is neither a numerical value nor a recognizable named entity, prune the attribute value and extract the recognizable named entity from it as the attribute value;
Step 12) if the attribute value consists of several parallel entities of the same kind joined by special symbols such as "-", split the attribute value at those symbols; each resulting part forms a relation triple with the entry.
The attribute value here refers to an attribute value in the information frame. For example, the information frame of the Three Gorges Dam contains the attribute key-value pair (address, located 15 km southeast of Wuhan City). The relation triple obtained by direct extraction is (Three Gorges Dam, address, located 15 km southeast of Wuhan City), which is irregular: the attribute value "located 15 km southeast of Wuhan City" must be pruned and simplified to "Wuhan City". The resulting relation triple is (Three Gorges Dam, address, Wuhan City), in which Wuhan City is a named entity.
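The splitting rule of step 12) can be sketched in a few lines of Python. This is only an illustration under assumptions: the patent names "-" as a separator and leaves the rest open, so the separator set below, the function name `split_parallel_values` and the sample attribute are hypothetical. The pruning of step 11) is not shown because it depends on a named entity recognizer.

```python
import re

# Assumed separator set; the text only gives "-" as an example.
SEPARATORS = r"[-、,，;；/]"


def split_parallel_values(entry, attribute, value, separators=SEPARATORS):
    """Split a multi-entity attribute value into one triple per entity."""
    parts = [part.strip() for part in re.split(separators, value)]
    # Drop empty fragments produced by leading, trailing or doubled separators.
    return [(entry, attribute, part) for part in parts if part]


# Illustrative values only, not taken from the patent.
for triple in split_parallel_values(
        "Three Gorges Dam", "purpose", "flood control-power generation-navigation"):
    print(triple)
```

A real separator set would need care: a hyphen can also occur inside a single entity name, so splitting on it blindly may break such names apart.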
Step 2) the attribute name merging module merges similar attribute names in the relation triples obtained by the information frame extraction module to obtain seed relation triples;
Step 21) obtain the descriptive part of each attribute name by syntactic analysis and delete it;
Step 22) compute the similarity between attribute names using the synonym table; if the eight-bit codes of two attribute names are identical, the two names are synonyms and can be merged into the same attribute name;
if the eight-bit codes of the two attribute names are not identical, compute the degree of synonymy between them from their eight-bit codes. For two attribute names word1 and word2, look up their eight-bit codes code1 and code2 in the synonym table; take the first seven bits of each code and split them according to the five-layer structure to obtain the five-layer codes t1 and t2 of word1 and word2; then obtain the common string t of t1 and t2, computed as shown in formula (1):
[Formula (1) is an image in the original publication and is not reproduced here.]
where level is the maximum number of layers of the common string t. If level is 0, the two codes are completely different and the similarity is 0; likewise, if the eight-bit codes end with "@", the word stands alone in the synonym table and the similarity between the corresponding words is 0. If level is 5, the five-layer structures of the two words are identical; the weights of the five levels are accumulated and f(t) is added. f(t) is determined by the last bit of the eight-bit code, as shown in formula (2): if the last bit is "=", the two words are exactly equal and f(t) is 1; if the last bit is "#", the two words are similar and f(t) is 0.5. If level is 1 to 4, the weights of the matching levels are accumulated from top to bottom until the levels differ, and f(t) = 0. From top to bottom, the five levels are weighted 0.65, 0.8, 0.9, 0.96 and 1;
f(t) = \begin{cases} 1, & \text{level} = 5 \text{ and the last bit is "="} \\ 0.5, & \text{level} = 5 \text{ and the last bit is "\#"} \\ 0, & 1 \le \text{level} \le 4 \end{cases} \quad (2)
the total similarity between the two attribute names is computed, it is decided whether to merge them, and once merging is complete the method proceeds to step 3).
An example of the similarity calculation is as follows:
If the code of "position" is "Cb01B01" and the code of "azimuth" is "Cb01a01", the corresponding five-layer codes are "C b 01 B 01" and "C b 01 a 01"; their level is 3, so the value is 0.65+0.8+0.9=2.35. If level = 4, the value is 0.65+0.8+0.9+0.96=3.31, and the value of formula (2) is 0. If level = 5, the value is 0.65+0.8+0.9+0.96=3.31, and formula (2) is taken into account by looking at the last bit of the two codes (this last bit is "#", "=" or "@" and is not part of the five-layer code); "@" has already been handled in formula (1). The final results for "#" and "=" are 3.31+0.5=3.81 and 3.31+1=4.31 respectively.
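The scoring rule of step 22) can be sketched as follows. Several points are assumptions rather than statements from the patent: the first seven characters are split into layers of widths 1, 1, 2, 1 and 2, which is consistent with the level of 3 in the example above; a trailing "@" on either code yields 0; and for level 5 the sketch follows the step text, summing all five weights and adding f(t), whereas the worked example adds f(t) to the four-level sum of 3.31. All function names are hypothetical.

```python
# Assumed layer widths for the first seven characters (1+1+2+1+2 = 7).
LAYER_WIDTHS = (1, 1, 2, 1, 2)
# Level weights from top to bottom, as listed in step 22).
LEVEL_WEIGHTS = (0.65, 0.8, 0.9, 0.96, 1.0)


def split_layers(code):
    """Split the first seven characters of a code into its five layers."""
    layers, pos = [], 0
    for width in LAYER_WIDTHS:
        layers.append(code[pos:pos + width])
        pos += width
    return layers


def common_level(code1, code2):
    """Count how many layers match from the top before the codes diverge."""
    level = 0
    for a, b in zip(split_layers(code1), split_layers(code2)):
        if a != b:
            break
        level += 1
    return level


def tail_score(code):
    """f(t): 1 for a trailing '=', 0.5 for '#', otherwise 0."""
    return {"=": 1.0, "#": 0.5}.get(code[-1], 0.0)


def similarity(code1, code2):
    """Accumulated similarity score of two eight-character codes."""
    # Assumption: a trailing '@' on either code means the word stands alone.
    if code1.endswith("@") or code2.endswith("@"):
        return 0.0
    level = common_level(code1, code2)
    score = sum(LEVEL_WEIGHTS[:level])
    if level == 5:
        # Assumption: follows the step text (five weights plus f(t)).
        score += tail_score(code1)
    return score


# The codes from the example with a hypothetical '=' marker appended:
# three matching layers give 0.65 + 0.8 + 0.9.
print(similarity("Cb01B01=", "Cb01a01="))
```

The patent does not state the threshold at which two attribute names are merged, so the sketch stops at the score.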
Step 3) the object relation extraction module extracts the body text of all entries handled by the information frame extraction module; first, the extracted text is denoised by removing redundant hyperlinks and tags; then the text is split into sentences; finally, word segmentation, part-of-speech tagging and named entity recognition are performed on each sentence;
Step 4) from the sentences of step 3), extract those in which the two entities of a seed relation triple co-occur, as sentence instances of that relation type;
Step 5) extract n-gram word features, n-gram part-of-speech features and distance features from the sentence instances to construct feature vectors;
Step 6) replace the entity names in the sentence instances to obtain relation templates;
Step 7) cluster the relation templates using the features of step 5), and generalize the templates within each cluster according to edit distance;
step 71) carrying out k-means clustering on the relation template of the required relation, wherein the characteristics come from step 5); obtaining template clusters with similar syntactic structures through clustering
P={cluster 1 ,cluster 2 ,...,cluster m };
Step 72) selecting a cluster from P i Calculating the editing distance of pairwise relation templates in the cluster;
step 73) initializing the relationship template p according to formula (3) n ,p m Edit distance matrix Edit:
Figure BDA0002111201030000091
wherein i has a value range of (1, | p) n |),|p n I denotes p n Length of (i.e. p) n A total number of words; j has a value in the range of (1, | p) m |),|p m I represents p m Length of (i.e. p) m A total number of words;
step 74) populate the Edit matrix according to equation (4):
Edit(i,j)=min(1+Edit(i-1,j),1+Edit(i,j-1),Edit(i-1,j-1)+d(i,j)) (4)
wherein d(i,j) indicates whether p_n[i] and p_m[j] are identical, p_n[i] and p_m[j] denoting the i-th word of template p_n and the j-th word of template p_m respectively; d(i,j) is calculated as shown in formula (5):
$$d(i,j)=\begin{cases}0, & p_n[i]=p_m[j]\\ 1, & p_n[i]\neq p_m[j]\end{cases}\qquad(5)$$
Formula (4) indicates that there are three options for converting p_n[i] into p_m[j]:
(1) Replace operation: p_n[i] is replaced by p_m[j]; in this case the minimum in formula (4) is Edit(i,j)=Edit(i-1,j-1)+d(i,j), where d(i,j) is 0 when p_n[i] and p_m[j] are identical and 1 otherwise;
(2) Delete operation: p_n[i] is deleted; in this case the minimum in formula (4) is Edit(i,j)=Edit(i-1,j)+1;
(3) Delete operation: p_m[j] is deleted (equivalently, p_m[j] is inserted into p_n); in this case the minimum in formula (4) is Edit(i,j)=Edit(i,j-1)+1;
Step 75) While computing the edit matrix Edit, use a matrix D to record the operation that minimizes the current edit distance Edit(i,j); for each cell, D records the operation corresponding to whichever case of formula (4) attains the minimum. The values in D are: I, an insert operation; R, a delete operation; E, equal, no operation needed; U, a replace operation;
Step 76) Select from cluster_i the two templates with the minimum edit distance; if the edit distance between the two templates is greater than the threshold, stop processing the templates in cluster_i and return to step 72) to process the next cluster; otherwise go to step 77);
Step 77) Initialize P_g as empty; starting from the bottom-right corner of matrix D and tracing back to D[0,0], generalize the two relation templates with the minimum edit distance according to their operation matrix D to obtain the generalized template P_g;
Step 78) Delete the two selected minimum-edit-distance templates from cluster_i, add the generalized template P_g, and go to step 73).
Step 8) End.
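As a hedged illustration of steps 73) to 77), the Python sketch below fills Edit according to formula (4), records the minimizing operation in D, and traces D back from the bottom-right corner to build a generalized template. The function names, the usual boundary initialization Edit(i,0)=i and Edit(0,j)=j, the tie-breaking order, the `*` wildcard, and the rule that every non-equal operation becomes a wildcard are all assumptions; the text does not state how P_g is composed from the recorded operations.

```python
from typing import List, Tuple


def edit_matrices(pn: List[str], pm: List[str]) -> Tuple[List[List[int]], List[List[str]]]:
    """Fill Edit per formula (4) and record the minimizing operation in D."""
    n, m = len(pn), len(pm)
    edit = [[0] * (m + 1) for _ in range(n + 1)]
    ops = [[""] * (m + 1) for _ in range(n + 1)]
    # Assumed boundary initialization: only deletions or only insertions.
    for i in range(1, n + 1):
        edit[i][0], ops[i][0] = i, "R"
    for j in range(1, m + 1):
        edit[0][j], ops[0][j] = j, "I"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0 if pn[i - 1] == pm[j - 1] else 1  # formula (5)
            candidates = [
                (edit[i - 1][j - 1] + d, "E" if d == 0 else "U"),  # equal / replace
                (edit[i - 1][j] + 1, "R"),                          # delete p_n[i]
                (edit[i][j - 1] + 1, "I"),                          # insert p_m[j]
            ]
            # On ties the first candidate wins (assumed tie-breaking order).
            edit[i][j], ops[i][j] = min(candidates, key=lambda c: c[0])
    return edit, ops


def generalize(pn: List[str], pm: List[str], wildcard: str = "*") -> List[str]:
    """Trace D back from the bottom-right corner; keep equal words, else a wildcard."""
    _, ops = edit_matrices(pn, pm)
    i, j = len(pn), len(pm)
    out: List[str] = []
    while i > 0 or j > 0:
        op = ops[i][j]
        if op == "E":
            out.append(pn[i - 1])
            i, j = i - 1, j - 1
        elif op == "U":
            out.append(wildcard)
            i, j = i - 1, j - 1
        elif op == "R":
            out.append(wildcard)
            i -= 1
        else:  # "I"
            out.append(wildcard)
            j -= 1
    out.reverse()
    return out


a = ["<E1>", "was", "born", "in", "<E2>"]
b = ["<E1>", "was", "raised", "in", "<E2>"]
print(generalize(a, b))
```

Whether adjacent wildcards should be collapsed into one is left open here, since the text does not say.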

Claims (6)

1. An object relation extraction system based on a template library, comprising an information frame extraction module, an attribute name merging module and an object relation extraction module, characterized in that: the information frame extraction module is used for extracting relation triples from the information frames of the entry corpus; the attribute name merging module is used for merging similar attribute names in the relation triples produced by the information frame extraction module to obtain seed relation triples; the object relation extraction module builds a template library from the seed relation triples to extract object relation triples from text;
the information frame extraction module extracts the relation triples of an entry from the information frame of the corpus; the information frame describes the outline of the entry, and the relation triples of the entry can be extracted from it;
the attribute name merging module merges similar attribute names in the relation triples obtained by the information frame extraction module to obtain seed relation triples; the attribute name merging module first obtains the core word of each attribute name through syntactic analysis, and then calculates the similarity between attribute names using a synonym table, so as to merge similar attribute names;
the object relation extraction module is used for constructing the template library and extracting object relation triples; the object relation extraction module preprocesses the text corresponding to each entry to obtain a training corpus and a test corpus, extracts sentence examples from the training corpus using the seed relation triples, and constructs feature vectors; finally, all sentence examples are generalized through feature clustering and edit distance to construct a relation template library; and new relation instances are extracted from the test corpus using the relation template library.
2. A template library-based object relation extraction method using the template library-based object relation extraction system according to claim 1, characterized in that the method comprises the following steps:
step 1) the information frame extraction module extracts the relation triples of the required entries from the corpus information frames;
step 2) the attribute name merging module merges similar attribute names in the relation triples obtained by the information frame extraction module to obtain seed relation triples;
step 3) the object relation extraction module extracts the text of all entries handled by the information frame extraction module; the extracted text is first denoised by removing redundant hyperlinks and tags; the text is then split into sentences; finally, word segmentation, part-of-speech tagging and named entity recognition are performed on each sentence;
step 4) extracting, from the sentences obtained in step 3), those in which the two entities of a seed relation triple co-occur, as sentence examples of that relation type;
step 5) extracting n-gram word features, n-gram part-of-speech features and distance features of the sentence examples to construct feature vectors;
step 6) replacing the entity names in the sentence examples to obtain relation templates;
step 7) clustering the relation templates using the features from step 5), and generalizing the templates within each cluster according to edit distance;
step 8) end.
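Steps 4) to 6) can be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the function names, the `<E1>`/`<E2>` slot markers, plain substring matching for co-occurrence, whitespace tokenization, and the token-index distance are hypothetical choices that the claim does not fix.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (entity 1, relation, entity 2)
SLOTS = ("<E1>", "<E2>")       # assumed placeholder markers


def sentence_examples(sentences: List[str], triple: Triple) -> List[str]:
    """Step 4: keep sentences in which both entities of the triple co-occur."""
    subject, _relation, obj = triple
    return [s for s in sentences if subject in s and obj in s]


def to_template(sentence: str, triple: Triple) -> str:
    """Step 6: replace the entity names with slot markers."""
    subject, _relation, obj = triple
    return sentence.replace(subject, SLOTS[0]).replace(obj, SLOTS[1])


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Step 5 (word features): all contiguous n-grams of a token list."""
    return [tuple(tokens[k:k + n]) for k in range(len(tokens) - n + 1)]


def entity_distance(tokens: List[str]) -> int:
    """Step 5 (distance feature, assumed form): token gap between the two slots."""
    return abs(tokens.index(SLOTS[1]) - tokens.index(SLOTS[0]))


triple = ("Hohai University", "located in", "Nanjing")
sentences = ["Hohai University is located in Nanjing", "Nanjing is a city"]
examples = sentence_examples(sentences, triple)
template = to_template(examples[0], triple)
tokens = template.split()
print(template, ngrams(tokens, 2), entity_distance(tokens))
```

For Chinese text the whitespace split would be replaced by the word segmentation of step 3); part-of-speech n-grams can be built by applying the same `ngrams` helper to the tag sequence.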
3. The template library-based object relation extraction method according to claim 2, wherein the specific steps of extracting the information frame relation triples in step 1) are as follows:
step 11) if the attribute value of a relation triple is a multi-word phrase that is neither a numerical value nor a recognizable named entity, trimming the attribute value and extracting the recognizable named entity from it as the attribute value;
step 12) if the attribute value consists of several parallel entities of the same kind joined by special symbols, splitting the attribute value at those special symbols, each resulting segment forming a relation triple with the entry.
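A minimal Python sketch of step 12) is given below. The separator set and the function name are assumptions, since the claim only speaks of "special symbols" without listing them.

```python
import re
from typing import List, Tuple

# Assumed separator set; the claim does not enumerate the special symbols.
SEPARATORS = r"[、,,;;/]"


def split_triples(entry: str, attribute: str, value: str) -> List[Tuple[str, str, str]]:
    """Split a multi-entity attribute value and pair each part with the entry."""
    parts = [part.strip() for part in re.split(SEPARATORS, value)]
    return [(entry, attribute, part) for part in parts if part]


print(split_triples("Yangtze River", "tributaries", "Han River、Jialing River, Min River"))
```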
4. The template library-based object relation extraction method according to claim 2 or 3, wherein the specific steps of merging similar attribute names in step 2) are as follows:
step 21) identifying the descriptive part of the attribute name by syntactic analysis, and deleting the descriptive part;
step 22) calculating the similarity between attribute names using the synonym table; if the eight-bit codes of two attribute names are completely identical, the two attribute names are synonyms and can be merged into the same attribute name;
if the eight-bit codes of the two attribute names are not completely identical, calculating the degree of synonymy between the two attribute names from their eight-bit codes; for two attribute names word1 and word2, looking up their eight-bit codes code1 and code2 in the synonym table; taking the first seven bits of each eight-bit code and dividing them according to a five-layer structure to obtain the five-layer codes t1 and t2 of word1 and word2; obtaining the common string t of t1 and t2, the calculation being as shown in formula (1):
[Formula (1): image in the original publication]
wherein level is the maximum number of layers of the common string t; if level is 0, the two codes are completely different and the similarity is 0, or the two eight-bit codes end with "@", which indicates that the word stands alone in the synonym table, and the similarity between the corresponding words is 0; if level is 5, the five-layer structures of the two words are completely identical, and the weights of the five levels are accumulated and f(t) is added; f(t) is determined by the last bit of the eight-bit code and is calculated as shown in formula (2): if the last bit is "=", the two words are completely equal and the similarity is 1; if the last bit is "#", the two words are similar and the similarity is 0.5; if level is 1 to 4, the weights of the matching levels are accumulated from top to bottom until the levels differ, at which point accumulation stops and f(t)=0;
$$f(t)=\begin{cases}1, & \text{the last bit is ``=''}\\ 0.5, & \text{the last bit is ``\#''}\end{cases}\qquad(2)$$
calculating the total similarity between the two attribute names, deciding whether to merge them, and proceeding to step 3) once merging is complete.
5. The template library-based object relation extraction method according to claim 4, wherein in the five-layer structure of step 22), the weights weight_i of the five levels from top to bottom are:
0.65, 0.8, 0.9, 0.96, 1.
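The layering in step 22) can be illustrated with the short Python sketch below. It assumes the synonym table uses codes such as `Aa01A01=`, whose first seven characters divide into five layers of widths 1, 1, 2, 1 and 2; that layout, the function names, and the treatment of any other tail character as 0 are assumptions. The sketch stops at the common level and the tail value f(t) and does not combine them with the level weights, because formula (1) is not reproduced in the text.

```python
from typing import List

LAYER_WIDTHS = (1, 1, 2, 1, 2)  # assumed five-layer layout of the first seven characters


def layers(code: str) -> List[str]:
    """Split the first seven characters of an eight-character code into five layers."""
    out, pos = [], 0
    for width in LAYER_WIDTHS:
        out.append(code[pos:pos + width])
        pos += width
    return out


def common_level(code1: str, code2: str) -> int:
    """Number of leading layers shared by the two codes (0 to 5)."""
    level = 0
    for a, b in zip(layers(code1), layers(code2)):
        if a != b:
            break
        level += 1
    return level


def tail_value(code: str) -> float:
    """f(t) from the last character: '=' gives 1, '#' gives 0.5, anything else 0 (assumed)."""
    return {"=": 1.0, "#": 0.5}.get(code[7], 0.0)


print(layers("Aa01A01="), common_level("Aa01A01=", "Aa01B02#"), tail_value("Aa01B02#"))
```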
6. The template library-based object relation extraction method according to claim 4, wherein the specific steps of template generalization in step 7) are as follows:
step 71) performing k-means clustering on the relation templates using the features from step 5), to obtain a set of template clusters with similar syntactic structure, P = {cluster_1, cluster_2, ..., cluster_m};
step 72) selecting a cluster cluster_i from P and calculating the edit distance between every pair of relation templates in the cluster;
step 73) initializing the edit distance matrix Edit of relation templates p_n and p_m according to formula (3):
$$Edit(i,0)=i,\qquad Edit(0,j)=j\qquad(3)$$
wherein i ranges over (1, |p_n|), |p_n| denoting the length of p_n, i.e. the total number of words in p_n; j ranges over (1, |p_m|), |p_m| denoting the length of p_m, i.e. the total number of words in p_m;
step 74) filling the Edit matrix according to formula (4):
Edit(i,j)=min(1+Edit(i-1,j),1+Edit(i,j-1),Edit(i-1,j-1)+d(i,j)) (4)
wherein d(i,j) indicates whether p_n[i] and p_m[j] are identical, p_n[i] and p_m[j] denoting the i-th word of template p_n and the j-th word of template p_m respectively; d(i,j) is calculated as shown in formula (5):
$$d(i,j)=\begin{cases}0, & p_n[i]=p_m[j]\\ 1, & p_n[i]\neq p_m[j]\end{cases}\qquad(5)$$
formula (4) indicates that there are three options for converting p_n[i] into p_m[j]:
(1) replace operation: p_n[i] is replaced by p_m[j]; in this case the minimum in formula (4) is Edit(i,j)=Edit(i-1,j-1)+d(i,j), where d(i,j) is 0 when p_n[i] and p_m[j] are identical and 1 otherwise;
(2) delete operation: p_n[i] is deleted; in this case the minimum in formula (4) is Edit(i,j)=Edit(i-1,j)+1;
(3) delete operation: p_m[j] is deleted; in this case the minimum in formula (4) is Edit(i,j)=Edit(i,j-1)+1;
step 75) while computing the edit matrix Edit, using a matrix D to record the operation that minimizes the current edit distance Edit(i,j); for each cell, D records the operation corresponding to whichever case of formula (4) attains the minimum; the values in D are: I, an insert operation; R, a delete operation; E, equal, no operation needed; U, a replace operation;
step 76) selecting from cluster_i the two templates with the minimum edit distance; if the edit distance between the two templates is greater than the threshold, stopping the processing of the templates in cluster_i and returning to step 72) to process the next cluster; otherwise going to step 77);
step 77) initializing P_g as empty; starting from the bottom-right corner of matrix D and tracing back to D[0,0], performing template generalization according to the operation matrix D obtained from the two templates with the minimum edit distance;
step 78) deleting the two selected minimum-edit-distance templates from cluster_i, adding the generalized template P_g, and going to step 73).
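The per-cluster loop of steps 72), 76) and 78) might look like the following Python sketch. It is a hedged outline: `generalize_cluster`, its parameters, and the compact `edit_distance` helper are hypothetical names, and `positional_generalize` is only a stand-in for the trace-back generalization of step 77) that is meaningful for equal-length templates.

```python
from itertools import combinations
from typing import Callable, List

Template = List[str]


def edit_distance(a: Template, b: Template) -> int:
    """Word-level edit distance following formula (4), kept to two rows."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def positional_generalize(a: Template, b: Template) -> Template:
    """Stand-in for step 77): wildcard wherever aligned words differ (equal lengths only)."""
    return [x if x == y else "*" for x, y in zip(a, b)]


def generalize_cluster(
    cluster: List[Template],
    threshold: int,
    generalize: Callable[[Template, Template], Template] = positional_generalize,
) -> List[Template]:
    """Repeatedly merge the closest pair until the minimum distance exceeds the threshold."""
    cluster = [list(t) for t in cluster]
    while len(cluster) >= 2:
        # Steps 72) and 76): find the pair with the minimum edit distance.
        a, b = min(
            combinations(range(len(cluster)), 2),
            key=lambda p: edit_distance(cluster[p[0]], cluster[p[1]]),
        )
        if edit_distance(cluster[a], cluster[b]) > threshold:
            break  # step 76): stop processing this cluster
        merged = generalize(cluster[a], cluster[b])  # step 77)
        # Step 78): drop the two templates and add the generalized one (b > a).
        cluster.pop(b)
        cluster.pop(a)
        cluster.append(merged)
    return cluster


templates = [
    ["<E1>", "was", "born", "in", "<E2>"],
    ["<E1>", "was", "raised", "in", "<E2>"],
    ["<E1>", "married", "<E2>"],
]
print(generalize_cluster(templates, threshold=1))
```

Recomputing all pairwise distances on each pass mirrors the jump back to step 73); caching distances would be an optimization the text does not require.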

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910583405.5A CN110390099B (en) 2019-06-28 2019-06-28 Object relation extraction system and method based on template library

Publications (2)

Publication Number Publication Date
CN110390099A CN110390099A (en) 2019-10-29
CN110390099B true CN110390099B (en) 2023-01-31

Family

ID=68286017

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910583405.5A Active CN110390099B (en) 2019-06-28 2019-06-28 Object relation extraction system and method based on template library

Country Status (1)

Country Link
CN (1) CN110390099B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110969008B (en) * 2019-12-03 2020-08-28 北京中科院软件中心有限公司 Method and system for converting processing procedure description sentences into triple structures
CN111611799B (en) * 2020-05-07 2023-06-02 北京智通云联科技有限公司 Entity attribute extraction method, system and equipment based on dictionary and sequence labeling model
CN111651559B (en) * 2020-05-29 2023-05-26 辽宁工程技术大学 Social network user relation extraction method based on event extraction

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105550336A (en) * 2015-12-22 2016-05-04 北京搜狗科技发展有限公司 Mining method and device of single entity instance
CN108763353A (en) * 2018-05-14 2018-11-06 中山大学 Rule-based and remote supervisory Baidupedia relationship triple abstracting method
CN109408642A (en) * 2018-08-30 2019-03-01 昆明理工大学 A kind of domain entities relation on attributes abstracting method based on distance supervision
CN110188207A (en) * 2019-05-15 2019-08-30 出门问问信息科技有限公司 Knowledge mapping construction method and device, readable storage medium storing program for executing, electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
面向水利信息资源目录服务的分布式语义检索方法研究;冯钧等;《计算机与现代化》;20150309;全文 *

Also Published As

Publication number Publication date
CN110390099A (en) 2019-10-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant