CN115795060B - Entity alignment method based on knowledge enhancement
- Publication number
- CN115795060B (application CN202310063495.1A)
- Authority
- CN
- China
- Prior art keywords
- entity
- word
- key
- node
- key segment
- Prior art date
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention is applicable to the technical field of knowledge graphs and provides a knowledge-enhancement-based entity alignment method comprising the following steps: S1, training a key segment extraction model; S2, extracting the key segments in entity text descriptions by using the key segment extraction model; S3, constructing a directed graph according to the extracted key segments and merging the key segments describing the same entity by using a merging algorithm; and S4, merging the entity texts corresponding to the key segments describing the same entity. Compared with existing methods, the method integrates external knowledge into the key segment extraction model, improving the model's accuracy and generalization in extracting key segments and thereby improving the accuracy and generalization capability of the entity alignment method.
Description
Technical Field
The invention belongs to the technical field of knowledge graphs, and particularly relates to a knowledge-enhancement-based entity alignment method.
Background
With the gradual maturation of artificial intelligence technology, using named entity recognition to extract entities from corpora in the field of urban governance has become widespread. In a knowledge graph, entity alignment means associating the different descriptions that point to the same real entity. For a particular service, it is not enough to extract entities of a particular type; the different text descriptions of the same entity must also be associated together. For example, the virtual company descriptions "AB Geographic Information Company" and "AB Geographic Information" both point to the same virtual company, "AB Geographic Information Co., Ltd.". A method that associates and merges the different text descriptions of the same entity is an entity alignment method.
Currently, entity alignment methods fall mainly into three categories.
The first category is rule-based entity alignment methods. These methods design rules according to the characteristics of the entity description texts and then judge, according to the rules, whether different entity description texts point to the same real entity.
For example, one can construct prefix and suffix stop-word dictionaries, strip from each entity description any prefix or suffix contained in the dictionaries, and compare whether the stripped descriptions are identical; if so, they describe the same entity. Take virtual company 1 "Wuhan Happy Star DiXin Company" and virtual company 2 "Happy Star DiXin Company", with a prefix stop-word dictionary containing "Wuhan" and a suffix stop-word dictionary containing "Company". "Wuhan Happy Star DiXin Company" loses the prefix "Wuhan" and the suffix "Company", leaving "Happy Star DiXin"; "Happy Star DiXin Company" loses the suffix "Company", also leaving "Happy Star DiXin". After stripping, the two descriptions are identical, so this method concludes that "Wuhan Happy Star DiXin Company" and "Happy Star DiXin Company" describe the same entity.
Another rule judges whether two entity description texts point to the same real entity according to the edit distance between them, e.g., with a similarity threshold of 0.7. Assuming the edit-distance similarity of "Happy Star DiXin Company" and "Wuhan Happy Star DiXin Company" is 0.78, this rule concludes that the two describe the same entity.
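A minimal sketch of this rule in Python, assuming difflib's SequenceMatcher ratio as the edit-distance-based similarity (the exact similarity formula is not specified by the original; the 0.7 threshold is taken from the example above):

```python
from difflib import SequenceMatcher

def same_entity_by_similarity(desc_a: str, desc_b: str, threshold: float = 0.7) -> bool:
    # Rule: two descriptions point to the same real entity when their
    # normalized similarity ratio reaches the threshold.
    ratio = SequenceMatcher(None, desc_a, desc_b).ratio()
    return ratio >= threshold

# "Wuhan Happy Star DiXin Company" vs "Happy Star DiXin Company":
# a ratio of, say, 0.78 clears the 0.7 threshold, so the rule merges them.
```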
Because the rules are fixed, such methods can only handle simple entity alignment tasks with uniform forms; their accuracy is poor on entity alignment tasks with complex text descriptions.
The second category is machine learning based methods. These methods design a machine learning model based on mathematical ideas such as statistics and discrete graph theory, construct entity features from the entity corpus, input the features into the model, and judge whether different entity descriptions point to the same real entity.
One example is the term frequency-inverse document frequency (TF-IDF) based method: if a word occurs in many entities across the corpus, its TF-IDF score is relatively low; if a word occurs rarely in the whole entity corpus but frequently within one entity, its TF-IDF score is higher. Words with higher TF-IDF scores are kept as the entity's description, the filtered descriptions are compared with each other, and a match is judged to point to the same real entity. For example, virtual company 1 is "Wuhan Happy Star DiXin Company" and virtual company 2 is "Happy Star DiXin Company"; after word segmentation, virtual company 1 yields ["Wuhan", "Happy", "Star", "DiXin", "Company"] and virtual company 2 yields ["Happy", "Star", "DiXin", "Company"]. Assuming "Wuhan" and "Company" occur frequently in the whole entity corpus, their scores are low while "Happy", "Star" and "DiXin" score high, so after TF-IDF filtering both descriptions become [Happy Star DiXin]; this method therefore concludes that "Wuhan Happy Star DiXin Company" and "Happy Star DiXin Company" describe the same entity.
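The following sketch illustrates this TF-IDF filtering with scikit-learn; the cutoff value and the space-separated pre-segmented input are assumptions of this sketch:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_signatures(segmented_docs, cutoff=0.5):
    # Keep only the words whose TF-IDF score exceeds the cutoff; the surviving
    # word set is used as the entity's signature for exact comparison.
    vec = TfidfVectorizer(analyzer=str.split)
    matrix = vec.fit_transform(segmented_docs).toarray()
    vocab = vec.get_feature_names_out()
    return [frozenset(w for w, s in zip(vocab, row) if s > cutoff) for row in matrix]

# Over a large corpus in which "Wuhan" and "Company" are frequent, both
# "Wuhan Happy Star DiXin Company" and "Happy Star DiXin Company" reduce to
# the signature {"Happy", "Star", "DiXin"} and are judged the same entity.
```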
Such methods can handle entity alignment tasks with more complex text descriptions, but because they use only shallow semantic information, they generalize poorly.
The third category is deep learning based methods. By constructing a deep neural network, these methods autonomously learn distinguishing features of the corpus and judge whether different text descriptions point to the same real entity.
One example is a model based on word-vector similarity. First, a pre-trained neural network model is used as the word-vector model to convert each word of a description into a vector, forming the description's word vectors; then whether two descriptions point to the same entity is judged from the similarity of their word vectors.
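A sketch of the comparison step, assuming mean-pooled embeddings are already available and a hypothetical similarity threshold of 0.9:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def same_entity_by_vectors(vec_a: np.ndarray, vec_b: np.ndarray, threshold: float = 0.9) -> bool:
    # Two descriptions are aligned when their word-vector representations
    # (e.g. mean-pooled embeddings from a pre-trained model) are close enough.
    return cosine_similarity(vec_a, vec_b) >= threshold
```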
Such methods can handle entity alignment tasks with complex text descriptions and can exploit deep semantic information, but their generalization is still insufficient. For example, a model trained on data from the virtual city "Haha City" can understand that the text segments "Haha City" and "Hum City" are administrative regions, but it struggles with the bare segment "Hum", which neither appears in the training set nor carries the marker "City"; the model therefore cannot judge the importance of "Hum" within an entity description (for example, for the entity "Hum Happy Star DiXin Company", if "Hum" could be judged to be an administrative-region name, the model could focus on comparing the remaining description "Happy Star DiXin Company").
Disclosure of Invention
In view of the above problems, the present invention aims to provide a knowledge-enhancement-based entity alignment method that solves the technical problems of insufficient extraction accuracy and generalization capability in existing methods.
The invention adopts the following technical scheme:
the entity alignment method based on knowledge enhancement comprises the following steps:
s1, training a key segment extraction model;
s2, extracting the key segments in the entity text descriptions by using the key segment extraction model;
s3, constructing a directed graph according to the extracted key segments, and merging the key segments describing the same entity by using a merging algorithm;
s4, merging the entity texts corresponding to the key segments describing the same entity.
Further, the specific process of step S1 is as follows:
s11, inputting a training sample set, wherein each sample has the format [entity text, key segment];
s12, converting each sample into corresponding word codes with a BERT tokenizer, adding special character codes at the head and tail to form the word encoding of the entity text, and inputting the encoding into a BERT model to obtain an embedded vector for each word;
s13, for each sample, performing word segmentation on the entity text with several word segmentation tools to obtain several word segmentation lists;
s14, classifying each word in each word segmentation list with the PaddleNLP tool to obtain the word's category attribute;
s15, for each word in the entity text, weighting and fusing its embedded vector with the category vectors corresponding to the category attributes of the segmented words containing it, to obtain a fused embedded vector for each word;
s16, forming the entity embedded vector from the fused embedded vector of each word and the vectors corresponding to the special characters;
s17, inputting the fused entity embedded vector into a BiLSTM network to obtain the corpus segment emission matrix;
s18, inputting the emission matrix into a CRF network, and calculating the correct tag sequence score and the total score of all possible tag sequences from the emission matrix and the transition matrix;
s19, calculating the loss score from the correct tag sequence score and the total score of all possible tag sequences;
s110, traversing the training samples; after each traversal, updating the key segment extraction model parameters by gradient descent, then testing the model's accuracy on the verification samples, and selecting the parameter version with the highest verification accuracy as the final trained model.
Further, there are three word segmentation tools, and the specific process of step S15 is as follows:
151. for each word Ci in the entity text, find the segmented word containing Ci in each of the three word segmentation lists, and obtain the category attribute of each such word;
152. obtain the category vectors of the word Ci in the three word segmentation lists according to the category attributes;
153. each word Ci is weighted and fused:

Concat_k = Concat(Char_Tensor, Classk_Tensor), k = 1, 2, 3
Fusion_Tensor = w_1·Concat_1 + w_2·Concat_2 + w_3·Concat_3

wherein Char_Tensor is the embedded vector obtained for word Ci via the BERT model, Classk_Tensor is the category vector of the segmented word containing Ci in the k-th word segmentation list, w_k is the weight of the k-th spliced vector, Concat(·,·) splices the two vectors together, and Fusion_Tensor is the fused embedded vector of word Ci.
Further, the process of step S17 is as follows: the fused entity embedded vector is input into a BiLSTM network to obtain the hidden-layer state vector of the entity, and the hidden-layer state vector is input into a fully connected layer to obtain the corpus segment emission matrix Emit_m, wherein Emit_m is a matrix of dimensions Tag_num × Span_Len, Tag_num is the number of tags (5 in total), and Span_Len is the number of embedded vectors in the word-vector sequence of the entity text.
Further, the process of step S18 is as follows:
181. for each word in the entity text: if the word does not belong to the key segment, mark it O; if it belongs to a key segment whose character count is greater than 1, mark the first character B, the last character E and the remaining characters I; if it belongs to a key segment whose character count equals 1, mark it S;
182. arrange the tags in the order of the words in the entity text, and add the tag O at the head and tail of the sequence to form the correct tag sequence corresponding to the sample;
183. the emission matrix Emit_m is input into the CRF network, which derives the correct tag sequence score and the total score of all possible tag sequences from the emission matrix Emit_m and the transition matrix Trans_m using the loss score formula.
Further, the specific process of step S2 is as follows:
s21, obtaining a corresponding key segment of each entity text in the candidate entity corpus by using a key segment extraction model;
s22, if the key segment extracted by the key segment extraction model is empty, deleting the corresponding entity text; if the inferred key segment is not empty, combining the obtained key segment with the entity text to form an entity key segment tuple, all entity key segment tuples together forming an entity key segment tuple corpus;
s23, performing de-duplication treatment on all the obtained key fragments to obtain a key fragment corpus.
Further, the specific process of step S3 is as follows:
s31, constructing a directed graph Key_Map from the key segments in the key segment corpus, specifically: taking each key segment as a node, giving it an attribute, namely the entity class, initialized to the key segment itself; traversing the nodes, comparing each node with every other node exactly once, and if Node_i contains all the words of Node_j, connecting an edge between Node_i and Node_j, increasing the in-degree of Node_i by 1 and the out-degree of Node_j by 1;
s32, for the directed graph Key_Map, setting the entity class value of every node with out-degree 0 to the node itself;
s33, traversing the nodes with out-degree greater than 0, and for every node with out-degree 1, assigning its entity class value to the entity class value of its arc-head node and reducing its out-degree by 1;
s34, traversing the nodes with out-degree greater than 0, checking the arc-head nodes of each node, and if the entity class value of an arc-head node is itself another arc head of the current node, deleting the edge to that arc-head node and reducing the current node's out-degree by 1;
s35, repeating steps S33 and S34 until a full traversal of the directed graph produces no change;
s36, for each entity key segment tuple, searching a node corresponding to the key segment in the directed graph, and replacing the key segment with an entity class value of the corresponding node;
s37, traversing the entity key segment tuple corpus, and merging the same key segments into one type.
Further, the specific process of step S4 is as follows:
and merging entity texts in entity key fragment tuples corresponding to the key fragments gathered into one type.
The beneficial effects of the invention are as follows: compared with the classical key segment extraction model, i.e., the combined BERT + BiLSTM + CRF model, the invention fuses, at a fusion layer, the embedded vector of each word with the category information of the segmented words it belongs to, so that the embedded vector expresses each word of the entity text more accurately and with stronger generalization capability; at the same time, additional knowledge information is integrated, which strengthens the accuracy of the key segment extraction model within the entity alignment method and thus improves the accuracy of entity alignment. Under big-data conditions the method can automatically cluster different description texts of the same entity together, providing support and assurance for public opinion analysis in urban governance and for entity knowledge graph construction.
Drawings
FIG. 1 is a flow chart of a knowledge-based enhanced entity alignment method provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of a key segment extraction model provided by an embodiment of the present invention;
FIG. 3 is a schematic flow chart of a fusion layer provided in an embodiment of the invention;
FIG. 4 is a schematic diagram of an example of a directed graph merging algorithm.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
In order to illustrate the technical scheme of the invention, the following description is made by specific examples.
As shown in fig. 1, the entity alignment method based on knowledge enhancement provided in this embodiment includes the following steps:
and S1, training a key segment extraction model.
This step is the model training stage. A schematic diagram of the key segment extraction model is shown in fig. 2, and the flow of the fusion layer is shown in fig. 3. With reference to figs. 2 and 3, the specific process of model training in this step is as follows:
s11, inputting a training sample set, wherein the sample format in the sample set is [ entity text, key fragment ].
Training samples are accurately annotated sample data. Sample one: [Haha City Happy Star DiXin Company, Happy Star DiXin]; sample two: [Haha Open Investment Co., Ltd., Open Investment].
S12, converting each sample into a corresponding word code by using a BERT word segmentation device, adding special character codes at the head and the tail to form a word code of the entity text, and inputting the word code into a BERT model to obtain an embedded vector of each word.
Each word of the sample is converted into its corresponding word code by the BERT tokenizer, and special character codes are added at the head and tail of the encoding to form the word encoding of the entity text. This step uses the Chinese-BERT-wwm-ext model (BERT: Bidirectional Encoder Representations from Transformers).
For example, the entity text "Haha City Happy Star DiXin Company" is converted into the word codes: [101, 1506, 1506, 2356, 2571, 727, 3215, 4413, 1765, 928, 1062, 1385, 102], where 101 is the code of the special character 'CLS' and 102 is the code of the special character 'SEP'; the word codes of every entity text begin with 101 and end with 102. The word codes of the entity text are then input into the BERT model to obtain the embedded vector Char_Tensor of each word.
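A sketch of this encoding step, assuming the Hugging Face transformers interface to the published hfl/chinese-bert-wwm-ext checkpoint:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("hfl/chinese-bert-wwm-ext")
bert = BertModel.from_pretrained("hfl/chinese-bert-wwm-ext")

entity_text = "哈哈市快乐星球地信公司"  # "Haha City Happy Star DiXin Company"
encoded = tokenizer(entity_text, return_tensors="pt")  # prepends [CLS]=101, appends [SEP]=102
with torch.no_grad():
    outputs = bert(**encoded)
char_tensor = outputs.last_hidden_state  # (1, num_chars + 2, 768): one embedding per word code
```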
S13, aiming at each sample, performing word segmentation processing on the entity text by using a plurality of word segmentation tools to obtain a multi-group word segmentation list.
For example, for each sample, three word segmentation tools are used to segment the entity text separately, namely jieba segmentation, THULAC segmentation and LAC segmentation, yielding three word segmentation lists.
For example, the entity text "Haha City Happy Star DiXin Company" is segmented by the three tools as follows:
Word segmentation list 1: [Haha, City, Happy, Star, DiXin Company]
Word segmentation list 2: [Haha City, Happy, Star, DiXin, Company]
Word segmentation list 3: [Haha, City, Happy Star, DiXin, Company].
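A sketch of step S13 for a single tool; jieba is used here, and the other two tools are applied in the same way to produce their own lists (the printed segmentation is illustrative):

```python
import jieba  # one of the three segmenters; the other two tools are used the same way

entity_text = "哈哈市快乐星球地信公司"
word_list = jieba.lcut(entity_text)  # one word segmentation list per tool
print(word_list)  # e.g. ['哈哈', '市', '快乐', '星球', '地信', '公司'] (illustrative)
```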
S14, classifying each word in each group of word list by using a PaddleNLP tool to obtain the category attribute of the word.
PaddleNLP is an open-source natural language processing library. This step uses the named entity recognition tool in PaddleNLP to obtain the category attribute of each word in each word segmentation list; PaddleNLP classifies words into 93 categories.
For example, word segmentation list 1, [Haha, City, Happy, Star, DiXin Company], yields after PaddleNLP processing:
[(Haha, Personification), (City, World Region_Division Concept), (Happy, Modifier), (Star, World Region_Geographical Concept), (DiXin Company, Organization_Concept)];
word segmentation list 2, [Haha City, Happy, Star, DiXin, Company], yields:
[(Haha City, World Region), (Happy, Modifier), (Star, World Region_Geographical Concept), (DiXin, Term), (Company, Organization_Concept)];
word segmentation list 3, [Haha, City, Happy Star, DiXin, Company], yields:
[(Haha, Personification), (City, World Region_Division Concept), (Happy Star, Organization), (DiXin, Term), (Company, Organization_Concept)].
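A sketch of this classification step, assuming PaddleNLP's Taskflow NER interface, which returns (word, category) pairs (the printed categories are illustrative):

```python
from paddlenlp import Taskflow

# PaddleNLP's NER Taskflow labels each segmented word with a category
# from its fine-grained knowledge-based tag set.
ner = Taskflow("ner")
print(ner("哈哈市快乐星球地信公司"))
# Illustrative output: [('哈哈市', '世界地区类'), ('快乐', '修饰词'), ('星球', '世界地区类_地理概念'), ...]
```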
S15, the embedded vector of each word in the entity text is weighted and fused with the category vectors corresponding to the category attributes of the segmented words containing the word, to obtain the fused embedded vector of each word.
The specific process of weighted fusion is as follows:
151. For each word Ci in the entity text, find the segmented word containing Ci in each of the three word segmentation lists and obtain the category attribute of each such word.
152. Obtain, according to the category attributes, the category vectors of the word Ci in the three word segmentation lists: Class1_Tensor, Class2_Tensor and Class3_Tensor. The three category vectors may be identical when the word categories agree. In this embodiment, the key segment extraction model maintains a 93 × 768 category matrix in which each 1 × 768 row vector represents one category; the category matrix is a learnable parameter that keeps changing through repeated training, and its initial values are assigned randomly.
153. Each word Ci is weighted and fused:

Concat_k = Concat(Char_Tensor, Classk_Tensor), k = 1, 2, 3
Fusion_Tensor = w_1·Concat_1 + w_2·Concat_2 + w_3·Concat_3

where Char_Tensor is the embedded vector obtained for word Ci via the BERT model, Classk_Tensor is the category vector of the segmented word containing Ci in the k-th word segmentation list, w_k is the weight of the k-th spliced vector, Concat(·,·) splices the two vectors together, and Fusion_Tensor is the fused embedded vector of word Ci.
Take the word 'Ha' as an example: across the three word segmentation lists, the segmented words containing it are "Haha", "Haha City" and "Haha".
Assuming the category attribute obtained for "Haha" in step 151 is "Personification", corresponding to row 11 of the category matrix, the 1 × 768 vector in row 11 of the category matrix is extracted and taken as the category vector of the first word segmentation list containing the word 'Ha'; the category vectors of the second and third word segmentation lists containing 'Ha' are obtained in the same way. The fused embedded vector of 'Ha' is then obtained in turn by the weighted fusion formula above.
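The fusion layer can be sketched in PyTorch as follows; the 93 × 768 category matrix and 768-dimensional BERT embeddings follow the embodiment, while the softmax normalization of the weights w_k and the module interface are assumptions of this sketch:

```python
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    """Splice each character's BERT embedding with the category vector found in
    each of the three word segmentation lists, then take a weighted sum of the
    three spliced vectors (steps 151-153)."""
    def __init__(self, num_classes=93, dim=768, num_lists=3):
        super().__init__()
        # Learnable category matrix, randomly initialized as in the embodiment.
        self.class_matrix = nn.Parameter(torch.randn(num_classes, dim))
        # Raw weights w_1..w_3; softmax normalization is an assumption.
        self.raw_weights = nn.Parameter(torch.zeros(num_lists))

    def forward(self, char_tensor, class_ids):
        # char_tensor: (seq_len, dim) per-character BERT embeddings.
        # class_ids: (seq_len, num_lists) row indices into the category matrix.
        class_vecs = self.class_matrix[class_ids]               # (seq_len, num_lists, dim)
        chars = char_tensor.unsqueeze(1).expand_as(class_vecs)  # repeat embedding per list
        spliced = torch.cat([chars, class_vecs], dim=-1)        # (seq_len, num_lists, 2*dim)
        w = torch.softmax(self.raw_weights, dim=0)              # normalized fusion weights
        return (w.view(1, -1, 1) * spliced).sum(dim=1)          # (seq_len, 2*dim) fusion embedding
```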
S16, the fused embedded vector of each word is combined with the vectors corresponding to the special characters to form the entity embedded vector. The special characters here are the special character 'CLS' and the special character 'SEP' from step S12.
S17, inputting the fused entity embedded vector into a BiLSTM network to obtain a corpus fragment transmitting matrix.
In specific operation, the fused entity embedded vector is input into the BiLSTM network to obtain the entity's hidden state vectors, and the hidden state vectors are input into a fully connected layer to obtain the corpus segment emission matrix Emit_m. Emit_m is a matrix of dimensions Tag_num × Span_Len, where Tag_num is the number of tags, here the 5 tags O, B, I, E, S, and Span_Len is the number of embedded vectors in the word-vector sequence of the entity text, specifically the number of words of the entity text plus 2, i.e., the sum of the word count and the two special characters.
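A sketch of this step in PyTorch, assuming a hidden size of 256 (not specified in the patent); note that the sketch returns the emission scores laid out as (span_len, tag_num), the transpose of the Tag_num × Span_Len layout described above:

```python
import torch.nn as nn

class EmissionNet(nn.Module):
    """BiLSTM over the fused entity embedding, then a fully connected layer
    producing a score for each of the 5 tags O/B/I/E/S at every position."""
    def __init__(self, in_dim=1536, hidden=256, tag_num=5):
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, tag_num)

    def forward(self, fused):                   # fused: (batch, span_len, in_dim)
        hidden_states, _ = self.bilstm(fused)   # (batch, span_len, 2*hidden)
        return self.fc(hidden_states)           # emissions: (batch, span_len, tag_num)
```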
S18, the emission matrix is input into the CRF network, and the correct tag sequence score and the total score of all possible tag sequences are calculated from the emission matrix and the transition matrix.
The specific process of the method is as follows:
181. For each word in the entity text: if the word does not belong to the key segment, it is marked O; if the word belongs to a key segment whose character count is greater than 1, the first character is marked B, the last character is marked E, and the other characters are marked I; if the word belongs to a key segment whose character count equals 1, the character is marked S.
182. The marks are arranged in the order of the words in the entity text, and the mark O is added at the head and tail of the sequence, i.e., the marks corresponding to the two special characters 'CLS' and 'SEP', forming the correct tag sequence of the sample.
Steps 181 and 182 derive, for each sample, the correct tag sequence from the entity text and the key segment in the sample. For example:
Sample one: [Haha City Happy Star DiXin Company, Happy Star DiXin]
The sequence obtained after step 181 is [O, O, O, B, I, I, I, I, E, O, O]
The correct tag sequence obtained after step 182 is [O, O, O, O, B, I, I, I, I, E, O, O, O].
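A sketch of steps 181-182, assuming the key segment occurs exactly once in the entity text:

```python
def bioes_tags(entity_text: str, key_fragment: str):
    # Build the correct tag sequence for one sample, with O padding
    # for the [CLS]/[SEP] positions at the head and tail.
    tags = ["O"] * len(entity_text)
    start = entity_text.find(key_fragment)
    if start >= 0 and key_fragment:
        if len(key_fragment) == 1:
            tags[start] = "S"
        else:
            tags[start] = "B"
            tags[start + len(key_fragment) - 1] = "E"
            for i in range(start + 1, start + len(key_fragment) - 1):
                tags[i] = "I"
    return ["O"] + tags + ["O"]  # head/tail marks for the special characters
```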
183. The emission matrix Emit_m is input into the CRF network, which derives the correct tag sequence score and the total score of all possible tag sequences from the emission matrix Emit_m and the transition matrix Trans_m using the loss score formula.
Here the correct tag sequence is the sequence identical to the annotated sequence of the sample, and all possible tag sequences are the Tag_num^Span_Len sequences the model can generate, where Tag_num is the number of tags in the tag set and Span_Len is the number of word vectors of the entity. The transition matrix Trans_m in the CRF network is initialized with random values; the values of Trans_m in the s-th training round are the values adjusted after round s-1.
The score calculation formula for each tag sequence is:

score(x, y) = Σ_{i=1}^{s} Emit_m[y_i, i] + Σ_{i=2}^{s} Trans_m[y_{i-1}, y_i]

where score(x, y) is the score of labeling the input sample x with the tag sequence y, Emit_m[y_i, i] is the emission probability value of the i-th tag in the predicted tag sequence y, s is the length of the entire predicted tag sequence y, and Trans_m[y_{i-1}, y_i] is the transition probability value of moving from the (i-1)-th tag to the i-th tag of the predicted sequence y.
S19, the loss score is calculated from the correct tag sequence score and the total score of all possible tag sequences.
The loss score calculation formula is:

Loss(x, y̅) = -score(x, y̅) + log Σ_{y'} exp(score(x, y'))

where score(x, y̅) is the score of the correct tag sequence y̅ for the input sample x, score(x, y') is the score of any possible tag sequence y' for the input sample x, the summation raises the natural base e to each possible tag sequence's score and accumulates over all possible tag sequences, and Loss(x, y̅) is the loss score of the correct tag sequence y̅ for the input sample x.
S110, traversing the training samples, after traversing the training samples once, modifying and updating key segment extraction model parameters by using a gradient descent method, and then testing the accuracy of the model by using the verification samples, and selecting a parameter version with the highest verification accuracy as a final trained model.
The model traverses the training samples multiple times; after each traversal it updates the model parameters by gradient descent and then tests the model's accuracy on the verification samples. During verification, if the obtained key segment is inconsistent with the sample's key segment it counts as a miss, and if consistent it counts as a hit. The model's accuracy is the total number of hits divided by the total number of verification samples. The parameter version with the highest verification accuracy is selected as the final trained model.
And S2, extracting the key fragments in the entity text description by using a key fragment extraction model.
After step S1 the model is trained, and this step uses the key segment extraction model for inference. The specific process is as follows:
s21, obtaining the corresponding key fragments of each entity text in the candidate entity corpus by using a key fragment extraction model.
For each entity text, the text is input into the model to obtain its tag sequence. The first and last tag are removed from the sequence, and the words at all non-O tag positions are taken out as the key segment of this entity text.
For example, inputting the entity text "Haha City Happy Star DiXin Company" into the model gives the tag sequence [O, O, O, O, B, I, I, I, I, E, O, O, O]. Removing the head and tail marks gives [O, O, O, B, I, I, I, I, E, O, O], and taking out the words at the non-O positions gives the key segment "Happy Star DiXin".
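A sketch of this decoding step:

```python
def extract_key_fragment(entity_text: str, tag_seq):
    # Drop the head/tail special-character tags, then take the characters
    # at all non-O positions as the key segment of the entity text.
    inner = tag_seq[1:-1]  # remove the [CLS]/[SEP] tags
    chars = [c for c, t in zip(entity_text, inner) if t != "O"]
    return "".join(chars)
```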
S22, if the key segment inferred by the key segment extraction model is empty, the corresponding entity text is deleted; if the inferred key segment is not empty, the obtained key segment and the entity text are combined into an entity key segment tuple, and all entity key segment tuples together form the entity key segment tuple corpus. Each entity key segment tuple has the form [entity text, key segment].
For example, the entity text "Haha City Happy Star DiXin Company" and the key segment "Happy Star DiXin" form the entity key segment tuple [Haha City Happy Star DiXin Company, Happy Star DiXin].
If the key segment extracted by the model is empty, e.g., for the entity text "Haha City" the predicted tag sequence is [O, O, O, O, O], the entity text "Haha City" is deleted.
All entity key fragment tuples together constitute an entity key fragment tuple corpus.
S23, performing de-duplication treatment on all the obtained Key fragments to obtain a Key fragment Corpus Key_Corpus.
And S3, constructing a directed graph according to the extracted key fragments, and merging the key fragments describing the same entity by using a merging algorithm.
This step completes the clustering of entity key segments and thereby entity alignment. As shown in fig. 4, the process comprises the following steps:
s31, constructing a directed graph Key_Map for each Key segment in the Key segment corpus, specifically, each Key segment is taken as a Node, and is endowed with an attribute, namely an entity category, the entity category is assigned as the entity, the Node is traversed, each Node is compared with other nodes only once, and if the Node is compared with the other nodes i Comprises Node j All words of Node i And Node j Connect one edge, and Node i Adding 1, node j The output of (2) is increased by 1. In this embodiment, node is referred to i Is Node j Arc head, node j Is Node i Arc tails, i.e. Node j Pointing to Node i 。
Take the key segment corpus [Kefa, Qianke Development, Qianke Company, Qianke Development Co., Ltd., Science and Technology Development, Science and Technology Development Hall, Qianke]: node 1 represents "Kefa", node 2 represents "Qianke Development", and the other nodes are as shown in fig. 4.
Taking node 1 as an example, nodes 4, 2, 5 and 6 all contain all the words of node 1, so node 1 has an out-degree of 4, and the in-degree of each of nodes 4, 2, 5 and 6 increases by 1.
Taking node 7 as an example, nodes 2, 4 and 3 all contain all the words of node 7, so node 7 has an out-degree of 3.
The fully constructed directed graph is shown as stage 1 of fig. 4, where the entity class value of every node is the node itself.
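A sketch of the graph construction, reading "contains all the words" as character-set containment between key segments (the Node structure is an assumption of this sketch):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str
    entity_class: str                        # initialized to the key segment itself (step S32)
    heads: set = field(default_factory=set)  # arc-head nodes this node points to; out-degree = len(heads)
    in_degree: int = 0

def build_key_map(key_corpus):
    # Node_j points to Node_i whenever Node_i contains all the words of
    # Node_j; containment is read here as character-set inclusion.
    nodes = [Node(text, text) for text in key_corpus]
    for a in nodes:
        for b in nodes:
            if a is not b and set(b.text) <= set(a.text):
                b.heads.add(a.text)          # a is an arc head of b
                a.in_degree += 1
    return {node.text: node for node in nodes}
```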
S32, for the directed graph Key_Map, the entity class value of every node with out-degree 0 is set to the node itself.
S33, the nodes with out-degree greater than 0 are traversed; every node with out-degree 1 is assigned the entity class value of its arc-head node, and its out-degree is reduced by 1.
S34, the nodes with out-degree greater than 0 are traversed and the arc-head nodes of each node are checked; if the entity class value of an arc-head node is itself another arc head of the current node, the edge to that arc-head node is deleted and the current node's out-degree is reduced by 1. In the assignment of step S33, if Node_i has an out-degree of 1 and points to Node_j, i.e., Node_j is the arc head of Node_i, and the entity class value of Node_j at this time is Node_k, then Node_i is likewise assigned the entity class value Node_k and its out-degree is reduced by 1.
S35, steps S33 and S34 are repeated until a full traversal of the directed graph produces no change.
Steps S32-S35 are the specific operations performed on the directed graph Key_Map. With reference to fig. 4, a worked example of the process is as follows:
in the first step, 2 nodes with a degree of 0, node 4 and node 6 are found. And assign their entity class to itself, node 4 to "Qianke development Co., ltd", and node 6 to "science and technology development Hall".
In the second step, the nodes with out-degree 1 are found and step S33 is executed:
node 3 has out-degree 1, so its entity class is assigned to "Qianke Development Co., Ltd." and its out-degree is reduced by 1;
node 2 has out-degree 1, so its entity class is assigned to "Qianke Development Co., Ltd." and its out-degree is reduced by 1;
node 5 has out-degree 1, so its entity class is assigned to "Science and Technology Development Hall" and its out-degree is reduced by 1.
In the third step, step S34 is executed:
node 7 has three arc-head nodes, namely nodes 2, 3 and 4; because the entity class of node 2 is node 4, and node 4 is also an arc head of node 7, the arc to node 2 is deleted and node 7's out-degree is reduced by 1; node 3 is treated in the same way, so the arc to node 3 is deleted and the out-degree is reduced by 1 again; node 7 is then left with only 1 arc-head node, and its out-degree is also 1.
Node 1 has four arc-head nodes; operating as with node 7, the arcs to arc-head nodes 2 and 5 are deleted, and its out-degree changes from 4 to 2.
The graph at this point is shown as stage 2 in fig. 4.
In the fourth step, step S33 is executed: node 7 has out-degree 1, so its entity class is assigned to "Qianke Development Co., Ltd.".
In the fifth step, step S34 is executed; no node changes.
In the sixth step, step S33 is executed; no node changes.
In the seventh step, step S34 is executed; no node changes.
Since the operations of S33 and S34 now produce no change, the process stops; the directed graph at this point is shown as stage 3 in fig. 4.
Finally, the entity classes of nodes 2, 3, 4 and 7 are all "Qianke Development Co., Ltd.", the entity classes of nodes 5 and 6 are both "Science and Technology Development Hall", and the entity class of node 1 is "Kefa" itself.
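A sketch of steps S32-S35 operating on the Key_Map built above; step S32 is covered by initializing each node's entity class to itself:

```python
def merge_entity_classes(key_map):
    # S33: a node with out-degree 1 adopts the entity class value of its arc head.
    # S34: an arc is deleted when its head's class value is itself another arc
    #      head of the same node. S35: repeat until a full pass changes nothing.
    changed = True
    while changed:
        changed = False
        for node in key_map.values():
            if len(node.heads) == 1:                          # S33
                head = next(iter(node.heads))
                node.entity_class = key_map[head].entity_class
                node.heads.clear()                            # out-degree 1 -> 0
                changed = True
            elif len(node.heads) > 1:                         # S34
                for head in list(node.heads):
                    cls = key_map[head].entity_class
                    if cls != head and cls in node.heads:
                        node.heads.discard(head)              # prune the redundant arc
                        changed = True
    return {text: node.entity_class for text, node in key_map.items()}
```

Run on the example corpus, this converges to the stage-3 result: nodes 2, 3 and 7 take the class "Qianke Development Co., Ltd.", node 5 takes "Science and Technology Development Hall", and node 1 keeps itself.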
S36, for each entity key segment tuple, searching a node corresponding to the key segment in the directed graph, and replacing the key segment with an entity class value of the corresponding node.
Each entity key segment tuple in the entity key segment tuple corpus is processed in this step. For example, for the tuple [Qianke Company, Qianke Company], the entity class value of the node where the key segment "Qianke Company" is located is "Qianke Development Co., Ltd.", so the key segment is changed to "Qianke Development Co., Ltd.", and the modified entity key segment tuple is [Qianke Company, Qianke Development Co., Ltd.].
S37, traversing the entity key segment tuple corpus, and merging the same key segments into one type.
For example, given the entity key segment tuples [Qianke Company, Qianke Development Co., Ltd.], [Haha Science and Technology Development, Science and Technology Development Hall] and [Haha Qianke, Qianke Development Co., Ltd.],
after the operation of this step, [Qianke Company, Haha Qianke] form one class and [Haha Science and Technology Development] forms another class.
And S4, merging entity texts corresponding to the key fragments describing the same entity.
And merging entity texts in entity key fragment tuples corresponding to the key fragments gathered into one type.
That is, continuing the example of step S37, after the merging of this step the final output is [[Qianke Company, Haha Qianke], [Haha Science and Technology Development]].
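A sketch of steps S36, S37 and S4 combined, assuming the entity class mapping produced by the graph merge above:

```python
from collections import defaultdict

def merge_entity_texts(entity_tuples, entity_classes):
    # entity_tuples: [(entity_text, key_fragment), ...] from step S22;
    # entity_classes: {key_fragment: entity_class_value} from the graph merge.
    groups = defaultdict(list)
    for entity_text, key_fragment in entity_tuples:
        groups[entity_classes[key_fragment]].append(entity_text)
    return list(groups.values())

# e.g. ("Qianke Company", "Qianke Company") and ("Haha Qianke", "Qianke"), whose
# key segments both map to "Qianke Development Co., Ltd.", merge into one class.
```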
By fusing, at the fusion layer, each word's embedded vector with the category information of the segmented words it belongs to, the invention makes the embedded vector express each word of the entity text more accurately and with stronger generalization capability. For example, suppose the model is trained on data from the Wuhan region and then used for inference on the Yichang region. For the entity text "Yichang Happy Star Company", the model has never seen "Yichang", but segmenting the entity text yields the word "Yichang", whose category obtained through the PaddleNLP tool is "World Region", the same category as the "Wuhan" seen in training; the model can therefore judge more accurately that "Yichang" does not belong to the key segment. The PaddleNLP tool is trained on large-scale data and carries rich general knowledge. The knowledge enhancement of the invention consists precisely in fusing the knowledge in the PaddleNLP tool into the model in this way, improving the model's inference accuracy and generalization capability.
In summary, the invention provides a knowledge enhancement-based entity alignment method. Different description texts of the same entity can be automatically clustered together under the condition of big data, and support and guarantee are provided for public opinion analysis in urban treatment and entity knowledge graph construction.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.
Claims (5)
1. A knowledge-enhancement-based entity alignment method, the method comprising the steps of:
s1, training a key segment extraction model;
s2, extracting key fragments in the entity text description by using a key fragment extraction model;
s3, constructing a directed graph according to the extracted key fragments, and merging the key fragments describing the same entity by using a merging algorithm;
s4, merging entity texts corresponding to the key fragments describing the same entity;
the specific process of the step S1 is as follows:
s11, inputting a training sample set, wherein the sample format in the sample set is [ entity text, key fragment ];
s12, converting each sample into corresponding word codes by using a BERT tokenizer, adding special character codes at the head and tail to form the word encoding of the entity text, and inputting the word encoding into a BERT model to obtain an embedded vector of each word;
s13, aiming at each sample, performing word segmentation processing on the entity text by using a plurality of word segmentation tools to obtain a multi-group word segmentation list;
s14, classifying each word in each group of word list by using a PaddleNLP tool to obtain the category attribute of the word;
s15, for each word in the entity text, weighting and fusing the embedded vector of the word with the category vectors corresponding to the category attributes of the segmented words containing the word, to obtain a fused embedded vector of each word;
s16, forming an entity embedded vector by the fusion embedded vector of each word and the vector corresponding to the special character;
s17, inputting the fused entity embedded vector into a BiLSTM network to obtain a corpus fragment emission matrix;
s18, inputting the emission matrix into a CRF network, and calculating the correct tag sequence score and the total score of all possible tag sequences according to the emission matrix and the transition matrix;
s19, calculating a loss score according to the correct tag sequence score and the total score of all possible tag sequences;
s110, traversing the training samples; after each traversal of the training samples, updating the key segment extraction model parameters by gradient descent, then testing the accuracy of the model with the verification samples, and selecting the parameter version with the highest verification accuracy as the final trained model;
the specific process of the step S2 is as follows:
s21, obtaining a corresponding key segment of each entity text in the candidate entity corpus by using a key segment extraction model;
s22, if the key segment extracted by the key segment extraction model is empty, deleting the corresponding entity text; if the inferred key segment is not empty, combining the obtained key segment with the entity text to obtain an entity key segment tuple, all entity key segment tuples together forming an entity key segment tuple corpus;
s23, performing de-duplication treatment on all the obtained key fragments to obtain a key fragment corpus;
the specific process of the step S3 is as follows:
s31, constructing a directed graph Key_Map from the key segments in the key segment corpus, specifically: taking each key segment as a node, giving it an attribute, namely the entity class, and assigning the entity class to the key segment itself; traversing the nodes, comparing each node with every other node exactly once, and if Node_i contains all the words of Node_j, connecting an edge between Node_i and Node_j, increasing the in-degree of Node_i by 1 and the out-degree of Node_j by 1;
s32, for the directed graph Key_Map, setting the entity class value of every node with out-degree 0 to the node itself;
s33, traversing the nodes with out-degree greater than 0, and for every node with out-degree 1, assigning its entity class value to the entity class value of its arc-head node and reducing its out-degree by 1;
s34, traversing the nodes with out-degree greater than 0, checking the arc-head nodes of each node, and if the entity class value of an arc-head node is itself another arc head of the current node, deleting the edge to that arc-head node and reducing the current node's out-degree by 1;
s35, repeating the steps S33 and S34 until the directed graph traverses all nodes without change;
s36, for each entity key segment tuple, searching a node corresponding to the key segment in the directed graph, and replacing the key segment with an entity class value of the corresponding node;
s37, traversing the entity key segment tuple corpus, and merging the same key segments into one type.
2. The knowledge-enhancement-based entity alignment method of claim 1, wherein there are three word segmentation tools, and the specific process of step S15 is as follows:
151. for each word Ci in the entity text, searching words belonging to the word Ci in the three word segmentation lists, and obtaining category attributes of the belonging words;
152. acquiring category vectors of the word Ci in three word segmentation lists according to the category attributes;
153. each word Ci is weighted and fused:

Concat_k = Concat(Char_Tensor, Classk_Tensor), k = 1, 2, 3
Fusion_Tensor = w_1·Concat_1 + w_2·Concat_2 + w_3·Concat_3

wherein Char_Tensor is the embedded vector obtained for word Ci via the BERT model, Classk_Tensor is the category vector of the segmented word containing Ci in the k-th word segmentation list, w_k is the weight of the k-th spliced vector, Concat(·,·) splices the two vectors together, and Fusion_Tensor is the fused embedded vector of word Ci.
3. The knowledge-enhancement-based entity alignment method of claim 2, wherein the process of step S17 is: inputting the fused entity embedded vector into a BiLSTM network to obtain the hidden-layer state vector of the entity, and inputting the hidden-layer state vector into a fully connected layer to obtain the corpus segment emission matrix Emit_m, wherein Emit_m is a matrix of dimensions Tag_num × Span_Len, Tag_num is the number of tags, 5 in total, and Span_Len is the number of embedded vectors in the word-vector sequence of the entity text.
4. The knowledge-enhancement-based entity alignment method of claim 3, wherein the process of step S18 is:
181. for each word in the entity text: if the word does not belong to the key segment, marking it O; if it belongs to a key segment whose character count is greater than 1, marking the first character B, the last character E and the remaining characters I; if it belongs to a key segment whose character count equals 1, marking it S;
182. arranging the marks in the order of the words in the entity text, and adding the mark O at the head and tail of the sequence to form the correct tag sequence corresponding to the sample;
183. inputting the emission matrix Emit_m into the CRF network, which derives the correct tag sequence score and the total score of all possible tag sequences from the emission matrix Emit_m and the transition matrix Trans_m using the loss score formula.
5. The knowledge-enhancement-based entity alignment method of claim 4, wherein the specific process of step S4 is as follows:
and merging entity texts in entity key fragment tuples corresponding to the key fragments gathered into one type.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310063495.1A CN115795060B (en) | 2023-02-06 | 2023-02-06 | Entity alignment method based on knowledge enhancement |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310063495.1A CN115795060B (en) | 2023-02-06 | 2023-02-06 | Entity alignment method based on knowledge enhancement |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115795060A CN115795060A (en) | 2023-03-14 |
CN115795060B true CN115795060B (en) | 2023-04-28 |
Family
ID=85429829
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310063495.1A Active CN115795060B (en) | 2023-02-06 | 2023-02-06 | Entity alignment method based on knowledge enhancement |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115795060B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116579344B (en) * | 2023-07-12 | 2023-10-20 | 吉奥时空信息技术股份有限公司 | Case main body extraction method |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111680488B (en) * | 2020-06-08 | 2023-07-21 | 浙江大学 | Cross-language entity alignment method based on knowledge graph multi-view information |
CN111753024B (en) * | 2020-06-24 | 2024-02-20 | 河北工程大学 | Multi-source heterogeneous data entity alignment method oriented to public safety field |
WO2022011681A1 (en) * | 2020-07-17 | 2022-01-20 | 国防科技大学 | Method for fusing knowledge graph based on iterative completion |
CN114528411B (en) * | 2022-01-11 | 2024-05-07 | 华南理工大学 | Automatic construction method, device and medium for Chinese medicine knowledge graph |
- 2023-02-06: application CN202310063495.1A filed in China; granted as CN115795060B (active)
Also Published As
Publication number | Publication date |
---|---|
CN115795060A (en) | 2023-03-14 |
Legal Events
Date | Code | Title
---|---|---
| PB01 | Publication
| SE01 | Entry into force of request for substantive examination
| GR01 | Patent grant