CN115795060A - Entity alignment method based on knowledge enhancement - Google Patents


Info

Publication number
CN115795060A
CN115795060A (application CN202310063495.1A)
Authority
CN
China
Prior art keywords
entity
key
node
word
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310063495.1A
Other languages
Chinese (zh)
Other versions
CN115795060B (en)
Inventor
杨伊态
韩小乐
赵舞玲
陈胜鹏
付卓
王敬佩
李颖
黄亚林
张兆文
李成涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Geospace Information Technology Co ltd
Original Assignee
Geospace Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Geospace Information Technology Co ltd filed Critical Geospace Information Technology Co ltd
Priority to CN202310063495.1A priority Critical patent/CN115795060B/en
Publication of CN115795060A publication Critical patent/CN115795060A/en
Application granted granted Critical
Publication of CN115795060B publication Critical patent/CN115795060B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of knowledge graphs and provides an entity alignment method based on knowledge enhancement, comprising the following steps: S1, training a key fragment extraction model; S2, extracting the key fragments in entity text descriptions with the key fragment extraction model; S3, constructing a directed graph from the extracted key fragments and merging the key fragments that describe the same entity with a merging algorithm; and S4, merging the entity texts corresponding to the key fragments that describe the same entity. Compared with existing methods, the method integrates external knowledge into the key fragment extraction model, which improves the model's extraction accuracy and generalization ability for key fragments and thereby improves the accuracy and generalization ability of the entity alignment method.

Description

Entity alignment method based on knowledge enhancement
Technical Field
The invention belongs to the technical field of knowledge graphs, and in particular relates to an entity alignment method based on knowledge enhancement.
Background
With the gradual maturation of artificial-intelligence technology, using named entity recognition to extract entities from corpora in the field of urban governance has become widespread. In a knowledge graph, entity alignment means aligning different entity descriptions that point to the same real entity. For a given service, not only must a specific type of entity be extracted, but different textual descriptions of the same entity must also be associated with one another. Take a virtual company as an example: the text descriptions "Jiayi Geographic Information Company", "Jiayi Geographic Information", and "Jiayi Geo-Info" all point to the same virtual company, "Jiayi Geographic Information Co., Ltd.". A method that merges different text descriptions of the same entity is an entity alignment method.
The current entity alignment methods are mainly classified into three categories.
The first type is a rule-based entity alignment method. The method designs rules according to the characteristics of the entity description texts, and then judges whether different entity description texts point to the same real entity according to the rules.
For example, construct prefix and suffix stop dictionaries, remove any prefix or suffix of an entity description that appears in the dictionaries, then compare the stripped descriptions; if they are identical, they describe the same entity. Take virtual company 1, "Wuhan Happy Planet Geo-Info Company", and virtual company 2, "Happy Planet Geo-Info Company", where the prefix stop dictionary contains "Wuhan" and the suffix stop dictionary contains "Company". Virtual company 1 becomes "Happy Planet Geo-Info" after removing the prefix "Wuhan" and the suffix "Company", and virtual company 2 also becomes "Happy Planet Geo-Info" after removing the suffix "Company". Since the two stripped descriptions are identical, the method concludes that "Wuhan Happy Planet Geo-Info Company" and "Happy Planet Geo-Info Company" describe the same entity.
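The stop-dictionary rule above can be sketched as follows (an illustrative sketch only; the dictionaries and the translated virtual company names are taken from the example in this description):

```python
def strip_affixes(name, prefixes, suffixes):
    """Remove at most one stop-dictionary prefix and one suffix
    from an entity description."""
    for p in prefixes:
        if name.startswith(p):
            name = name[len(p):]
            break
    for s in suffixes:
        if name.endswith(s):
            name = name[:-len(s)]
            break
    return name

prefix_dict = ["Wuhan "]      # prefix stop dictionary (virtual example)
suffix_dict = [" Company"]    # suffix stop dictionary (virtual example)

a = strip_affixes("Wuhan Happy Planet Geo-Info Company", prefix_dict, suffix_dict)
b = strip_affixes("Happy Planet Geo-Info Company", prefix_dict, suffix_dict)
same_entity = (a == b)  # identical stripped descriptions -> same entity
```

Because both descriptions strip to the same string, the rule judges them to describe one entity.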
As another example, an edit-distance-based method decides whether two entity description texts point to the same real entity according to a normalized edit-distance similarity between them. Suppose the threshold is set to 0.7 and the similarity between "Happy Planet Geo-Info Company" and "Wuhan Happy Planet Geo-Info Company" is 0.78; the method then concludes that the two descriptions refer to the same entity.
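A minimal sketch of such a similarity check, using Python's `difflib.SequenceMatcher` ratio as a stand-in for the (unspecified) normalized edit-distance measure:

```python
from difflib import SequenceMatcher

def judge_same_entity(a, b, threshold=0.7):
    """Treat two descriptions as the same entity when their normalized
    similarity ratio reaches the threshold (edit-distance style)."""
    ratio = SequenceMatcher(None, a, b).ratio()
    return ratio >= threshold, ratio

judged, ratio = judge_same_entity("Wuhan Happy Planet Geo-Info Company",
                                  "Happy Planet Geo-Info Company")
```

The exact similarity value depends on the chosen normalization; the patent only specifies that a fixed threshold such as 0.7 is compared against.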
Because the rules are fixed, such methods can only handle simple entity alignment tasks with uniform forms, and their accuracy is poor on entity alignment tasks with complex text descriptions.
The second category is machine learning based methods. The method designs a machine learning model based on mathematical ideas such as statistics, discrete graph theory and the like, constructs entity characteristics according to entity corpora, inputs the entity characteristics into the model, and judges whether different entity corpora point to the same real entity.
One example is the term frequency-inverse document frequency (TF-IDF) method: if a word occurs in many entities, its TF-IDF score is low; if a word occurs rarely across the whole entity corpus but frequently within one entity, its TF-IDF score is high. The words with higher TF-IDF scores are then used as the entity description, the filtered descriptions are compared with one another, and consistent descriptions are judged to point to the same real entity. For example, for virtual company 1, "Wuhan Happy Planet Geo-Info Company", and virtual company 2, "Happy Planet Geo-Info Company", the word list of company 1 after segmentation is ["Wuhan", "Happy", "Planet", "Geo-Info", "Company"] and that of company 2 is ["Happy", "Planet", "Geo-Info", "Company"]. Assuming "Wuhan" and "Company" occur frequently across the whole entity corpus, their scores are low, while "Happy", "Planet", and "Geo-Info" score high. After TF-IDF filtering, the description of company 1 becomes ["Happy", "Planet", "Geo-Info"] and the description of company 2 is the same, so the method concludes that "Wuhan Happy Planet Geo-Info Company" and "Happy Planet Geo-Info Company" describe the same entity.
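The TF-IDF filtering step can be sketched with a toy corpus (the corpus, threshold value, and translated names are illustrative assumptions, not values from the patent):

```python
import math
from collections import Counter

def tfidf_filter(entity_words, corpus, threshold):
    """Keep the words of one entity whose TF-IDF score exceeds the
    threshold; corpus is a list of word lists, one per entity."""
    n_docs = len(corpus)
    df = Counter(w for doc in corpus for w in set(doc))  # document frequency
    tf = Counter(entity_words)                           # term frequency
    return [w for w in entity_words
            if (tf[w] / len(entity_words)) * math.log(n_docs / df[w]) > threshold]

corpus = [
    ["Wuhan", "Happy", "Planet", "Geo-Info", "Company"],
    ["Happy", "Planet", "Geo-Info", "Company"],
    ["Wuhan", "Iron", "Company"],
    ["Wuhan", "Flower", "Company"],
]
desc1 = tfidf_filter(corpus[0], corpus, 0.1)
desc2 = tfidf_filter(corpus[1], corpus, 0.1)
```

With "Wuhan" and "Company" frequent across the corpus, both descriptions reduce to the same filtered word list, so they are judged to describe one entity.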
Such methods can handle entity alignment tasks with complex text descriptions, but because they exploit only shallow semantic information, their generalization ability is poor.
The third category is deep-learning-based methods. These methods automatically learn discriminative features of the corpus by building a deep neural network and judge whether different text descriptions point to the same real entity.
One example is a model based on word-vector similarity. First, a pre-trained neural network model is used as the word-vector model, converting each word of a description into a vector; then, whether two descriptions point to the same entity is judged from the similarity of the word vectors between the descriptions.
Such methods can handle entity alignment tasks with complex text descriptions and can exploit deep semantic information, but their generalization ability is still insufficient. For example, a model trained only on data from the virtual city "Haha City" can understand that the text fragment "Haha City" is an administrative district name, because the fragment "Haha" appears in the training set together with the marker character "City". It has difficulty, however, with the fragment "Hengheng" on its own, because "Hengheng" appears neither in the training set nor alongside the marker character "City", so the model cannot judge the importance of "Hengheng" within an entity description (for example, for the entity "Hengheng Happy Planet Geo-Info Company", if "Hengheng" could be recognized as an administrative name, the model could focus its comparison on the description "Happy Planet Geo-Info Company").
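The word-vector comparison can be sketched with toy vectors (the 3-dimensional vectors below are invented stand-ins; a real system would use a pre-trained embedding model):

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mean_vector(words, word_vectors):
    """Average the word vectors of a description into one vector."""
    dims = len(next(iter(word_vectors.values())))
    acc = [0.0] * dims
    for w in words:
        for i, x in enumerate(word_vectors[w]):
            acc[i] += x
    return [x / len(words) for x in acc]

# toy "word vectors" (illustrative values only)
vecs = {"Happy": [1.0, 0.0, 0.0], "Planet": [0.0, 1.0, 0.0],
        "Geo-Info": [0.0, 0.0, 1.0], "Wuhan": [1.0, 1.0, 1.0]}
sim = cosine(mean_vector(["Happy", "Planet", "Geo-Info"], vecs),
             mean_vector(["Wuhan", "Happy", "Planet", "Geo-Info"], vecs))
```

A high similarity between the averaged vectors of the two descriptions would lead the model to judge them as the same entity.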
Disclosure of Invention
In view of the above problems, the present invention provides an entity alignment method based on knowledge enhancement, aiming to solve the technical problem that existing methods have insufficient extraction accuracy and generalization ability.
The invention adopts the following technical scheme:
the entity alignment method based on knowledge enhancement comprises the following steps:
s1, training a key fragment extraction model;
s2, extracting key fragments in the entity text description by using a key fragment extraction model;
and S3, constructing a directed graph according to the extracted key fragments, and merging the key fragments describing the same entity by using a merging algorithm.
And S4, merging entity texts corresponding to the key segments describing the same entity.
Further, the specific process of step S1 is as follows:
s11, inputting a training sample set, wherein the sample format in the sample set is [ entity text, key fragment ];
s12, converting the word into corresponding word codes by using a BERT word separator aiming at each sample, adding special character codes at the head and the tail of the word codes to form word codes of an entity text, and then inputting the word codes into a BERT model to obtain an embedded vector of each word;
s13, performing word segmentation processing on the entity text by using various word segmentation tools aiming at each sample to obtain a multi-group word segmentation list;
s14, classifying each word in each group of word segmentation list by using a PaddleNLP tool to obtain the class attribute of the word;
s15, weighting and fusing the embedded vector of each character in the entity text and the category vector corresponding to the category attribute of the word to which the character belongs to obtain a fused embedded vector of each character;
s16, forming entity embedded vectors by the fused embedded vectors of each word and the vectors corresponding to the special characters;
s17, inputting the fused entity embedding vector into a BilSTM network to obtain a corpus fragment emission matrix;
s18, inputting the emission matrix into a CRF network, and calculating to obtain a correct mark sequence score and a total score of all possible mark sequences according to the emission matrix and the transfer matrix;
s19, calculating loss scores according to the scores of the correct marker sequences and the total scores of all possible marker sequences;
and S110, traversing the training samples; after each pass over the training samples, updating the parameters of the key fragment extraction model with gradient descent, then testing the model's accuracy on the validation samples, and finally selecting the parameter version with the highest validation accuracy as the trained model.
Further, there are three word segmentation tools, and the specific process of step S15 is as follows:
151. for each character Ci in the entity text, finding the word containing Ci in each of the three word segmentation lists to obtain that word's category attribute;
152. obtaining the category vectors of the character Ci in the three word segmentation lists from the category attributes;
153. and performing weighted fusion on each character Ci:
T_1(Ci) = Char_Tensor(Ci) ⊕ Class1_Tensor(Ci)
T_2(Ci) = Char_Tensor(Ci) ⊕ Class2_Tensor(Ci)
T_3(Ci) = Char_Tensor(Ci) ⊕ Class3_Tensor(Ci)
Fuse_Tensor(Ci) = w_1·T_1(Ci) + w_2·T_2(Ci) + w_3·T_3(Ci)
wherein Char_Tensor(Ci) is the embedded vector obtained by passing the character Ci through the BERT model, Classi_Tensor(Ci) is the category vector of the word containing Ci in the i-th word segmentation list, w_i represents the weight of the i-th spliced vector, ⊕ denotes splicing the two parts together into one vector, and Fuse_Tensor(Ci) is the fused embedded vector of the character Ci.
Further, the process of step S17 is: the fused entity embedded vector is input into a BiLSTM network to obtain the hidden-state vector of the entity, and the hidden-state vector is input into a fully connected layer to obtain the corpus fragment emission matrix Emit_m, where Emit_m is a Tag_num × Span_Len matrix, Tag_num is the number of tags (the 5 tags O, B, I, E and S), and Span_Len is the number of embedded vectors of the entity text.
Further, the process of step S18 is:
181. for each character in the entity text: if the character does not belong to the key fragment, tag it O; if it belongs to a key fragment of more than one character, tag the first character B, the last character E and the other characters I; and if it belongs to a key fragment of exactly one character, tag it S;
182. arranging the tags of the characters into a sequence and adding the tag O at the head and the tail to form the correct tag sequence of the sample;
183. inputting the emission matrix Emit_m into a CRF network, which obtains the score of the correct tag sequence and the total score of all possible tag sequences with the loss-score formula, based on the emission matrix Emit_m and the transfer matrix Trans_m.
Further, the specific process of step S2 is as follows:
s21, obtaining a corresponding key fragment of each entity text in the candidate entity corpus by using a key fragment extraction model;
s22, if the key fragment deduced by the key fragment extraction model is empty, deleting the corresponding entity text, if the deduced key fragment is not empty, combining the obtained key fragment and the entity text to obtain an entity key fragment tuple, wherein all entity key fragment tuples jointly form an entity key fragment tuple corpus;
and S23, carrying out deduplication processing on all the obtained key fragments to obtain key fragment corpora.
Further, the specific process of step S3 is as follows:
s31, constructing a directed graph Key _ Map for each Key segment in the Key segment corpus, specifically, each Key segment is used as a Node, and endowing the Node with an attribute, namely entity type, traversing the nodes, comparing each Node with other nodes only once, and if the nodes are compared i Including Node j All the words of, then Node i And Node j Connect an edge, and Node i Adding 1,node to the degree of addition j Adding 1 to the out degree of (1);
s32, for the directed graph Key _ Map, setting the entity class value of the node with the out degree of 0 as the entity class value;
s33, traversing the nodes with the out degrees larger than 0, and for all the nodes with the out degrees of 1, assigning the entity class values of the nodes with the out degrees larger than 0 as the entity class values of the arc head nodes, and subtracting 1 from the out degrees of the nodes;
s34, traversing nodes with the output degrees larger than 0, checking the arc head node of each node, if the entity type value of the arc head node is also the arc head node of the current node, deleting the connecting line of the arc head node, and subtracting 1 from the output degree of the current node;
s35, repeating the steps S33 and S34 until the directed graph traverses all the nodes without change;
s36, for each entity key fragment tuple, searching a node corresponding to the key fragment in the directed graph, and replacing the key fragment with the entity category value of the corresponding node;
and S37, traversing the entity key fragment tuple corpus, and merging and aggregating the same key fragments into one type.
Further, the specific process of step S4 is as follows:
and merging the entity texts in the entity key fragment tuples corresponding to the key fragments which are gathered into one category.
The beneficial effects of the invention are as follows: compared with the classical key fragment extraction model of BERT + BiLSTM + CRF, the method fuses, in a fusion layer, the embedded vector of each character with the category information of the word the character belongs to, so that each character of the entity text is represented more accurately by its embedded vector and the generalization ability is stronger; at the same time, the additional knowledge improves the accuracy of the key fragment extraction model within the entity alignment method and thereby improves the accuracy of entity alignment. In big-data scenarios, the method can automatically cluster different descriptions of the same entity together, supporting public-opinion analysis in urban governance and the construction of entity knowledge graphs.
Drawings
FIG. 1 is a flow chart of an entity alignment method based on knowledge enhancement according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a key segment extraction model provided by an embodiment of the present invention;
FIG. 3 is a schematic flow chart of a fusion layer provided by an embodiment of the invention;
fig. 4 is a schematic diagram of an example of the merging algorithm of the directed graph.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In order to explain the technical means of the present invention, the following description will be given by way of specific examples.
As shown in fig. 1, the entity alignment method based on knowledge enhancement provided by this embodiment includes the following steps:
s1, training a key fragment extraction model.
This step is the model training phase; a schematic diagram of the key fragment extraction model is shown in fig. 2, and the flow of the fusion layer is shown in fig. 3. With reference to figs. 2 and 3, the specific training process of this step is as follows:
s11, inputting a training sample set, wherein the sample format in the sample set is [ entity text, key fragment ].
The training samples are accurately labeled sample data, for example sample 1: [Haha City Happy Planet Geo-Info Company, Happy Planet Geo-Info]; sample 2: [Haha Xintou Co., Ltd., Xintou].
And S12, for each sample, each character is converted into its character code with the BERT tokenizer, and special-character codes are added at the head and tail to form the character codes of the entity text, which are then input into the BERT model to obtain the embedded vector of each character.
Each character of the sample is converted into its character code with the BERT tokenizer, and the special-character codes are added at the head and tail of the code sequence. This step uses the BERT model chinese-bert-wwm-ext (BERT: Bidirectional Encoder Representations from Transformers).
For example, the entity text "Haha City Happy Planet Geo-Info Company" is converted into the character codes [101, 1506, 1506, 2356, 2571, 727, 3215, 4413, 1765, 928, 1062, 1385, 102], where 101 is the code of the special character 'CLS' and 102 is the code of the special character 'SEP'; the character codes of each entity text start with code 101 and end with code 102. The character codes of the entity text are then input into the BERT model to obtain the embedded vector Char_Tensor of each character.
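The head/tail special-character step can be illustrated with plain lists (101 and 102 are the standard BERT vocabulary ids for '[CLS]' and '[SEP]'; in practice the per-character ids would come from a BERT tokenizer rather than being written out by hand):

```python
CLS_ID, SEP_ID = 101, 102  # BERT ids of the special characters

def wrap_with_specials(char_ids):
    """Add the [CLS] code at the head and the [SEP] code at the tail
    of a character-code sequence (step S12)."""
    return [CLS_ID] + list(char_ids) + [SEP_ID]

# character codes of the example entity text, as listed above
ids = [1506, 1506, 2356, 2571, 727, 3215, 4413, 1765, 928, 1062, 1385]
encoded = wrap_with_specials(ids)
```

The wrapped sequence is what is fed to the BERT model to obtain Char_Tensor for each position.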
And S13, performing word segmentation processing on the entity text by using various word segmentation tools aiming at each sample to obtain a multi-group word segmentation list.
For example, for each sample the entity text is segmented with three word segmentation tools respectively, namely jieba, thuseg and thulac, yielding three word segmentation lists.
For example, the entity text "Haha City Happy Planet Geo-Info Company" yields the following results after segmentation by the three tools:
Word segmentation list 1: [Haha, City, Happy, Planet, Geo-Info Company]
Word segmentation list 2: [Haha City, Happy, Planet, Geo-Info, Company]
Word segmentation list 3: [Haha, City, Happy Planet, Geo-Info, Company].
And S14, classifying each word in each group of word segmentation list by using a PaddleNLP tool to obtain the category attribute of the word.
PaddleNLP is an open-source natural language processing library. The category attribute of each word in a segmentation list is obtained with the named entity recognition tool in PaddleNLP, which divides all words into 93 categories.
For example, word segmentation list 1, [Haha, City, Happy, Planet, Geo-Info Company], gives the following result after PaddleNLP processing:
[(Haha, onomatopoeia), (City, world-class_district concept), (Happy, modifier), (Planet, world-class_geography concept), (Geo-Info Company, organization-class_concept)];
word segmentation list 2, [Haha City, Happy, Planet, Geo-Info, Company], gives:
[(Haha City, world-class), (Happy, modifier), (Planet, world-class_geography concept), (Geo-Info, terminology-class), (Company, organization-class_concept)];
and word segmentation list 3, [Haha, City, Happy Planet, Geo-Info, Company], gives:
[(Haha, onomatopoeia), (City, world-class_district concept), (Happy Planet, organization-class), (Geo-Info, terminology-class), (Company, organization-class_concept)].
S15, weighting and fusing the embedded vector of each character in the entity text and the category vector corresponding to the category attribute of the word containing the character to obtain the fused embedded vector of each character.
The specific process of weighted fusion is as follows:
151. For each character Ci in the entity text, the word containing Ci is found in each of the three word segmentation lists to obtain that word's category attribute.
152. The category vectors of the character Ci in the three word segmentation lists are obtained from the category attributes: Class1_Tensor, Class2_Tensor and Class3_Tensor, which may be identical when the word categories are consistent. In this embodiment, the key fragment extraction model maintains a 93 × 768 category matrix in which each 1 × 768 vector represents one category. The category matrix is a learnable parameter that changes over the course of training; its initial values are assigned randomly.
153. And performing weighted fusion on each word Ci:
T_1(Ci) = Char_Tensor(Ci) ⊕ Class1_Tensor(Ci)
T_2(Ci) = Char_Tensor(Ci) ⊕ Class2_Tensor(Ci)
T_3(Ci) = Char_Tensor(Ci) ⊕ Class3_Tensor(Ci)
Fuse_Tensor(Ci) = w_1·T_1(Ci) + w_2·T_2(Ci) + w_3·T_3(Ci)
wherein Char_Tensor(Ci) is the embedded vector obtained by passing the character Ci through the BERT model, Classi_Tensor(Ci) is the category vector of the word containing Ci in the i-th word segmentation list, w_i represents the weight of the i-th spliced vector, ⊕ denotes splicing the two parts together into one vector, and Fuse_Tensor(Ci) is the fused embedded vector of the character Ci.
For example, the character "Ha" belongs to the word "Haha" in the first segmentation list, "Haha City" in the second, and "Haha" in the third. Assuming the category attribute obtained in step 151 for "Haha" is "onomatopoeia", and "onomatopoeia" corresponds to row 11 of the category matrix, the 1 × 768 vector in row 11 of the category matrix is taken as the category vector of the word containing "Ha" in the first segmentation list; the category vectors for the second and third segmentation lists are obtained in the same way. The fused embedded vector of "Ha" is then obtained from the weighted-fusion formulas in turn.
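The fusion of step S15 can be sketched as follows (4-dimensional stand-ins replace the 768-dimensional Char_Tensor and category vectors, and the weight values are illustrative; in the real model the weights are learnable parameters):

```python
def fuse_char(char_tensor, class_tensors, weights):
    """Splice (concatenate) the character's embedded vector with each of
    its three category vectors, then sum the spliced vectors with weights
    w1..w3 to obtain the fused embedded vector (step 153)."""
    spliced = [list(char_tensor) + list(c) for c in class_tensors]
    dim = len(spliced[0])
    return [sum(w * t[k] for w, t in zip(weights, spliced)) for k in range(dim)]

# 4-dimensional stand-ins for the 768-dimensional vectors
char_tensor = [1.0, 2.0, 3.0, 4.0]            # Char_Tensor of one character
class_tensors = [[0.0, 0.0, 0.0, 1.0],        # Class1_Tensor
                 [0.0, 0.0, 1.0, 0.0],        # Class2_Tensor
                 [0.0, 1.0, 0.0, 0.0]]        # Class3_Tensor
weights = [0.5, 0.3, 0.2]                     # illustrative weight values
fused = fuse_char(char_tensor, class_tensors, weights)
```

Each spliced vector has twice the embedding dimension, and the fused result keeps the character embedding in its first half (here the weights sum to 1) while blending the three category vectors in its second half.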
And S16, the fused embedded vectors of the characters and the vectors corresponding to the special characters together form the entity embedded vector. The special characters here are 'CLS' and 'SEP'.
And S17, the fused entity embedded vector is input into a BiLSTM network to obtain the corpus fragment emission matrix.
In concrete terms, the fused entity embedded vector is input into the BiLSTM network to obtain the hidden-state vector of the entity, and the hidden-state vector is then input into a fully connected layer to obtain the corpus fragment emission matrix Emit_m. Emit_m is a Tag_num × Span_Len matrix, where Tag_num is the number of tags (the 5 tags O, B, I, E and S) and Span_Len is the number of embedded vectors of the entity text, namely the number of characters of the entity text plus 2 (the two special characters).
And S18, inputting the emission matrix into a CRF network, and calculating to obtain the scores of the correct mark sequences and the total scores of all possible mark sequences according to the emission matrix and the transfer matrix.
The specific process of the step is as follows:
181. For each character in the entity text: if the character does not belong to the key fragment, it is tagged O; if it belongs to a key fragment of more than one character, the first character is tagged B, the last character E, and the other characters I; if it belongs to a key fragment of exactly one character, it is tagged S.
182. The tags of the characters form a sequence, and the tag O is added at the head and the tail (the tags corresponding to the two special characters 'CLS' and 'SEP'), giving the correct tag sequence of the sample.
The steps 181 and 182 mainly achieve that for each sample, a correct marking sequence corresponding to the sample is obtained according to the entity text and the key fragment in the sample. For example:
a first sample: the Hahah Happy celestial West company, happy celestial earth letter
The sequence obtained after step 181 is labeled as [ O, O, O, B, I, I, I, I, E, O, O ]
The correct sequence obtained after step 182 is labeled as [ O, O, O, O, B, I, I, I, I, E, O, O ].
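Steps 181-182 can be sketched as follows (an illustrative sketch; locating the key fragment with a first-occurrence search is an assumption of this sketch, and the Chinese example strings are the virtual sample from this description):

```python
def bioes_tags(entity_text, key_fragment):
    """Build the correct tag sequence of a sample (steps 181-182):
    O outside the key fragment; B/I/E over a multi-character fragment;
    S for a single-character fragment; one extra O at the head and tail
    for the [CLS]/[SEP] special characters."""
    tags = ["O"] * len(entity_text)
    start = entity_text.find(key_fragment)  # first occurrence (assumption)
    if key_fragment and start != -1:
        n = len(key_fragment)
        if n == 1:
            tags[start] = "S"
        else:
            tags[start] = "B"
            tags[start + n - 1] = "E"
            for k in range(start + 1, start + n - 1):
                tags[k] = "I"
    return ["O"] + tags + ["O"]

seq = bioes_tags("哈哈市快乐星球地信公司", "快乐星球地信")
```

For the 11-character sample text the function returns the 13-tag correct sequence shown above.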
183. And inputting the emission matrix Emit _ m into a CRF network, wherein the CRF network obtains the correct mark sequence score and the total score of all possible mark sequences by using a loss score formula based on the emission matrix Emit _ m and the transfer matrix Trans _ m.
The correct tag sequence is the sequence that matches the sample's corpus tags, and all possible tag sequences are all the sequences the model can generate, totaling Tag_num^Span_Len sequences, where Tag_num is the number of tags in the tag set and Span_Len is the number of character vectors of the entity. The transfer matrix Trans_m in the CRF network is initially a randomly assigned matrix; its values in the s-th round of training are the values adjusted after round s-1.
The score of each tag sequence is computed as:
Score(x, y) = Σ_{i=1..s} Emit_m[y_i, i] + Σ_{i=2..s} Trans_m[y_{i-1}, y_i]
where Score(x, y) denotes the score of the input sample x tagged with the sequence y, Emit_m[y_i, i] is the emission score of the i-th tag in the predicted tag sequence y, s is the length of the whole predicted tag sequence y, and Trans_m[y_{i-1}, y_i] is the transfer score from the (i-1)-th tag to the i-th tag in the predicted sequence y.
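The sequence-score formula can be sketched directly (the tiny 2-tag matrices below are illustrative values, not trained parameters):

```python
def sequence_score(tags, emit_m, trans_m, tag_index):
    """Score of one tag sequence: the sum of the emission scores at each
    position plus the transfer scores between consecutive tags."""
    idx = [tag_index[t] for t in tags]
    score = sum(emit_m[idx[i]][i] for i in range(len(idx)))        # emissions
    score += sum(trans_m[idx[i - 1]][idx[i]] for i in range(1, len(idx)))
    return score

tag_index = {"O": 0, "S": 1}                  # toy 2-tag set
emit_m = [[1.0, 0.0, 2.0],                    # Tag_num x Span_Len emissions
          [0.0, 3.0, 0.0]]
trans_m = [[0.1, 0.2],                        # Tag_num x Tag_num transfers
           [0.3, 0.4]]
score = sequence_score(["O", "S", "O"], emit_m, trans_m, tag_index)
```

Here the emissions contribute 1.0 + 3.0 + 2.0 and the transfers 0.2 + 0.3, matching the two sums of the formula.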
And S19, calculating loss scores according to the scores of the correct mark sequences and the total scores of all possible mark sequences.
The loss score is computed as:
Loss(x) = log( Σ_{y ∈ Y} e^{Score(x, y)} ) − Score(x, ȳ)
where Score(x, ȳ) denotes the score of the correct tag sequence ȳ of the input sample x, Score(x, y) denotes the score of any possible tag sequence y, the first term is the natural logarithm of the sum, over all possible tag sequences Y, of e raised to the sequence score, and Loss(x) is the loss score of the input sample x whose correct tag sequence is ȳ.
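The loss can be checked by brute force on a tiny tag set, enumerating all Tag_num^Span_Len sequences (real CRF implementations compute the log-sum with the forward algorithm instead; the matrices are the same illustrative values as above):

```python
import math
from itertools import product

def crf_loss(correct_tags, emit_m, trans_m, tag_index):
    """Loss of one sample: log of the sum over all possible tag sequences
    of e**Score, minus the score of the correct sequence."""
    def score(idx):
        s = sum(emit_m[idx[i]][i] for i in range(len(idx)))
        s += sum(trans_m[idx[i - 1]][idx[i]] for i in range(1, len(idx)))
        return s
    span_len = len(correct_tags)
    total = math.log(sum(math.exp(score(idx))
                         for idx in product(range(len(tag_index)),
                                            repeat=span_len)))
    return total - score([tag_index[t] for t in correct_tags])

tag_index = {"O": 0, "S": 1}
emit_m = [[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]]
trans_m = [[0.1, 0.2], [0.3, 0.4]]
loss = crf_loss(["O", "S", "O"], emit_m, trans_m, tag_index)
```

The loss is always positive, since the log-sum over all sequences exceeds the score of any single sequence; training drives it toward zero by raising the correct sequence's share.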
And S110, traversing the training samples, modifying and updating the parameters of the key segment extraction model by using a gradient descent method after the training samples are traversed once, and then selecting a parameter version with the highest verification accuracy as a finally trained model by using the accuracy of the verification sample test model.
The model traverses the training samples multiple times; after each pass, the model parameters are updated with gradient descent and the model's accuracy is then tested on the validation samples. During validation, if the extracted key fragment is inconsistent with the sample's key fragment it counts as a miss, and if consistent it counts as a hit; the number of hits divided by the total number of validation samples is the model's accuracy. The parameter version with the highest validation accuracy is selected as the final trained model.
And S2, extracting key segments in the entity text description by using a key segment extraction model.
Model training was completed in step S1; this step uses the key fragment extraction model for inference. The specific process of this step is as follows:
and S21, obtaining corresponding key fragments of each entity text in the candidate entity corpus by using a key fragment extraction model.
For each entity text, input it into the model to obtain its tag sequence. Remove 1 tag from each of the head and tail of the tag sequence, and then take out the characters at all non-O positions as the key fragment corresponding to the entity text.
For example, inputting the entity text "Hahah city Happy celestial earth letter company" into the model yields the tag sequence [O, O, O, O, B, I, I, I, E, O, O]. Removing the head and tail tags gives [O, O, O, B, I, I, I, E, O],
and taking out the characters at the non-O positions gives the key fragment "Happy celestial earth letter".
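The decoding in S21 can be sketched as follows, assuming one tag per character plus one extra tag at each end for the special BERT characters (the sample strings in the test are hypothetical stand-ins for the Chinese text):

```python
def extract_fragment(text, tags):
    """Drop the head and tail tags (special-character positions), then keep
    the characters whose tags are not O."""
    inner = tags[1:-1]
    assert len(inner) == len(text), "one tag per character expected"
    return "".join(ch for ch, tag in zip(text, inner) if tag != "O")
```

An all-O sequence yields the empty string, which is the "empty key fragment" case handled in S22.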
S22, if the key fragment inferred by the key fragment extraction model is empty, delete the corresponding entity text; if the inferred key fragment is not empty, combine the obtained key fragment with the entity text to obtain an entity key fragment tuple. All entity key fragment tuples together form the entity key fragment tuple corpus. Each entity key fragment tuple consists of [entity text, key fragment].
For example, the entity text "Hahah city Happy celestial earth letter company" and the key segment "Happy celestial earth letter" compose the key fragment tuple [Hahah city Happy celestial earth letter company, Happy celestial earth letter].
If the key fragment extracted by the model is empty (that is, the predicted tag sequence is all O, such as [O, O, O, O, O]), the corresponding entity text, such as "haha city", is deleted.
All entity key fragment tuples together form an entity key fragment tuple corpus.
S23, perform deduplication processing on all the obtained key segments to obtain the key segment corpus Key_Corpus.
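Steps S22 and S23 can be sketched together; `extract` is again a stand-in for the trained model, and `dict.fromkeys` deduplicates while preserving order:

```python
def build_tuple_corpus(entity_texts, extract):
    """Pair every entity text with its non-empty key fragment; entity texts
    whose inferred fragment is empty are dropped (S22)."""
    return [[text, extract(text)] for text in entity_texts if extract(text)]

def build_key_corpus(tuple_corpus):
    """Deduplicated key fragment corpus Key_Corpus (S23)."""
    return list(dict.fromkeys(fragment for _, fragment in tuple_corpus))
```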
And S3, constructing a directed graph according to the extracted key fragments, and merging the key fragments describing the same entity by using a merging algorithm.
This step completes the entity key segment clustering and entity alignment. As shown in fig. 4, the method comprises the following steps:
S31, construct a directed graph Key_Map for the key segments in the key segment corpus. Specifically, each key segment serves as a node and is given an attribute, the entity category, whose value is initialized to the key segment itself. Traverse the nodes, comparing each node with every other node exactly once: if Node_i contains all the words of Node_j, then connect an edge between Node_i and Node_j, add 1 to the in-degree of Node_i, and add 1 to the out-degree of Node_j. In this embodiment, Node_i is called the arc head of Node_j and Node_j the arc tail of Node_i, i.e. Node_j points to Node_i.
For example, the key fragment corpus [ kefa, qiancao development, qiancao company, kilogram development limited company, kefa, science and technology development hall, qiancao ], and node 1 represents kefa, node 2 represents qiancao development, and the other nodes are shown in fig. 4.
Taking node 1 as an example: node 4, node 2, node 5, and node 6 all contain all the words of node 1, so the out-degree of node 1 is 4, and the in-degree of each of nodes 4, 2, 5, and 6 increases by 1.
Taking node 7 as an example: nodes 2, 4, and 3 all contain all the words of node 7, so the out-degree of node 7 is 3.
The finally constructed directed graph is shown as stage 1 in fig. 4, where the entity category value of every node is the node itself.
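Step S31 can be sketched as follows. Character-set containment stands in for "contains all the words" (an assumption; the patent's containment test is on the Chinese text), and the node names in the test are hypothetical stand-ins for the fragments in fig. 4:

```python
def build_key_map(key_corpus):
    """Directed graph Key_Map: fragment j points to fragment i (its arc head)
    when i contains every character of j.  Each node starts with its own
    text as its entity category value."""
    nodes = {k: {"category": k, "in": 0, "out": 0, "heads": set()} for k in key_corpus}
    keys = list(key_corpus)
    for a in range(len(keys)):
        for b in range(a + 1, len(keys)):            # each pair compared once
            sa, sb = set(keys[a]), set(keys[b])
            if sa <= sb:
                small, big = keys[a], keys[b]        # keys[a] -> keys[b]
            elif sb < sa:
                small, big = keys[b], keys[a]        # keys[b] -> keys[a]
            else:
                continue                             # neither contains the other
            nodes[small]["heads"].add(big)           # big is the arc head
            nodes[small]["out"] += 1                 # arc tail gains out-degree
            nodes[big]["in"] += 1                    # arc head gains in-degree
    return nodes
```

With a seven-fragment corpus shaped like the fig. 4 example, the shortest fragment ends up with out-degree 4 and the longest with out-degree 0, mirroring nodes 1 and 4.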
S32, for the directed graph Key_Map, set the entity category value of each node with out-degree 0 to the node itself.
S33, traverse the nodes with out-degree greater than 0; for every node with out-degree 1, assign its entity category value to be the entity category value of its arc head node, and subtract 1 from the node's out-degree.
S34, traverse the nodes with out-degree greater than 0 and check the arc head nodes of each node: if the entity category value of an arc head node is itself another arc head node of the current node, delete the edge to that arc head node and subtract 1 from the out-degree of the current node. For example, if node_i has out-degree 1 and points to node_j (i.e. node_j is the arc head of node_i), and the entity category value of node_j at this time is node_k, then node_i assigns its entity category value as node_k and its out-degree decreases by 1.
And S35, repeating the steps S33 and S34 until the directed graph traverses all the nodes without change.
Steps S32 to S35 are specific operation procedures made on the directed graph Key _ Map. With reference to fig. 4, an example process is detailed as follows:
In the first step, find the 2 nodes with out-degree 0: node 4 and node 6. Assign their entity categories as themselves, i.e. node 4 as "Qiancuo development Co., ltd" and node 6 as "science and technology development hall".
In the second step, find the nodes with out-degree 1 and execute step S33:
the out-degree of node 3 is 1, so the entity category of node 3 is assigned "Qiancuo development Co., ltd" and its out-degree decreases by 1;
the out-degree of node 2 is 1, so the entity category of node 2 is assigned "Qiancuo development Co., ltd" and its out-degree decreases by 1;
the out-degree of node 5 is 1, so the entity category of node 5 is assigned "science and technology development hall" and its out-degree decreases by 1.
Third, step S34 is executed:
Node 7 has three arc head nodes: nodes 2, 3, and 4. Because the entity category of node 2 is node 4, and node 4 is also an arc head of node 7, the edge to arc head node 2 is deleted and the out-degree of node 7 decreases by 1. Node 3 is similar to node 2, so the edge to arc head node 3 is also deleted and the out-degree decreases by 1. At this point node 7 has only 1 arc head node and an out-degree of 1.
Node 1 has four arc head nodes. By an operation similar to node 7, the edges to arc head node 2 and arc head node 5 can be deleted, so the out-degree of node 1 changes from 4 to 2.
At this time, as shown in stage 2 in fig. 4.
Fourth, step S33 is executed.
The out-degree of node 7 is now 1, so the entity category of node 7 is assigned "Qiancuo development Co., ltd".
Fifth, step S34 is executed.
No node changes.
Sixth, step S33 is executed.
No node changes.
Seventh, step S34 is executed.
No node changes.
At this point, the operations of S33 and S34 no longer change the directed graph, so the process stops, as shown in stage 3 in fig. 4.
Finally, the entity categories of nodes 2, 3, 4, and 7 are all "Qiancuo development Co., ltd", the entity categories of nodes 5 and 6 are both "science and technology development hall", and the entity category of node 1 is "science and technology development".
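Steps S32–S35 can be sketched as the following fixed-point loop (a simplified reading of the merging algorithm; `heads` is each node's set of remaining arc head nodes from the graph-construction step, and the node names in the test are hypothetical stand-ins for fig. 4):

```python
def merge_categories(nodes):
    """Repeat the S33/S34 operations until a full pass leaves the graph
    unchanged.  nodes[k] = {"category": current value, "heads": arc heads}."""
    changed = True
    while changed:
        changed = False
        # S33: a node with exactly one remaining arc head inherits that
        # head's current entity category and drops the edge (out-degree -> 0)
        for n in nodes.values():
            if len(n["heads"]) == 1:
                (head,) = n["heads"]
                n["category"] = nodes[head]["category"]
                n["heads"].clear()
                changed = True
        # S34: delete the edge to an arc head whose category value is itself
        # another arc head of the current node
        for n in nodes.values():
            for head in list(n["heads"]):
                cat = nodes[head]["category"]
                if cat != head and cat in n["heads"]:
                    n["heads"].discard(head)
                    changed = True
    return {k: n["category"] for k, n in nodes.items()}
```

On the fig. 4 graph this reproduces the outcome above: the three fragments contained in the longest company name all converge to its category, while the ambiguous node 1 keeps its own.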
S36, for each entity key fragment tuple, searching a node corresponding to the key fragment in the directed graph, and replacing the key fragment with the entity category value of the corresponding node.
This step is performed for each entity key fragment tuple in the entity key fragment tuple corpus. For example, for the entity key fragment tuple [Khaha city Qiancou shares company, Qiancou], the entity category value of the node where the key segment "Qiancou" is located is "Qiancuo development Co., ltd", so the key segment is changed to "Qiancuo development Co., ltd". The modified entity key fragment tuple is [Khaha city Qiancou shares company, Qiancuo development Co., ltd].
And S37, traversing the entity key fragment tuple corpus, and merging and aggregating the same key fragments into one type.
For example, consider the entity key fragment tuples [Khaha city Qiancou shares company, Qiancuo development Co., ltd], [Khaha Qiancou development, Qiancuo development Co., ltd], and [Khaha city department company, Khaha department].
After the operation of this step, the tuples whose key fragment is "Qiancuo development Co., ltd" are aggregated into one category, and the tuple whose key fragment is "Khaha department" forms another category.
And S4, merging entity texts corresponding to the key segments describing the same entity.
And merging the entity texts in the entity key fragment tuples corresponding to the key fragments which are gathered into one category.
That is, for the example of step S37, after the merging of this step, the final output is [[Khaha city Qiancou shares company, Khaha Qiancou development], [Khaha city department company]].
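Steps S36–S37 and S4 can be sketched together; `categories` maps each key fragment to the entity category value of its node (as produced by the merging stage), and the sample names are hypothetical:

```python
from collections import defaultdict

def align_entities(tuple_corpus, categories):
    """Replace each key fragment with its node's entity category value (S36),
    then merge the entity texts that share the same resolved category (S37, S4)."""
    groups = defaultdict(list)
    for text, fragment in tuple_corpus:
        # a fragment absent from the map keeps itself as its category
        groups[categories.get(fragment, fragment)].append(text)
    return list(groups.values())
```

Each returned sublist holds the different description texts that were judged to refer to the same entity.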
In the present invention, the fusion layer fuses the embedded vectors of the characters with the category information of the words those characters form, so each character in the entity text is expressed more accurately by its embedded vector, and the generalization capability is stronger. For example, the model is trained on data from the Wuhan region and then performs inference in the Yichang region. For the entity text "Yichang Happy celestial sphere company", although the model has never seen "Yichang", word segmentation of the entity text yields the word "Yichang", whose category obtained by the PaddleNLP tool is "world region class", so the model can judge more accurately that "Yichang" does not belong to a key segment. The PaddleNLP tool is trained on large-scale data and contains rich comprehensive knowledge. The knowledge enhancement of the invention blends the knowledge in the PaddleNLP tool into the model by this method, thereby improving the model's inference accuracy and generalization capability.
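A minimal sketch of the fusion-layer idea follows. Plain Python lists stand in for tensors, and the weights would be learned parameters in the real model — this is an illustration of the splice-then-weight scheme, not the patent's implementation:

```python
def fuse_character(char_vec, category_vecs, weights):
    """Fused embedding of one character: splice the character's BERT embedding
    with each tokenizer's word-category vector, then take a weighted sum of
    the spliced vectors."""
    spliced = [char_vec + cat_vec for cat_vec in category_vecs]  # list splice
    dim = len(spliced[0])
    return [sum(w * v[d] for w, v in zip(weights, spliced)) for d in range(dim)]
```

All category vectors are assumed to share one dimension so the spliced vectors can be summed element-wise.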
In summary, the invention provides an entity alignment method based on knowledge enhancement. Different description texts of the same entity can be automatically clustered together under the big data scene, and support and guarantee are provided for public opinion analysis in city treatment and establishment of an entity knowledge graph.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (8)

1. A method for entity alignment based on knowledge enhancement, the method comprising the steps of:
s1, training a key fragment extraction model;
s2, extracting key fragments in the entity text description by using a key fragment extraction model;
s3, constructing a directed graph according to the extracted key fragments, and merging the key fragments describing the same entity by using a merging algorithm;
and S4, merging entity texts corresponding to the key segments describing the same entity.
2. The entity alignment method based on knowledge enhancement as claimed in claim 1, wherein the specific process of the step S1 is as follows:
s11, inputting a training sample set, wherein the sample format in the sample set is [ entity text, key fragment ];
s12, aiming at each sample, converting the sample into a corresponding character code by using a BERT word segmentation device, adding special character codes at the head and the tail of the sample to form a character code of an entity text, and then inputting a BERT model to obtain an embedded vector of each character;
s13, performing word segmentation processing on the entity text with multiple word segmentation tools for each sample to obtain multiple groups of word segmentation lists;
s14, classifying each word in each group of word segmentation list by using a PaddleNLP tool to obtain the class attribute of the word;
s15, weighting and fusing the embedded vector of each character in the entity text and the category vector corresponding to the category attribute of the word to which the character belongs to obtain a fused embedded vector of each character;
s16, forming entity embedded vectors by the fused embedded vectors of each word and the vectors corresponding to the special characters;
s17, inputting the fused entity embedding vector into a BiLSTM network to obtain a corpus fragment emission matrix;
s18, inputting the emission matrix into a CRF network, and calculating to obtain a correct mark sequence score and a total score of all possible mark sequences according to the emission matrix and the transfer matrix;
s19, calculating loss scores according to the scores of the correct mark sequences and the total scores of all possible mark sequences;
and S110, traversing the training samples, modifying and updating parameters of the key fragment extraction model by using a gradient descent method after the training samples are traversed once, and selecting a parameter version with the highest verification accuracy as a finally trained model by using the accuracy of the verification sample test model.
3. The entity alignment method based on knowledge enhancement as claimed in claim 2, wherein there are three word segmentation tools, and the specific process of step S15 is as follows:
151. for each character Ci in the entity text, searching the three word segmentation lists for the word containing the character Ci to obtain the word's category attribute;
152. acquiring the category vectors of the character Ci in the three word segmentation lists according to the category attributes;
153. performing weighted fusion on each character Ci:

$v^{(k)}_{C_i} = [\, e_{C_i} \,;\, t^{(k)}_{C_i} \,], \quad k = 1, 2, 3$

$f_{C_i} = \sum_{k=1}^{3} w_k \, v^{(k)}_{C_i}$

wherein $e_{C_i}$ is the embedding vector obtained by passing the character Ci through the BERT model, $t^{(k)}_{C_i}$ is the category vector of the word containing Ci in the k-th word segmentation list, $w_k$ represents the weight of the k-th spliced vector, $[\,\cdot\,;\,\cdot\,]$ indicates that the two parts are spliced together, and $f_{C_i}$ is the fused embedding vector of the character Ci.
4. The entity alignment method based on knowledge enhancement as claimed in claim 3, wherein the process of step S17 is: inputting the fused entity embedding vector into a BiLSTM network to obtain the hidden layer state vector of the entity, and inputting the hidden layer state vector into a fully connected layer to obtain the corpus fragment emission matrix Emit_m, wherein Emit_m is a matrix of dimensions Tag_num × Span_Len, Tag_num is the number of marks (the 5 marks being O, B, I, E, and S), and Span_Len is the number of embedded vectors of the word vectors of the text entity.
5. The entity alignment method based on knowledge enhancement as claimed in claim 4, wherein the process of step S18 is:
181. for each character in the entity text, if the character does not belong to the key segment, marking it as O; if the character belongs to the key segment and the key segment has more than 1 character, marking the first character as B, the last character as E, and the other characters as I; if the character belongs to the key segment and the key segment has exactly 1 character, marking the character as S;
182. forming a sequence in the order of the characters in the entity text, and adding the mark O at the head and the tail of the sequence to form the correct mark sequence corresponding to the sample;
183. inputting the emission matrix Emit_m into a CRF network, wherein the CRF network obtains the correct mark sequence score and the total score of all possible mark sequences by using the score formula, based on the emission matrix Emit_m and the transition matrix Trans_m.
6. The method for entity alignment based on knowledge enhancement as claimed in claim 5, wherein the specific process of step S2 is as follows:
s21, obtaining a corresponding key fragment of each entity text in the candidate entity corpus by using a key fragment extraction model;
s22, if the key fragment deduced by the key fragment extraction model is empty, deleting the corresponding entity text, if the deduced key fragment is not empty, combining the obtained key fragment and the entity text to obtain an entity key fragment tuple, wherein all entity key fragment tuples jointly form an entity key fragment tuple corpus;
and S23, performing deduplication processing on all the obtained key fragments to obtain key fragment corpora.
7. The entity alignment method based on knowledge enhancement as claimed in claim 6, wherein the specific process of the step S3 is as follows:
s31, constructing a directed graph Key_Map for the key segments in the key segment corpus, specifically, taking each key segment as a node, giving it an attribute, namely the entity category, and initializing the entity category to the key segment itself; traversing the nodes, comparing each node with every other node exactly once, and if Node_i contains all the words of Node_j, connecting an edge between Node_i and Node_j, adding 1 to the in-degree of Node_i, and adding 1 to the out-degree of Node_j;
s32, for the directed graph Key_Map, setting the entity category value of each node with out-degree 0 to the node itself;
s33, traversing the nodes with out-degree greater than 0, and for every node with out-degree 1, assigning its entity category value as the entity category value of its arc head node and subtracting 1 from the node's out-degree;
s34, traversing the nodes with out-degree greater than 0 and checking the arc head nodes of each node, and if the entity category value of an arc head node is itself another arc head node of the current node, deleting the edge to that arc head node and subtracting 1 from the out-degree of the current node;
s35, repeating the steps S33 and S34 until the directed graph traverses all the nodes without change;
s36, for each entity key fragment tuple, searching a node corresponding to the key fragment in the directed graph, and replacing the key fragment with the entity category value of the corresponding node;
and S37, traversing the entity key fragment tuple corpus, and merging and aggregating the same key fragments into one type.
8. The entity alignment method based on knowledge enhancement as claimed in claim 7, wherein the specific process of the step S4 is as follows:
and merging the entity texts in the entity key fragment tuples corresponding to the key fragments which are gathered into one category.
CN202310063495.1A 2023-02-06 2023-02-06 Entity alignment method based on knowledge enhancement Active CN115795060B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310063495.1A CN115795060B (en) 2023-02-06 2023-02-06 Entity alignment method based on knowledge enhancement


Publications (2)

Publication Number Publication Date
CN115795060A true CN115795060A (en) 2023-03-14
CN115795060B CN115795060B (en) 2023-04-28

Family

ID=85429829

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310063495.1A Active CN115795060B (en) 2023-02-06 2023-02-06 Entity alignment method based on knowledge enhancement

Country Status (1)

Country Link
CN (1) CN115795060B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116579344A (en) * 2023-07-12 2023-08-11 吉奥时空信息技术股份有限公司 Case main body extraction method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111680488A (en) * 2020-06-08 2020-09-18 浙江大学 Cross-language entity alignment method based on knowledge graph multi-view information
CN111753024A (en) * 2020-06-24 2020-10-09 河北工程大学 Public safety field-oriented multi-source heterogeneous data entity alignment method
WO2022011681A1 (en) * 2020-07-17 2022-01-20 国防科技大学 Method for fusing knowledge graph based on iterative completion
CN114528411A (en) * 2022-01-11 2022-05-24 华南理工大学 Automatic construction method, device and medium for Chinese medicine knowledge graph


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
乔晶晶; 段利国; 李爱萍: "Entity alignment algorithm fusing multiple features" *
张伟莉; 黄廷磊; 梁霄: "Entity alignment for encyclopedia knowledge bases based on semi-supervised co-training" *
曾维新; 赵翔; 唐九阳; 谭真; 王炜: "Iterative entity alignment based on re-ranking" *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116579344A (en) * 2023-07-12 2023-08-11 吉奥时空信息技术股份有限公司 Case main body extraction method
CN116579344B (en) * 2023-07-12 2023-10-20 吉奥时空信息技术股份有限公司 Case main body extraction method

Also Published As

Publication number Publication date
CN115795060B (en) 2023-04-28

Similar Documents

Publication Publication Date Title
CN112199511B (en) Cross-language multi-source vertical domain knowledge graph construction method
CN109902145B (en) Attention mechanism-based entity relationship joint extraction method and system
CN109271529B (en) Method for constructing bilingual knowledge graph of Xilier Mongolian and traditional Mongolian
WO2021093755A1 (en) Matching method and apparatus for questions, and reply method and apparatus for questions
CN107315738B (en) A kind of innovation degree appraisal procedure of text information
CN108595708A (en) A kind of exception information file classification method of knowledge based collection of illustrative plates
US20080243905A1 (en) Attribute extraction using limited training data
CN113268995A (en) Chinese academy keyword extraction method, device and storage medium
CN111767325B (en) Multi-source data deep fusion method based on deep learning
CN112101040B (en) Ancient poetry semantic retrieval method based on knowledge graph
CN115599902B (en) Oil-gas encyclopedia question-answering method and system based on knowledge graph
CN116127090B (en) Aviation system knowledge graph construction method based on fusion and semi-supervision information extraction
CN111967267B (en) XLNET-based news text region extraction method and system
CN111666374A (en) Method for integrating additional knowledge information into deep language model
CN116484024A (en) Multi-level knowledge base construction method based on knowledge graph
CN115795060B (en) Entity alignment method based on knowledge enhancement
CN116108191A (en) Deep learning model recommendation method based on knowledge graph
CN111209362A (en) Address data analysis method based on deep learning
CN114238653A (en) Method for establishing, complementing and intelligently asking and answering knowledge graph of programming education
CN116522165B (en) Public opinion text matching system and method based on twin structure
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion
CN114372454A (en) Text information extraction method, model training method, device and storage medium
CN111008285A (en) Author disambiguation method based on thesis key attribute network
CN115270774A (en) Big data keyword dictionary construction method for semi-supervised learning
CN115688779A (en) Address recognition method based on self-supervision deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant