CN113361283A

CN113361283A - Web table-oriented paired entity joint disambiguation method

Info

Publication number: CN113361283A
Application number: CN202110720148.2A
Authority: CN
Inventors: 吴天星; 李林; 漆桂林
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2021-06-28
Filing date: 2021-06-28
Publication date: 2021-09-07

Abstract

The invention discloses a Web table-oriented paired entity joint disambiguation method for the Web table entity linking task, which is to link each entity mention in a Web table unambiguously to an entity in a knowledge base. Exploiting the characteristics of tables, the invention designs a paired entity joint disambiguation method that iteratively performs joint disambiguation on the pair of entity mentions with the highest confidence, gradually disambiguating all entity mentions in the table. The confidence calculation considers several kinds of information together, including the similarity between entity mentions and candidate entities, the consistency between linked entities, and the semantic consistency of rows and columns in the table. During the iterations, entities that are already linked have high confidence and provide effective auxiliary information for subsequent linking, so that high-quality joint disambiguation is achieved.

Description

Web table-oriented paired entity joint disambiguation method

Technical Field

The invention relates to a Web table-oriented paired entity joint disambiguation method and belongs to the technical field of knowledge graphs.

Background

Web tables organize data in a structured form and provide high-quality, high-density information. It is estimated that the Web contains 14.1 billion tables, of which about 154 million are relational tables. To exploit this valuable data, computers must be able to understand these tables at the semantic level. Entity linking for tables is an effective means of achieving table understanding.

Entity linking in a table requires associating the entity mentions in table cells with the corresponding entities in a knowledge graph. An effective table entity linking system should be able to link each entity mention unambiguously to its corresponding entity in the knowledge graph based on the context of the mention in the table. Unlike entity mentions in text, whose context has a uniform structure, the context of an entity mention in a table varies with cell position, row, and column. A table entity linking method first needs to identify entity mentions in the table and generate candidate entities for them; this part of the work usually relies on heuristics to find entity mentions and candidate entities as comprehensively as possible. Candidate entities are then disambiguated: the correct entity is selected from the candidates for linking, based on the context of the entity mention in the table and the relationships between linked entities.

Entity mention identification and candidate entity generation can usually be handled well with engineering methods. Candidate entity disambiguation is the main difficulty in table entity linking; it requires a ranking model that calculates the similarity between an entity mention and its different candidate entities. When calculating this similarity, both the semantic similarity between the entity mention and the candidate entity and the correlation between linked entities need to be considered. An entity disambiguation method that uses the correlation between linked entities is referred to as a joint disambiguation method. Much current work on joint disambiguation selects, from the candidate entity sets of all entity mentions, entities that are as related to each other as possible, maximizing both the correlation among linked entities and the similarity between entity mentions and linked entities. Such joint disambiguation methods achieve good results but rest on a strong assumption that does not fully hold for real knowledge graphs and Web tables. Entities in the same row that are not in the primary key column tend to be strongly correlated with the entity in the primary key column, but are not necessarily highly correlated with entities in other columns. In addition, because knowledge graphs are incomplete, linked entities in the same column may not be particularly related. To address these defects of current joint disambiguation algorithms, the invention provides a paired entity joint disambiguation algorithm that jointly disambiguates, in turn, the pair of entity mentions with the highest confidence in the table, reducing the probability of introducing noise while preserving high-quality joint disambiguation.

Disclosure of Invention

The technical problem is as follows: in view of the structural characteristics of tables and the defects of current joint disambiguation methods, a paired entity linking method is designed. Paired entity linking here means jointly disambiguating, in turn, the pair of entity mentions with the highest confidence in the table, which reduces the probability of introducing noise while preserving high-quality joint disambiguation. The entities already linked provide richer and more accurate context for the subsequent entity linking process, so that better entity linking is achieved on real Web tables.

The technical scheme is as follows:

The paired entity joint disambiguation method of the present invention is performed by the following steps:

1) Combining every two entity mentions located in the same row or the same column of the Web table to generate all entity mention pairs (two-tuples).

2) Calculating the confidence of linking each entity mention pair, linking the pair of entity mentions with the highest confidence to their respective entities, and deleting the other candidate entities of that pair of entity mentions.

3) Updating the confidence values between the different entity mentions in the table.

4) Iterating steps 2) and 3) until all entity mentions in the table have been linked.
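As a minimal illustrative sketch of step 1), and not part of the claimed method, the pair generation could be written as follows. The function name `mention_pairs`, the `(row, column)` cell keys and the toy table are assumptions introduced for illustration; two mentions are paired here when they share either a row or a column.

```python
from itertools import combinations

def mention_pairs(mentions):
    """Return all unordered pairs of cells that share a row or a column.

    `mentions` maps a hypothetical (row, column) cell position to the
    entity mention string found in that cell.
    """
    cells = sorted(mentions)
    return [(a, b) for a, b in combinations(cells, 2)
            if a[0] == b[0] or a[1] == b[1]]

# Toy 2x2 table (illustrative data only).
table = {
    (0, 0): "Chicago", (0, 1): "Illinois",
    (1, 0): "Boston",  (1, 1): "Massachusetts",
}
pairs = mention_pairs(table)
```

In this toy table the diagonal cells are never paired, because they share neither a row nor a column.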

In a preferred embodiment of the present invention, in the step 2), the confidence level calculation is performed as follows:

2-a) The confidence calculation introduces the change in column semantic consistency during the linking process. By the nature of tables, the cell contents in the same column have similar semantic characteristics. In the entity linking task, the entities linked in the same column usually belong to a common category, so their vector representations are similar to some extent. The column semantic consistency CSC is calculated as:

CSC＝-mean(var([e₁,e₂,…,e_n]))

where e₁,e₂,…,e_n are the vector representations of the linked entities in a column, var computes the variance vector, and mean averages the values of the variance vector to obtain a scalar representing the semantic consistency of the column.

2-b) The confidence calculation introduces the change in row semantic consistency during the linking process. Row semantic consistency characterizes how consistent the relationships are between the linked entities in the other columns and the linked entities in the primary key column. It is defined as the negative mean of the relation variance vector: the smaller the variance, the larger the negative mean, the closer the relationships in different rows, and the more consistent the row semantics. The row semantic consistency RSC is calculated as:

r＝e_non-subject − e_subject

RSC＝-mean(var([r₁,r₂,…,r_n]))

where e_subject is the linked entity in the primary key column, e_non-subject is the linked entity in a non-primary key column, and r is the relation vector; var computes the variance vector, mean averages the values of the variance vector to obtain a scalar representing row semantic consistency, and r₁,r₂,…,r_n are the relation vectors formed by the linked entities in different rows.

2-c) The confidence calculation introduces the consistency of entities within the table during the linking process. The consistency of linked entities is calculated as the cosine similarity of their entity vector representations:

EES(e₁,e₂)＝cosine(e₁,e₂)

where e₁ and e₂ are the entity vector representations corresponding to the two entity mentions in the paired entity joint disambiguation process.

2-d) The confidence calculation introduces the similarity between an entity mention and a candidate entity. This similarity combines the prior probability with the cosine similarity between the context vector representation of the entity mention and that of the candidate entity. The context of an entity mention is the bag of words from the cells in the same row and column, and the context of a candidate entity is the bag of words from the entity's text description in the knowledge base. Each context vector representation is the average of all word vectors in the corresponding bag of words, as shown below:

MES(m,e)＝cosine(m_context,e_context)+P(e|m)

where m_context is the context vector representation of the entity mention m, e_context is the context vector representation of the candidate entity e, and P(e|m) is the probability that m links to e.

2-e) The confidence calculation combines the above information. Given a pair of entity mentions m_i, m_j and their corresponding candidate entity sets CS_i, CS_j, the confidence is defined as Γ(m_i, m_j). It consists of two parts: one is the similarity between the elements involved in linking the pair, and the other is the change in row (column) semantic consistency brought about by the link operation; the hyper-parameter β > 0 controls the relative influence of semantic consistency.

The similarity part consists of three terms: the similarity between each of the two entity mentions and its candidate entity, and the correlation between the two candidate entities. The two candidate entities are taken from the candidate entity sets CS_i and CS_j, respectively; MES calculates the similarity between an entity mention and a candidate entity; EES measures the correlation between the entities to be linked. ΔCSC_N and ΔRSC_N denote the normalized changes in column and row semantic consistency after the link operation for the entity mentions m_i, m_j is completed. The normalization is as follows:

Norm(d)＝σ(d)-0.5

where d is the change in semantic consistency; if d > 0, semantic consistency has increased and the confidence value is raised accordingly. σ is the logistic sigmoid function, so the normalization ensures Norm(d) ∈ (-0.5, 0.5).

Advantageous effects: compared with the prior art, the invention has the following advantages:

Current table entity linking work mostly adopts a joint disambiguation strategy that disambiguates multiple entity mentions simultaneously, mainly using probabilistic graphical models, random walk algorithms, iterative optimization strategies and the like. When calculating the similarity between an entity mention and a candidate entity, these methods require the entity to be linked and the entities already linked in the same row and column to be as related as possible. However, entities in the same row that are not in the primary key column often correlate strongly with the entity in the primary key column, but not necessarily with entities in other columns. Likewise, because knowledge graphs are incomplete, linked entities in the same column may not be particularly related. During joint disambiguation, forcing entities with low relevance to be related not only fails to provide mutual information but may also introduce noise, or a mention may fail to link to the correct entity because the knowledge graph is incomplete, which in turn affects the linking of other cells. Simply abandoning the joint disambiguation strategy, however, loses important information and degrades the final entity linking result. Exploiting the characteristics of tables, the invention designs a paired entity joint disambiguation method that iteratively performs joint disambiguation on the pair of entity mentions with the highest confidence and gradually disambiguates all entity mentions in the table. The confidence calculation considers several kinds of information together, which makes the calculation reliable and achieves high-quality joint disambiguation.

Practical results show that the paired entity method provided by the invention can complete entity linking tasks on different types of Web tables, and that it achieves better results in both micro accuracy and macro accuracy.

Drawings

Fig. 1 is a schematic diagram of the framework of the present invention.

Fig. 2 is a schematic diagram of row (column) consistency calculation in the present invention.

FIG. 3 is a diagram illustrating an example of a pair-wise entity linking process in an embodiment of the invention.

Detailed Description

The following detailed description of the embodiments of the invention is provided in connection with the accompanying drawings.

The invention designs a paired entity joint disambiguation algorithm to complete the table entity linking task, which mainly comprises the following steps:

1) Column semantic consistency calculation.

According to the characteristics of the table, the cell contents in the same column have similar semantic characteristics. In the entity linking task, entities linked in the same column generally belong to a certain category together, so that the linked entities have similar vector representations to a certain extent.

Given a column of data in a Web table, the variance is first calculated element by element over the vector representations of the linked entities in the column, yielding a variance vector with the same dimensionality as the entity vectors. The variance vector represents the degree of dispersion of the linked entities in the column; a smaller variance indicates more similar linked entities. The invention defines column semantic consistency as the negative mean of the variance vector: the smaller the variance, the larger the negative mean, the greater the column semantic consistency, and the more similar the linked entities. The calculation process is illustrated by the column semantic calculation in Fig. 2. Given the entities already linked in the column, column semantic consistency is formalized as follows:

CSC＝-mean(var([e₁,e₂,…,e_n]))

where e₁,e₂,…,e_n are the vector representations of the linked entities of a column in the figure above, var computes the variance vector, and mean averages the values of the variance vector to obtain a scalar representing the semantic consistency of the column.
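The CSC formula can be illustrated with a short NumPy sketch. This is only an illustration under stated assumptions: the function name `column_semantic_consistency` and the toy vectors are hypothetical, and the population variance (NumPy's default, `ddof=0`) is assumed because the text does not specify which variance estimator is used.

```python
import numpy as np

def column_semantic_consistency(entity_vectors):
    """CSC = -mean(var([e_1, ..., e_n])) for the linked entities of a column."""
    E = np.asarray(entity_vectors, dtype=float)  # shape (n_entities, dim)
    variance_vector = E.var(axis=0)              # element-wise variance, shape (dim,)
    return -float(variance_vector.mean())        # scalar; closer to 0 means more consistent

# Toy entity vectors (illustrative only).
similar_column = [[1.0, 0.0], [1.1, 0.0], [0.9, 0.0]]
mixed_column = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

csc_similar = column_semantic_consistency(similar_column)
csc_mixed = column_semantic_consistency(mixed_column)
```

A column whose linked entities have nearly identical vectors yields a CSC close to zero, while a column of dissimilar entities yields a more negative value.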

2) Row semantic consistency calculation.

Similarly, the cell contents in the same row of a table also have certain semantic characteristics. Unlike within a column, the different cells of a row usually correspond to linked entities of different types, which do not share similar characteristics. Row semantic consistency is therefore defined here through the relationships between columns, based on what the different columns of the same row have in common across the table.

Row semantic consistency characterizes how consistent the relationships are between the linked entities in the other columns and the linked entities in the primary key column. The primary key column holds the most important content of a row and identifies the row; it usually refers to some entity. To calculate the row semantic consistency between the primary key column and another column, relation vectors are obtained first. A relation vector is the difference between the vector representation of the linked entity in the non-primary key column and that of the linked entity in the primary key column. In general, the relationship between two columns should be the same in every row, so the relation vectors should also be close to each other. The variance of the relation vectors of the different rows is then calculated element by element, yielding a relation variance vector with the same dimensionality as the entity vectors. The variance vector represents the degree of dispersion of the relationships; a smaller variance indicates that the relationships in different rows are more similar. Row semantic consistency is the negative mean of the relation variance vector: the smaller the variance, the larger the negative mean, the closer the relationships of different rows, and the more consistent the row semantics. The calculation process is illustrated by the row semantic calculation in Fig. 2: the relation vectors between the linked entities of the two given columns are calculated first, and row semantic consistency is then calculated from the relation vectors in the same way as column semantic consistency.

r＝e_non-subject − e_subject

RSC＝-mean(var([r₁,r₂,…,r_n]))

The first formula gives the relation vector, where e_subject is the vector representation of the linked entity in the primary key column and e_non-subject is that of the linked entity in a non-primary key column. The second formula defines row semantic consistency RSC.
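The two formulas above can be sketched in NumPy as follows. The function name `row_semantic_consistency` and the toy vectors are hypothetical, and the population variance is again an assumption, since the text does not name a variance estimator.

```python
import numpy as np

def row_semantic_consistency(subject_vectors, non_subject_vectors):
    """RSC = -mean(var([r_1, ..., r_n])) with r = e_non-subject - e_subject.

    Row k of each argument holds the linked entity vector of row k of the table.
    """
    S = np.asarray(subject_vectors, dtype=float)      # primary key column, shape (n, dim)
    N = np.asarray(non_subject_vectors, dtype=float)  # other column, shape (n, dim)
    R = N - S                                         # one relation vector per table row
    return -float(R.var(axis=0).mean())

# Toy data (illustrative only): every row has the same offset [1, 2],
# so the relation is perfectly consistent across rows.
subjects = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
consistent = [[1.0, 2.0], [2.0, 3.0], [3.0, 2.0]]
inconsistent = [[1.0, 2.0], [0.0, 0.0], [5.0, -1.0]]

rsc_consistent = row_semantic_consistency(subjects, consistent)
rsc_inconsistent = row_semantic_consistency(subjects, inconsistent)
```

Because the variance is unchanged when every relation vector is negated, the sign convention chosen for r does not affect the RSC value.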

3) Entity consistency calculation.

Entity consistency is calculated as the cosine similarity of the linked entities:

EES(e₁,e₂)＝cosine(e₁,e₂)

where e₁ and e₂ are the entity vector representations corresponding to the two entity mentions in the paired entity joint disambiguation process.
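A minimal sketch of the EES computation is given below. The function name `ees` is hypothetical, and returning 0.0 for a zero-length vector is an assumption made only to keep the example well defined; the text does not discuss that case.

```python
import numpy as np

def ees(e1, e2):
    """EES(e1, e2) = cosine(e1, e2) for two entity vectors."""
    e1 = np.asarray(e1, dtype=float)
    e2 = np.asarray(e2, dtype=float)
    denom = np.linalg.norm(e1) * np.linalg.norm(e2)
    # Assumption: a zero vector carries no information, so similarity is 0.
    return float(e1 @ e2 / denom) if denom > 0 else 0.0

same_direction = ees([1.0, 0.0], [2.0, 0.0])
orthogonal = ees([1.0, 0.0], [0.0, 3.0])
```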

4) Entity mention and candidate entity similarity calculation.

The similarity between an entity mention and a candidate entity combines the prior probability with the cosine similarity between the context vector representation of the entity mention and that of the candidate entity. The context of an entity mention is the bag of words from the cells in the same row and column, and the context of a candidate entity is the bag of words from the entity's text description in the knowledge base. Each context vector representation is the average of all word vectors in the corresponding bag of words, as shown below:

MES(m,e)＝cosine(m_context,e_context)+P(e|m)

where m_context is the context vector representation of the entity mention m, e_context is the context vector representation of the candidate entity e, and P(e|m) is the probability that m links to e, which is calculated from entity popularity. Within the candidate entity set of an entity mention, different candidate entities tend to differ in importance or popularity. For example, across the Web as a whole, the entity mention "Chicago" is more likely to link to the entity "Chicago" than to the entity "Chicago (Oscar-winning film)". This context-independent feature is very useful for entity linking, and the invention computes the prior probability of an entity mention linking to an entity from Wikipedia statistics. First, <string, entity> pairs are collected from all anchor texts, redirect pages and disambiguation pages; the proportion of a string's occurrences that link to a given entity is then taken as the entity linking prior probability.

Here the string m is treated as an entity mention in entity linking, f(m, e) is the frequency with which the string m and the entity e co-occur, and f(e) is the total number of occurrences of the entity e. The prior statistics are illustrated in the table below.
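The MES computation and the prior probability can be sketched together as follows. This is an illustration under stated assumptions rather than the implementation of the invention: all function names, the toy word vectors and the toy <string, entity> counts are hypothetical. Since the prior formula itself is not reproduced in the text, the sketch normalizes f(m, e) by the total count of the string m, following the description of the prior as the proportion of a string linked to a given entity; this normalizer is an assumption.

```python
from collections import Counter
import numpy as np

def prior_probabilities(pairs):
    """Estimate P(e|m) from (string, entity) pairs.

    Assumption: the count of each pair is divided by the total count of its string.
    """
    pairs = list(pairs)
    pair_counts = Counter(pairs)
    string_counts = Counter(s for s, _ in pairs)
    return {(s, e): c / string_counts[s] for (s, e), c in pair_counts.items()}

def context_vector(words, word_vectors):
    """Average the vectors of the words that have an entry in `word_vectors`."""
    dim = len(next(iter(word_vectors.values())))
    found = [word_vectors[w] for w in words if w in word_vectors]
    return np.mean(found, axis=0) if found else np.zeros(dim)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

def mes(mention_words, entity_words, prior, word_vectors):
    """MES(m, e) = cosine(m_context, e_context) + P(e|m)."""
    m_ctx = context_vector(mention_words, word_vectors)
    e_ctx = context_vector(entity_words, word_vectors)
    return cosine(m_ctx, e_ctx) + prior

# Toy data (illustrative only).
word_vectors = {"city": [1.0, 0.0], "illinois": [0.8, 0.2],
                "film": [0.0, 1.0], "musical": [0.1, 0.9]}
anchors = [("Chicago", "Chicago")] * 8 + [("Chicago", "Chicago (film)")] * 2
prior = prior_probabilities(anchors)

row_and_column_words = ["illinois", "city"]  # bag of words around the mention
mes_city = mes(row_and_column_words, ["city", "illinois"],
               prior[("Chicago", "Chicago")], word_vectors)
mes_film = mes(row_and_column_words, ["film", "musical"],
               prior[("Chicago", "Chicago (film)")], word_vectors)
```

In this toy setting both the context similarity and the prior favor the city entity, so its MES is higher than that of the film entity.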

5) Confidence calculation.

The paired entity joint disambiguation algorithm preferentially selects the pair of entity mentions with the highest confidence for joint disambiguation. The confidence mainly comprises row (column) semantic consistency, the similarity between entity mentions and candidate entities, and the correlation between entities. Given a pair of entity mentions m_i, m_j and their corresponding candidate entity sets CS_i, CS_j, the confidence is defined as Γ(m_i, m_j). The confidence calculation has two parts: one is the similarity between the elements involved in linking the pair, and the other is the change in row (column) semantic consistency brought about by the link operation; the hyper-parameter β > 0 controls the relative influence of semantic consistency.

The similarity part consists of three terms: the similarity between each of the two entity mentions and its candidate entity, and the correlation between the two candidate entities. The two candidate entities are taken from the candidate entity sets CS_i and CS_j, respectively. MES calculates the similarity between an entity mention and a candidate entity, using the deep semantic matching model introduced in section 4.1; EES measures the correlation between the entities to be linked and is calculated as the cosine similarity of pre-trained entity vectors.

In calculating the confidence, the row (column) semantic consistency value is not used directly; the change in row (column) semantic consistency is used instead. ΔCSC_N and ΔRSC_N denote the normalized changes in column and row semantic consistency after the link operation for the entity mentions m_i, m_j is completed. To calculate ΔCSC_N (or ΔRSC_N), the semantic consistency of the column (or row) is calculated before and after the link, and the change in semantic consistency is then normalized.

The normalization is given by the following formula, where d is the change in semantic consistency; if d > 0, semantic consistency has increased and the confidence value is raised accordingly. σ is the logistic sigmoid function, so the normalization ensures Norm(d) ∈ (-0.5, 0.5).

Norm(d)＝σ(d)-0.5
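A minimal sketch of this normalization is shown below. The function name `norm` is hypothetical. The sketch uses the identity σ(d) − 0.5 = 0.5·tanh(d/2), which gives the same value as the formula above while avoiding overflow in the exponential for very negative d; this reformulation is an implementation choice of the sketch, not something stated in the text.

```python
import math

def norm(d):
    """Norm(d) = sigmoid(d) - 0.5, computed as 0.5 * tanh(d / 2)."""
    return 0.5 * math.tanh(d / 2.0)

increase = norm(1.0)    # consistency increased: positive contribution
decrease = norm(-1.0)   # consistency decreased: negative contribution
unchanged = norm(0.0)   # no change: zero contribution
```

The function is odd and monotonically increasing, so equal increases and decreases in semantic consistency contribute with equal magnitude and opposite sign.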

In the example of Fig. 3, after m₄ has been linked to e₄ and m₁₂ to e₁₂, new linked entities are added to the first and third columns. The column semantic consistency of the first column and of the third column, as well as the row semantic consistency between the first and third columns, changes as a result; a positive change means that semantic consistency has increased, and the corresponding confidence takes a higher value.
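The way the consistency change enters the confidence can be sketched as follows. The confidence formula itself is not reproduced in the text, so the additive combination in `pair_score` (the three similarity terms plus β times the normalized consistency changes) is an assumption made only for illustration, as are all function names and toy vectors. Under the same assumption, Γ(m_i, m_j) would be the best `pair_score` over all candidate pairs drawn from CS_i and CS_j.

```python
import math
import numpy as np

def consistency(vectors):
    """-mean(var(...)); defined as 0.0 for fewer than two vectors (assumption)."""
    V = np.asarray(vectors, dtype=float)
    return -float(V.var(axis=0).mean()) if len(V) > 1 else 0.0

def norm(d):
    """Norm(d) = sigmoid(d) - 0.5, written in the equivalent tanh form."""
    return 0.5 * math.tanh(d / 2.0)

def normalized_change(before, after):
    """Normalized change in consistency caused by a link operation."""
    return norm(consistency(after) - consistency(before))

def pair_score(mes_i, mes_j, ees_ij, delta_csc_n, delta_rsc_n, beta=1.0):
    """Assumed additive combination of similarity and consistency change."""
    return mes_i + mes_j + ees_ij + beta * (delta_csc_n + delta_rsc_n)

# Toy column with two entities already linked (illustrative only).
linked_column = [[1.0, 0.0], [0.9, 0.1]]
fitting_candidate = [0.95, 0.05]   # similar to the linked entities
odd_candidate = [0.0, 1.0]         # dissimilar to the linked entities

delta_fit = normalized_change(linked_column, linked_column + [fitting_candidate])
delta_odd = normalized_change(linked_column, linked_column + [odd_candidate])

# With identical similarity terms, the candidate that increases column
# consistency receives the higher score.
score_fit = pair_score(1.2, 1.1, 0.7, delta_fit, 0.0, beta=2.0)
score_odd = pair_score(1.2, 1.1, 0.7, delta_odd, 0.0, beta=2.0)
```

The same `normalized_change` helper applies to row semantic consistency when it is given the relation vectors before and after the link instead of the entity vectors of a column.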

6) Paired entity joint disambiguation algorithm.

Paired entity joint disambiguation in a table is shown in Algorithm 1. The input of the algorithm is all entity mentions in the table and their corresponding candidate entity sets. The algorithm first combines every two entity mentions in the same row or the same column of the table to generate all entity mention pairs, i.e. mpSet, corresponding to lines 1-8 of Algorithm 1. The algorithm then iteratively performs joint disambiguation on the entity mentions:

1) The top function returns, for each entity mention pair (m_i, m_j), the result to be linked and the corresponding confidence, corresponding to lines 9-14 of Algorithm 1; the confidence is calculated by formula 26.

2) The mostConf function sorts all entity mention pairs by confidence, and the entity mention pair with the highest confidence is linked, corresponding to lines 15-16 of Algorithm 1.

The paired entity joint disambiguation algorithm iterates continuously; each iteration completes the linking of at least one entity mention, until all entity mentions are linked. This linking process safeguards the linking quality of the entity mentions and makes the most of joint disambiguation.
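The iteration described above can be sketched as follows. Algorithm 1 is not reproduced in the text, so this is only a simplified illustration: the function name `pairwise_joint_disambiguation`, the `(row, column)` cell keys and the `score` callback are hypothetical, and the callback stands in for the confidence calculation. The toy score below uses made-up prior values only, whereas the method described in the text also uses context similarity, entity correlation and consistency changes.

```python
from itertools import combinations, product

def pairwise_joint_disambiguation(candidates, score):
    """Iteratively link the mention pair with the highest score.

    candidates: dict mapping a (row, column) cell to its candidate entities.
    score: hypothetical callback score(cell_i, e_i, cell_j, e_j, linked) -> float.
    Returns a dict mapping each linked cell to its chosen entity.
    """
    cands = {cell: list(c) for cell, c in candidates.items()}
    linked = {}
    # All pairs of cells that share a row or a column.
    pairs = [(a, b) for a, b in combinations(sorted(cands), 2)
             if a[0] == b[0] or a[1] == b[1]]
    while len(linked) < len(cands):
        best = None
        for a, b in pairs:
            if a in linked and b in linked:
                continue  # both mentions are already linked
            for ea, eb in product(cands[a], cands[b]):
                s = score(a, ea, b, eb, linked)
                if best is None or s > best[0]:
                    best = (s, a, ea, b, eb)
        if best is None:
            break  # no remaining pair can be linked
        _, a, ea, b, eb = best
        for cell, entity in ((a, ea), (b, eb)):
            linked[cell] = entity
            cands[cell] = [entity]  # delete the other candidates of this mention
    return linked

# Toy example (illustrative data only).
toy_prior = {"Chicago": 0.8, "Chicago (film)": 0.2, "Illinois": 1.0,
             "Boston": 0.7, "Boston (band)": 0.3, "Massachusetts": 1.0}

def toy_score(cell_i, e_i, cell_j, e_j, linked):
    return toy_prior[e_i] + toy_prior[e_j]

toy_candidates = {
    (0, 0): ["Chicago", "Chicago (film)"], (0, 1): ["Illinois"],
    (1, 0): ["Boston", "Boston (band)"],   (1, 1): ["Massachusetts"],
}
result = pairwise_joint_disambiguation(toy_candidates, toy_score)
```

Each pass links at least one previously unlinked mention, so the loop terminates after at most as many passes as there are mentions; a mention that is already linked keeps a single candidate and therefore acts as fixed context in later passes.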

The foregoing illustrates and describes the principles, general features, and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the specific embodiments described above, which are intended to further illustrate the principles of the invention, and that various changes and modifications may be made without departing from the spirit and scope of the invention, which is intended to be covered by the appended claims. The scope of the invention is defined by the claims and their equivalents.

Claims

1. A Web table-oriented paired entity joint disambiguation method is characterized by comprising the following steps:

1) combining every two entity mentions in the same row or the same column of a Web table to generate all entity mention pairs;

2) calculating the confidence of linking each entity mention pair, linking the pair of entity mentions with the highest confidence to their respective entities, and deleting the other candidate entities of that pair of entity mentions;

2. The Web table-oriented paired entity joint disambiguation method of claim 1, wherein in step 2), the confidence level is calculated as follows:

2-a) the confidence calculation introduces the change in column semantic consistency during the linking process; column semantic consistency is defined as the negative mean of the variance vector, and the column semantic consistency CSC is calculated as follows:

CSC＝-mean(var([e₁,e₂,…,e_n]))

where e₁,e₂,…,e_n are the vector representations of the linked entities in a column, var computes the variance vector, and mean averages the values of the variance vector to obtain a scalar representing column semantic consistency;

2-b) the confidence calculation introduces the change in row semantic consistency during the linking process; row semantic consistency is defined as the negative mean of the relation variance vector, where the smaller the variance, the larger the negative mean, the closer the relationships of different rows, and the more consistent the row semantics; the row semantic consistency RSC is calculated as follows:

r＝e_non-subject − e_subject

RSC＝-mean(var([r₁,r₂,…,r_n]))

where e_subject is the linked entity in the primary key column, e_non-subject is the linked entity in a non-primary key column, r is the relation vector, var computes the variance vector, mean averages the values of the variance vector to obtain a scalar representing row semantic consistency, and r₁,r₂,…,r_n are the relation vectors formed by the linked entities in different rows;

2-c) the confidence calculation introduces the consistency of entities within the table during the linking process, and the linked entity consistency EES is calculated as the cosine similarity of the entity vector representations:

EES(e₁,e₂)＝cosine(e₁,e₂)

where e₁ and e₂ are the entity vector representations corresponding to the two entity mentions in the paired entity joint disambiguation process;

2-d) the confidence calculation introduces the similarity between an entity mention and a candidate entity, wherein the similarity MES combines the prior probability with the cosine similarity between the context vector representation of the entity mention and that of the candidate entity; the context of an entity mention is the bag of all words in the same row and column, the context of a candidate entity is the bag of all words in the entity's text description in the knowledge base, and each context vector representation is the average of all word vectors in the corresponding bag of words, as shown below:

MES(m,e)＝cosine(m_context,e_context)+P(e|m)

where m_context is the context vector representation of the entity mention m, e_context is the context vector representation of the candidate entity e, and P(e|m) is the probability that m links to e;

2-e) the confidence calculation combines the above information: given a pair of entity mentions m_i, m_j and their corresponding candidate entity sets CS_i, CS_j, the confidence is defined as Γ(m_i, m_j) and comprises two parts, one being the similarity between the elements involved in linking the pair and the other being the change in row/column semantic consistency brought about by the link operation, with a hyper-parameter β > 0 controlling the relative influence of semantic consistency;

the similarity calculation comprises three parts, namely the similarity between each of the two entity mentions and its candidate entity, and the correlation between the two candidate entities, the two candidate entities being taken from the candidate entity sets CS_i and CS_j respectively; MES is used to calculate the similarity between an entity mention and a candidate entity; EES is used to measure linked entity consistency; ΔCSC_N and ΔRSC_N denote the normalized changes in column and row semantic consistency after the link operation for the entity mentions m_i, m_j is completed, the normalization being as follows:

Norm(d)＝σ(d)-0.5

where d is the change in semantic consistency; if d > 0, semantic consistency has increased and the confidence value is raised accordingly; σ is the logistic sigmoid function, so the normalization ensures Norm(d) ∈ (-0.5, 0.5).