CN106909679B

CN106909679B - Asymptotic entity identification method based on multi-path block division

Info

Publication number: CN106909679B
Application number: CN201710122912.XA
Authority: CN
Inventors: 申德荣; 孙琛琛; 寇月; 聂铁铮; 于戈
Original assignee: Northeastern University China
Current assignee: Northeastern University China
Priority date: 2017-03-03
Filing date: 2017-03-03
Publication date: 2020-02-07
Anticipated expiration: 2037-03-03
Also published as: CN106909679A

Abstract

The invention relates to an asymptotic entity identification method based on multi-path blocking, which comprises the following steps: generating intersecting blocks by multi-path blocking, eliminating block redundancy by constructing a block graph, initializing the block credits and the candidate pair credits, sorting the candidate pairs by credit, and inserting them into a candidate queue in order; then iterating the following three steps until the candidate queue is empty, while progressively outputting the identified duplicate data object pairs: (1) processing the candidate pair at the head of the candidate queue, (2) updating the credits of some of the candidate pairs according to the identification result, and (3) reordering the candidate queue according to the updated credits. With this method, more duplicate data objects can be identified within a short time budget; the credits of the candidate pairs are updated by dynamically estimating the redundancy of the blocks, and the candidate pair most likely to match is selected in real time for identification, which ensures highly asymptotic behavior.

Description

Asymptotic entity identification method based on multi-path block division

Technical Field

The invention belongs to the field of data quality and data integration, and mainly relates to a progressive entity identification method based on multi-path block division.

Background

In the big data era, an important characteristic of data is diversity: data objects describing the same real-world entity may appear repeatedly, in different forms, in one or more data sources. This lowers data quality, reduces the usability and value of the data, and becomes a bottleneck in big data integration, processing, analysis, and mining. Entity identification is an important aspect of data quality: by analyzing a dirty data set, it places duplicate data objects that describe the same entity into the same group, thereby improving data quality. Entity identification typically deals with structured data objects, including data records in relational databases, CSV files, XML files, and the like. Entity identification is also known as entity resolution, entity matching, record linkage, duplicate detection, record deduplication, reference disambiguation, merge/purge, and the like. It is widely needed in many fields, including customer relationship management, census, healthcare, online price comparison, national security, citation databases, spam detection, linked data, machine reading, and the like. The existence of data redundancy is the direct reason entity identification is needed. Data redundancy falls into two categories: (1) data objects describing the same real-world entity may be added to the same data source multiple times, which is referred to as redundancy within a single data source; (2) when data from multiple data sources are integrated, data objects from different data sources may correspond to the same entity, which is referred to as redundancy across data sources.

Entity identification mainly comprises three steps: data blocking, data object similarity calculation, and the data object pair matching decision. First, data blocking, also called data indexing, reduces the search space, avoids useless data object comparisons, and speeds up identification; data blocking is an optional step. Second, calculating the similarity between data objects is a key part of entity identification: the higher the similarity of a data object pair, the more likely the pair matches; a similarity function is used for this calculation. Finally, once the similarity is obtained, it is used to decide whether the data objects match (are duplicates), and various matching decision methods currently exist.

As a necessary preprocessing step for data mining and data analysis, the conventional entity identification method takes the whole dirty data set as input and outputs the identification result only after processing is complete. However, many applications now require (near) real-time data analysis, which conventional entity identification techniques cannot provide. Asymptotic entity identification optimizes the identification result as far as possible within a short given time, and thereby addresses this need. For example, information flow applications for financial news emphasize timeliness so that their audience can carry out the corresponding financial transactions promptly. Stock market data changes very quickly, and new data is generated at short intervals; financial data may involve the names of a large number of companies and individuals, and it is not possible to identify all of them in a short period of time. Information flow applications nevertheless need to identify a large number of company and individual names in a short time before financial news is released. To address such requirements, a new entity identification method should process as many matching data object pairs as possible within a given short period of time.

Asymptotic entity identification. Compared with conventional entity identification, asymptotic entity identification must additionally satisfy the following two conditions: (1) Better early results. Given any short time t, the asymptotic entity identification method identifies more duplicate data object pairs than the conventional entity identification method, where t is much less than the running time of complete entity identification. (2) The same final result. If both the conventional method and the asymptotic method are run until they terminate naturally, both produce the same identification result.

Disclosure of Invention

Aiming at the defects of the existing asymptotic entity identification method, the invention provides an efficient asymptotic entity identification method based on multi-path blocks.

The technical scheme adopted by the invention is as follows:

An asymptotic entity identification method based on multi-path blocking comprises the following steps:

Step 1. Multi-path blocking. The main purpose of this step is to use a set of blocking keys K = {k_i | 0 ≤ i ≤ |K|} to generate a multi-path blocking result. It consists of the following two substeps.

Substep 1. Single-path blocking. Given a dirty data set R = {r} and a blocking key k, each data object r is assigned to a block b = d(r.k) according to its key value r.k, where d(·) is an allocation function. The blocks of the resulting blocking result set B are pairwise disjoint: b₁ ∩ b₂ = ∅ for any two distinct blocks b₁, b₂ ∈ B.

Substep 2. Multi-path blocking. Multi-path blocking is performed using the results of substep 1. Given a dirty data set R and a set of blocking keys K = {k_i | 0 ≤ i ≤ |K|}, let B_i be the single-path blocking result set obtained with key k_i; the multi-path blocking result set generated from K is then B_m＝B₁∪B₂∪…∪B_|K|. The blocks of B_m may intersect, and each data object appears in at most |K| different blocks.
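The two substeps above can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function names are hypothetical, records are assumed to be dictionaries, the allocation function d defaults to the identity, and skipping records whose key value is missing is an assumption.

```python
from collections import defaultdict

def single_path_blocking(records, key, d=lambda value: value):
    """Partition records into disjoint blocks b = d(r.k) for one blocking key."""
    blocks = defaultdict(set)
    for rid, record in records.items():
        value = record.get(key)
        if value is None:
            continue  # assumption: a record with no value for the key joins no block
        blocks[(key, d(value))].add(rid)
    return dict(blocks)

def multi_path_blocking(records, keys):
    """Union of the single-path results; blocks built from different keys may overlap."""
    b_m = {}
    for key in keys:
        b_m.update(single_path_blocking(records, key))
    return b_m

records = {
    "r1": {"city": "Poston", "age": 29},
    "r2": {"city": "Boston", "age": 29},
    "r3": {"city": "Boston", "age": None},
}
B_m = multi_path_blocking(records, ["city", "age"])
for name, members in sorted(B_m.items(), key=str):
    print(name, sorted(members))
```

Block names are (key, value) tuples, so blocks produced by different keys never collide even when their values coincide.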

Step 2. Candidate queue generation. The main purpose of this step is to remove block redundancy and generate a candidate queue. It consists of the following four substeps.

Substep 1. Block credit initialization. A pair of data objects in a block is called a candidate data object pair, or candidate pair, and is denoted <r_i,r_j>, r_i,r_j ∈ b. A block is a set of data objects, and the potential of block b is the total number of pairs of distinct data objects in b, denoted ||b||＝|b|(|b|−1)/2.

In the entity identification process, all data object pairs in a block b that have already been identified form the identified set, denoted Ξ(b). All matching (duplicate) data object pairs in Ξ(b) form the matched set, denoted Ξ⁺(b). Given a block b, the credit of the block is positively correlated with the current size of its matched set and negatively correlated with its potential:

σ_d(b)＝(|Ξ⁺(b)|+1)/(||b||+1) (1)

The block credit of each block in the multi-path blocking result set B_m is calculated using formula (1).
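As a small sketch of formula (1), the following Python computes the potential of a block as its number of distinct data object pairs and derives the block credit from it. The function names and the representation of the matched set as a list of pairs are assumptions made for illustration.

```python
from fractions import Fraction

def potential(block):
    """||b||: number of pairs of distinct data objects in block b."""
    n = len(block)
    return n * (n - 1) // 2

def block_credit(block, matched_pairs=()):
    """Formula (1): (size of the matched set + 1) / (potential + 1)."""
    in_block = sum(1 for pair in matched_pairs if set(pair) <= set(block))
    return Fraction(in_block + 1, potential(block) + 1)

b_small = {"r1", "r3", "r4"}
b_large = {"r1", "r2", "r4", "r6", "r7"}
print(block_credit(b_small))                   # 1/4
print(block_credit(b_large))                   # 1/11
print(block_credit(b_small, [("r1", "r4")]))   # 1/2
```

Using exact fractions keeps the values comparable with fractions such as 1/4 and 1/11; a float would serve equally well for ranking.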

Substep 2. Block redundancy elimination. The redundancy introduced by multi-path blocking is eliminated by constructing a block graph. Given a multi-path blocking result set B_m, there is an undirected graph G = (V, E), called the block graph. V is the node set, and each node v ∈ V corresponds to a data object in B_m. E is the edge set; for any edge e(v_i,v_j) ∈ E (denoted e_ij), the data objects v_i and v_j co-occur in at least one block of B_m. Both r and v may denote data objects, and both R and V may denote data sets. The candidate pair set P is generated by traversing the edges of the block graph; the two data objects joined by each edge correspond to a unique candidate pair.
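A minimal sketch of this deduplication, assuming blocks are given as Python sets of object identifiers: the edge set of the block graph is stored as a set of ordered tuples, so a pair that co-occurs in several blocks is kept only once. The function name is hypothetical.

```python
from itertools import combinations

def build_block_graph(blocks):
    """Return the edge set E of the block graph: one edge per co-occurring pair."""
    edges = set()
    for block in blocks:
        # sorting makes <v_i, v_j> and <v_j, v_i> map to the same tuple
        for v_i, v_j in combinations(sorted(block), 2):
            edges.add((v_i, v_j))
    return edges

blocks = [{"a", "b", "c"}, {"b", "c", "d"}]
P = build_block_graph(blocks)
print(len(P))  # 5
```

The two blocks contain six pairs in total, but <b,c> occurs in both, so five candidate pairs remain.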

Substep 3. Candidate pair credit initialization. Given a candidate pair <r_i,r_j>, its credit estimates the likelihood that the pair matches. The credit of a candidate pair aggregates the matching likelihoods provided by the blocks in which the pair co-occurs and is normalized by the total number of keys:

σ_d(<r_i,r_j>)＝(Σ_{b∈B_m, r_i,r_j∈b} σ_d(b))/|K| (2)

The credit of each candidate pair in the candidate pair set P is calculated using formula (2).

Substep 4. Candidate pair sorting. The candidate pairs in the candidate pair set P are sorted in descending order of credit and inserted into the candidate queue Q in that order.
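Substeps 3 and 4 can be sketched as below. The sketch assumes that formula (2) is the sum of the credits of the blocks in which a pair co-occurs, divided by the number of blocking keys; this reading reproduces the credit values given in the detailed description but is an assumption. Function names and the tie-breaking rule are illustrative only.

```python
from fractions import Fraction
from itertools import combinations

def pair_credit(pair, blocks, block_credits, num_keys):
    """Assumed formula (2): sum of co-occurrence block credits divided by |K|."""
    total = sum(block_credits[name] for name, members in blocks.items()
                if set(pair) <= members)
    return total / num_keys

def build_candidate_queue(blocks, block_credits, num_keys):
    """Return the deduplicated candidate pairs in descending order of credit."""
    pairs = {p for members in blocks.values() for p in combinations(sorted(members), 2)}
    # ties are broken by pair name only to make the order deterministic (an assumption)
    return sorted(pairs, key=lambda p: (-pair_credit(p, blocks, block_credits, num_keys), p))

blocks = {"x": {"a", "b", "c"}, "y": {"a", "b"}}
block_credits = {"x": Fraction(1, 4), "y": Fraction(1, 2)}  # initial values from formula (1)
Q = build_candidate_queue(blocks, block_credits, num_keys=2)
print(Q)  # [('a', 'b'), ('a', 'c'), ('b', 'c')]
```

The pair <a,b> co-occurs in both blocks and therefore reaches the head of the queue.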

Step 3. Iterative processing of candidate data object pairs. The main purpose of this step is to process the candidate pairs iteratively and progressively output the newly identified duplicate data object pairs.

Substep 1. Candidate pair comparison. A candidate pair <v_i,v_j> is taken from the head of the candidate queue Q and compared using the entity identification matching function. If <v_i,v_j> is determined to be a duplicate, the following operations are performed: the Look-around function is called to directly identify more candidate pairs, and the identified duplicate data object pairs are output.

Look-around function: when the duplicate data object pairs <v_i,v_j> and <v_j,v_k> have been identified, <v_i,v_k> is directly determined to be a duplicate data object pair as well, which saves one data object pair comparison.
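One possible way to realize the Look-around function is a union-find structure over the data objects: once <v_i,v_j> and <v_j,v_k> are merged, <v_i,v_k> is recognized as a duplicate without a comparison. The text does not prescribe a data structure, so union-find is a swapped-in technique here, and the class and method names are hypothetical.

```python
class MatchClusters:
    """Union-find over data objects; objects in one cluster are treated as duplicates."""

    def __init__(self):
        self.parent = {}

    def find(self, v):
        self.parent.setdefault(v, v)
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]  # path halving
            v = self.parent[v]
        return v

    def union(self, v_i, v_j):
        """Record that <v_i, v_j> was identified as a duplicate pair."""
        self.parent[self.find(v_i)] = self.find(v_j)

    def already_matched(self, v_i, v_k):
        """True if <v_i, v_k> follows from earlier matches by transitivity."""
        return self.find(v_i) == self.find(v_k)

clusters = MatchClusters()
clusters.union("v1", "v2")
clusters.union("v2", "v3")
print(clusters.already_matched("v1", "v3"))  # True
print(clusters.already_matched("v1", "v4"))  # False
```

Before comparing the pair at the head of the queue, a caller could check `already_matched` and skip the matching function when it returns True.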

Substep 2. Candidate pair credit update. Based on the identification result, the blocks containing the data object pair most recently identified as a duplicate are found; these are called affected blocks. Because the proportion of data object pairs identified as duplicates in each affected block has increased, the dynamic block credits of the affected blocks are updated according to formula (1). The unidentified candidate pairs contained in the affected blocks are called affected candidate pairs, and their credits are updated with the new block credits according to formula (2).

Substep 3. Candidate queue adjustment. The candidate queue is re-sorted in descending order according to the new candidate pair credits.

The above three sub-steps are repeated until the time budget is exhausted or the candidate queue is empty.
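The iterative loop of the three substeps can be sketched as a Python generator. This is a simplified illustration under several assumptions: formula (2) is taken to be the sum of co-occurrence block credits divided by the number of keys, the queue is re-ranked by recomputing all credits instead of adjusting only the affected pairs, the budget is counted in comparisons, and the Look-around shortcut is omitted. All names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def progressive_identify(blocks, num_keys, is_match, budget):
    """Yield duplicate pairs as they are found, using at most `budget` comparisons."""
    potential = {n: len(m) * (len(m) - 1) // 2 for n, m in blocks.items()}
    matched = {n: 0 for n in blocks}  # size of each block's matched set
    pending = {p for m in blocks.values() for p in combinations(sorted(m), 2)}

    def credit(pair):
        # assumed formula (2): sum of co-occurrence block credits divided by |K|
        total = sum(Fraction(matched[n] + 1, potential[n] + 1)
                    for n, m in blocks.items() if set(pair) <= m)
        return total / num_keys

    comparisons = 0
    while pending and comparisons < budget:
        head = min(pending, key=lambda p: (-credit(p), p))  # highest credit first
        pending.remove(head)
        comparisons += 1
        if is_match(*head):
            for n, m in blocks.items():
                if set(head) <= m:
                    matched[n] += 1  # affected block: its credit rises via formula (1)
            yield head

blocks = {"name": {"a", "b", "c"}, "city": {"a", "b", "d"}}
truth = {("a", "b")}
found = list(progressive_identify(blocks, 2, lambda x, y: (x, y) in truth, budget=3))
print(found)  # [('a', 'b')]
```

Because results are yielded as soon as they are found, a caller can stop consuming the generator whenever its time budget runs out and still keep every duplicate identified so far.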

The invention has the advantages that: by adopting the asymptotic entity identification method, more repeated data objects can be identified given a shorter time budget (which is far lower than the total time of entity identification); the credit of the candidate pairs is updated by dynamically estimating the redundancy of the blocks, and the candidate pair which is most likely to be matched is selected in real time for identification, so that high asymptotic property is ensured.

Drawings

FIG. 1 is a general flow diagram of the present invention.

FIG. 2 is the block graph G_B corresponding to the block set in step 2 of the detailed description.

FIG. 3 is a graph comparing the real-time recall rate of the present invention with two other methods known in the art.

FIG. 4 is a graph comparing the asymptotic behavior of the present invention with that of two other methods.

Detailed Description

The following is an example of one embodiment of the present invention.

Table 1 shows a sample data set containing 7 records. This is a dirty data set, and the corresponding ground-truth identification result is {{r₁,r₂,r₃,r₄},{r₅},{r₆},{r₇}}. The goal is to identify this dirty data set asymptotically, that is, to identify as many duplicate record pairs as possible within a short run time.

Table 1. Sample dirty data set containing 7 person records with the attributes name, age, job, and city.

ID	Name	Age	Job	City
r₁	John Young	29	Waiter	Poston
r₂	John Joung	29	Waiter	Boston
r₃	Jon Young	-	Waiter	Boston
r₄	John Young	29	Waiter	Boston
r₅	Bob Brown	27	Waiter	Austin
r₆	Jeff Allen	29	-	Boston
r₇	Will Green	29	Teacher	Boston

1. First, multi-path blocking is performed. For the dirty data set in Table 1, the surname, age, job, and city are each used as a blocking key, which produces the following multi-path blocking result set:

B_m＝B_surname∪B_age∪B_job∪B_city

B_surname＝{b_s1＝{r₁,r₃,r₄},b_s2＝{r₂},b_s3＝{r₅},b_s4＝{r₆},b_s5＝{r₇}}

B_age＝{b_a1＝{r₁,r₂,r₄,r₆,r₇},b_a2＝{r₅}}

B_job＝{b_j1＝{r₁,r₂,r₃,r₄,r₅},b_j2＝{r₇}}

B_city＝{b_c1＝{r₂,r₃,r₄,r₆,r₇},b_c2＝{r₁},b_c3＝{r₅}}

2. Redundancy is then eliminated by building the block graph. The block set B_m above contains 33 candidate pairs in total, some of which are redundant. For example, the candidate pair <r₁,r₄> appears in blocks b_s1, b_a1, and b_j1 at the same time. Building the block graph removes the blocking redundancy of B_m. FIG. 2 shows the block graph G_B corresponding to B_m; each edge of G_B corresponds to a unique candidate pair. As FIG. 2 shows, after the blocking redundancy is removed, the number of candidate pairs is reduced from 33 to 19.
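The two counts can be checked with a few lines of Python. The sketch below lists only the four blocks of B_m that contain more than one record, since single-record blocks contribute no pairs; the variable names are illustrative.

```python
from itertools import combinations

blocks = {
    "b_s1": {"r1", "r3", "r4"},
    "b_a1": {"r1", "r2", "r4", "r6", "r7"},
    "b_j1": {"r1", "r2", "r3", "r4", "r5"},
    "b_c1": {"r2", "r3", "r4", "r6", "r7"},
}
per_block = [set(combinations(sorted(members), 2)) for members in blocks.values()]
total_with_redundancy = sum(len(pairs) for pairs in per_block)
unique_pairs = set().union(*per_block)  # edges of the block graph
print(total_with_redundancy, len(unique_pairs))  # 33 19
```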

3. Then, the block credits and the candidate pair credits are initialized and the candidate queue is generated. The initial block credits and candidate pair credits are calculated as follows:

Block credits: σ_d(b_s1)＝1/4，σ_d(b_a1)＝1/11，σ_d(b_j1)＝1/11，σ_d(b_c1)＝1/11。

Candidate pair credits (in descending order): σ_d(<r₁,r₄>)＝0.108，σ_d(<r₃,r₄>)＝0.108，σ_d(<r₁,r₃>)＝0.085，σ_d(<r₂,r₄>)＝0.068，σ_d(<r₆,r₇>)＝0.045，σ_d(<r₂,r₃>)＝0.045，σ_d(<r₁,r₂>)＝0.045，…

According to the candidate pair credits, <r₁,r₄> or <r₃,r₄> should be processed first. The ground-truth identification result {{r₁,r₂,r₃,r₄},{r₅},{r₆},{r₇}} shows that both candidate pairs are duplicates (matches). The initial ordering of the candidate pairs is therefore very effective.

The candidate pairs are sorted in descending order of credit and inserted into the candidate queue in that order.
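The initial credits listed above can be reproduced with the following sketch. It assumes that the credit of a candidate pair is the sum of the credits of the blocks in which the pair co-occurs, divided by the number of blocking keys (4 here); this assumption matches every value listed in this step. Names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

blocks = {
    "b_s1": {"r1", "r3", "r4"},
    "b_a1": {"r1", "r2", "r4", "r6", "r7"},
    "b_j1": {"r1", "r2", "r3", "r4", "r5"},
    "b_c1": {"r2", "r3", "r4", "r6", "r7"},
}
NUM_KEYS = 4  # surname, age, job, city

def block_credit(members, matched=0):
    """Formula (1) with `matched` duplicate pairs already identified in the block."""
    n = len(members)
    return Fraction(matched + 1, n * (n - 1) // 2 + 1)

def pair_credit(pair):
    """Assumed pair credit: co-occurrence block credits summed, divided by |K|."""
    return sum(block_credit(m) for m in blocks.values() if set(pair) <= m) / NUM_KEYS

pairs = {p for m in blocks.values() for p in combinations(sorted(m), 2)}
ranked = sorted(pairs, key=lambda p: (-pair_credit(p), p))
for p in ranked[:4]:
    print(p, round(float(pair_credit(p)), 3))
```

The first four printed pairs are <r₁,r₄>, <r₃,r₄>, <r₁,r₃>, and <r₂,r₄>, with credits 0.108, 0.108, 0.085, and 0.068.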

4. Finally, the iterative asymptotic processing stage begins; it is examined here round by round. Table 2 presents the first 6 iterations of the method on the dirty data set of Table 1. According to the ground-truth identification result {{r₁,r₂,r₃,r₄},{r₅},{r₆},{r₇}}, each of the first 6 rounds identifies a duplicate data object pair. Therefore, if the entity identification budget is set to 6 data object pair comparisons, the method identifies all duplicate data object pairs within the budget, which shows that it is highly asymptotic.

Table 2. Block credits and the candidate pair at the head of the candidate queue in each iteration.

Iteration	σ_d(b_s1)	σ_d(b_a1)	σ_d(b_j1)	σ_d(b_c1)	Candidate pair at head of queue
1	1/4	1/11	1/11	1/11	<r₁,r₄>
2	2/4	2/11	2/11	1/11	<r₃,r₄>
3	3/4	2/11	3/11	2/11	<r₁,r₃>
4	1	2/11	4/11	2/11	<r₂,r₄>
5	1	3/11	5/11	3/11	<r₂,r₃>
6	1	3/11	6/11	4/11	<r₁,r₂>
…	…	…	…	…	…
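The block credit columns of Table 2 can be replayed from formula (1) by feeding in the six head-of-queue pairs in the listed order and treating each as a match. The sketch below only recomputes the block credits; it does not re-derive the queue order, and its variable names are illustrative.

```python
from fractions import Fraction

blocks = {
    "b_s1": {"r1", "r3", "r4"},
    "b_a1": {"r1", "r2", "r4", "r6", "r7"},
    "b_j1": {"r1", "r2", "r3", "r4", "r5"},
    "b_c1": {"r2", "r3", "r4", "r6", "r7"},
}
potential = {n: len(m) * (len(m) - 1) // 2 for n, m in blocks.items()}
matched = {n: 0 for n in blocks}
heads = [("r1", "r4"), ("r3", "r4"), ("r1", "r3"),
         ("r2", "r4"), ("r2", "r3"), ("r1", "r2")]

rows = []  # rows[i] holds the block credits at the start of iteration i + 1
for pair in heads:
    rows.append({n: Fraction(matched[n] + 1, potential[n] + 1) for n in blocks})
    for n, m in blocks.items():
        if set(pair) <= m:
            matched[n] += 1  # the pair is a duplicate, so the block is affected

for i, row in enumerate(rows, start=1):
    print(i, [str(row[n]) for n in blocks])
```

Fractions are printed in lowest terms, so 2/4 appears as 1/2 and 4/4 as 1.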

Claims

1. An asymptotic entity identification method based on multi-path blocking, characterized in that the method comprises the following steps:

step 1. multi-path blocking: a multi-path blocking result is generated using a set of blocking keys K = {k_i | 0 ≤ i ≤ |K|}, specifically as follows:

step 1-1. given a dirty data set R = {r} and a blocking key k, R is divided into blocks b = d(r.k) according to the key value r.k of each data object r; d(·) is an allocation function, and the blocks of the blocking result set B are disjoint, that is, b₁ ∩ b₂ = ∅ for any two distinct blocks b₁,b₂∈B；

step 1-2. multi-path blocking is performed using the results of step 1-1: given a dirty data set R = {r} and a set of blocking keys K = {k_i | 0 ≤ i ≤ |K|}, B_i is the single-path blocking result set obtained with key k_i, and the multi-path blocking result set generated from K is B_m＝B₁∪B₂∪…∪B_|K|; the blocks of B_m may intersect, and each data object appears in at most |K| different blocks;

step 2, generating a candidate queue, removing block redundancy and generating the candidate queue, wherein the method comprises the following steps:

step 2-1. block credit initialization: a pair of data objects in a block is called a candidate data object pair or candidate pair and is denoted <r_i,r_j>, r_i,r_j∈b; a block is a set of data objects, and the potential of block b is the total number of pairs of distinct data objects in b, denoted ||b||＝|b|(|b|−1)/2;

in the entity identification process, all identified data object pairs in a block b constitute the identified set, denoted Ξ(b), and all matching data object pairs in Ξ(b) constitute the matched set, denoted Ξ⁺(b); given a block b, the credit of the block is positively correlated with the current size of its matched set and negatively correlated with its potential;

σ_d(b)＝(|Ξ⁺(b)|+1)/(||b||+1) (1)

the block credit of each block in the multi-path blocking result set B_m is calculated using formula (1);

step 2-2. block redundancy elimination: the redundancy introduced by multi-path blocking is eliminated by constructing a block graph; given a multi-path blocking result set B_m, there is an undirected graph G = (V, E), called the block graph, where V is the node set and each node v ∈ V corresponds to a data object in B_m, and E is the edge set, such that for any edge e(v_i,v_j) ∈ E the data objects v_i and v_j co-occur in at least one block of B_m; both r and v may denote data objects, and both R and V may denote data sets; the candidate pair set P is generated by traversing the edges of the block graph, and the two data objects joined by each edge correspond to a unique candidate pair;

step 2-3. candidate pair credit initialization: given a candidate pair <r_i,r_j>, its credit estimates the likelihood that the pair matches; the credit of the candidate pair aggregates the matching likelihoods provided by the blocks in which the pair co-occurs and is normalized by the total number of keys;

σ_d(<r_i,r_j>)＝(Σ_{b∈B_m, r_i,r_j∈b} σ_d(b))/|K| (2)

the credit of each candidate pair in the candidate pair set P is calculated using formula (2);

step 2-4, sorting the candidate pairs, namely sorting the candidate pairs in the candidate pair set P in a descending order according to the credit degree, and sequentially inserting the candidate pairs into a candidate queue Q;

step 3. iteratively processing the candidate data object pairs, and gradually outputting the newly identified duplicate data object pairs, the method being as follows:

step 3-1. candidate pair comparison: a candidate pair <v_i,v_j> is taken from the head of the candidate queue Q and compared using the entity identification matching function; if <v_i,v_j> is determined to be a duplicate, the following operations are performed: the Look-around function is called to directly identify more candidate pairs, and the identified duplicate data object pairs are output;

Look-around function: when the duplicate data object pairs <v_i,v_j> and <v_j,v_k> have been identified, <v_i,v_k> is directly determined to be a duplicate data object pair, which saves one data object pair comparison;

step 3-2, updating the credit of the candidate pairs, finding out the blocks where the newly identified repeated data object pairs are located according to the identification result, namely the affected blocks, wherein the dynamic block credits of the affected blocks are updated according to formula (1) as the proportion of the affected blocks identified as repeated data objects is increased, the unidentified candidate pairs contained in the affected blocks are called the affected candidate pairs, and the credits of the affected candidate pairs are updated by using the new block credits according to formula (2);

step 3-3, adjusting the candidate queues, and rearranging the candidate queues in a descending order according to the credit degree of the new candidate pairs;

2. The method of claim 1, characterized in that: in step 3, the candidate pairs in the candidate queue are identified iteratively, and the order of the candidate queue is adjusted dynamically according to the identification result, so that the candidate pair most likely to match is selected in real time for identification.