CN108776697B - Multi-source data set cleaning method based on predicates

Info

Publication number: CN108776697B
Application number: CN201810578708.3A
Authority: CN
Inventors: 谢子哲; 李论; 刘奇志
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2018-06-06
Filing date: 2018-06-06
Publication date: 2020-06-09
Anticipated expiration: 2038-06-06
Also published as: CN108776697A

Abstract

The invention provides a predicate-based multi-source data set cleaning method that can effectively identify the most reliable data items from isomorphic multi-source data sets, and relates to the fields of data cleaning, data fusion and the like. The method comprises the following steps: 1) mining predicates by an automatic method and filtering the mined predicates; 2) deducing the credibility of the attribute values of each entity in the data set according to the predicates; 3) establishing a relation between attribute value credibility and data source credibility, and calculating the data source credibility; 4) finding the data item with the highest credibility by combining the data source credibility and the attribute value credibility. Given a plurality of data sources, the invention can analyze information that comes from different data sources but describes the same content, filter out redundant, wrong and outdated data, and retain the data with the highest credibility. This lays a solid foundation for subsequent data analysis and is of importance for the efficiency and accuracy of subsequent data processing.

Description

Multi-source data set cleaning method based on predicates

Technical Field

The invention relates to the fields of data cleaning, data fusion and the like, in particular to a multi-source data set cleaning method based on predicates.

Background

In the information age, descriptions of the same event or object can be found in a large number of data sources. Owing to timing errors, format errors, and differences in accuracy, completeness and semantic interpretation, the descriptions of the same entity from different data sources are often inconsistent. After data are collected from different data sources, resolving the inconsistency among the descriptions that belong to the same entity is of great importance for subsequent data analysis. A simple voting strategy, that is, selecting the description supported by the most data sources, is not suitable for the current Web environment; more sophisticated cleaning strategies need to be designed that take into account the credibility of the data sources, the credibility of the data and some prior knowledge. Existing cleaning strategies mainly include the following:

Chinese patent application No. 201410387772 discloses a "system and method for processing bus traffic conditions based on traffic multi-source data fusion", which fuses traffic data describing bus traffic conditions from different data sources to obtain displayable traffic information. The input of that method is specific traffic data; credibility is not judged according to predicates, and the credibility of a data source is not calculated from the relation between the data and the data source.

Chinese patent application No. 201110369877 discloses a multi-source data integration platform and a construction method thereof, which manages different data; the data it handles do not have a consistency problem.

US 8190546 discloses "Dependency between sources in truth discovery", which evaluates the credibility of data sources and data by establishing a probabilistic graph model of the copy relationship between the data sources, and does not involve evaluating the credibility of the data with predicates.

Disclosure of Invention

Purpose of the invention: in the multi-source data consistency problem it is currently difficult to determine the initial value of data credibility, and in multi-source data fusion it is unclear how to combine data source credibility with data credibility. To solve these problems, the invention provides a multi-source data set cleaning method based on data source credibility and data credibility, which cleans the data by setting predicates to calculate the data credibility, then calculating the data source credibility from the data credibility, and finally finding the data with the highest credibility.

The technical scheme is as follows: in order to achieve the technical effects, the invention provides a multi-source data set cleaning method based on predicates, which comprises the following steps:

(1) constructing a predicate model: defining a priority predicate, a state predicate and an interaction predicate, wherein:

the priority predicate Prior(A_i, A_j) indicates that the priority of attribute A_i is higher than the priority of attribute A_j;

the state predicate Stat(A_k) is defined over pairs of records, wherein t_i denotes record i, t_i[A_k] denotes the value of attribute A_k in record i, P(t_i[A_k], t_j[A_k]) denotes a predefined condition on t_i[A_k] and t_j[A_k], and φ(t_i, t_j) denotes a predefined condition satisfied by t_i and t_j; Stat(A_k) indicates that when t_i and t_j satisfy the conditions P and φ, the credibility of t_i on attribute A_k is higher than that of t_j;

the interaction predicate Inter_δ(A₁, …, A_l) indicates that when a record satisfies the condition δ, the values of attributes A₁, …, A_l of that record are of poor quality;

(2) carrying out predicate mining on the data set to be cleaned through the predicate model defined in the step (1) to obtain a priority predicate, a state predicate and an interaction predicate in the data set;

(3) deducing the attribute value credibility of each data in the data set according to the obtained predicate, comprising the following steps:

(3-1) initializing the credibility of all attribute values of the data in the data set to be 0, and setting an influence factor η as a constant for each attribute value of each piece of data;

(3-2) updating the credibility of each attribute value of each piece of data by using the state predicates and the interaction predicates, wherein during updating, the state predicates are firstly used for updating and then the interaction predicates are used for updating, or the interaction predicates are firstly used for updating and then the state predicates are used for updating;

the method for updating the credibility of each attribute value of the data by using the state predicates comprises: enumerating the records in the data set pairwise; if two records t_i and t_j satisfy a state predicate on attribute A_k, subtracting η from the credibility of the value of attribute A_k in t_j, the record that the predicate ranks lower;

the method for updating the credibility of each attribute value of the data by applying the interaction predicates comprises: traversing all records in the data set; if a record satisfies an interaction predicate Inter_δ(A₁, …, A_l), subtracting η from the credibility of the values of attributes A₁, …, A_l of that record;

(3-3) after step (3-2) is finished, updating the attribute value credibility of each record by using the priority predicates, the priority predicates being executed in order from high priority to low priority;

executing a priority predicate Prior(A_i, A_j) comprises: if multiple records have the same attribute value credibility on attribute A_j, sorting them in ascending order of their attribute value credibility on attribute A_i, and adding n-1 to the credibility of the A_j value of the record ranked n-th;

(3-4) after the credibility of all the attribute values is obtained, returning all the attribute values with the credibility being more than or equal to the preset threshold value as results for the multi-value attribute; for the attribute which only needs to return one result, executing the steps (4) to (6);

(4) normalizing the credibility of all attribute values, and calculating the credibility of all data sources in the data set to be cleaned according to the formula

$$\lambda_i = \frac{1}{|D_i|} \sum_{t \in D_i} d(t)$$

wherein λ_i represents the credibility of data source D_i, t denotes a record in data source D_i, and d(t) represents the credibility of record t, which is equal to the sum of the credibility of all attribute values of the record;

(5) updating the credibility of each attribute value according to the formula

$$d(t[A_j]) \leftarrow d(t[A_j]) \cdot \sum_{D_k \in D'} \lambda_k$$

wherein t[A_j] denotes the value of attribute A_j in record t, and D' represents the set of data sources that provide the attribute value t[A_j] for attribute A_j; returning to step (4) after updating;

(6) repeating steps (4) to (5) until the credibility of all attribute values converges; and for each attribute that only needs to return one result, taking the attribute value with the highest credibility under that attribute as the final result.

Further, the priority predicate is defined as follows: for attributes A_i and A_j, if p_score(A_i) < p_score(A_j), the priority predicate Prior(A_i, A_j) is defined, indicating that the priority of attribute A_i is higher than the priority of attribute A_j;

wherein p_score is computed from H(A_i) and p_n(A_i), H(A_i) represents the Shannon entropy of attribute A_i, and p_n(A_i) represents the proportion of null values among all attribute values of attribute A_i.
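The two inputs of p_score named above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names and the sample column are assumptions, and the p_score formula itself is deliberately left out because it does not appear in this text.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """H(A) = -sum_x p(x) * log2 p(x), computed over the non-null values of one attribute."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    total = len(non_null)
    counts = Counter(non_null)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def null_ratio(values):
    """p_n(A): fraction of records whose value for the attribute is null (None)."""
    return sum(v is None for v in values) / len(values)

# Assumed column: four records carrying three distinct values, one of them twice.
salary = ["142k", "120k", "88k", "88k"]
print(shannon_entropy(salary))  # 1.5
print(null_ratio(salary))       # 0.0
```

With proportions 1/4, 1/4 and 1/2 the entropy is 0.5 + 0.5 + 0.5 = 1.5 bits.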

Further, the state predicate and the interaction predicate are obtained by a first-order logic predicate mining method.

Further, before cleaning the data set, manually marking all attributes of all the data sets, marking whether each attribute needs to return one result or a plurality of results, if one attribute only needs to return one result, marking the attribute as a single-value attribute, and taking the attribute value with the highest reliability under the attribute as a final result during cleaning; if a plurality of results may exist in one attribute, marking the attribute as a multi-value attribute, and taking all attribute values with the credibility of the attribute larger than a preset threshold value as final results during cleaning.
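The single-value versus multi-value rule can be sketched as a small selection function. The name `select_results` and the dictionary layout are illustrative assumptions. The comparison uses "greater than or equal to", following step (3-4); the paragraph above says "larger than", so the operator should be adjusted if the strict reading is intended.

```python
def select_results(creds, multi_valued, threshold=0.0):
    """Pick the final result(s) for one attribute.

    creds: mapping from attribute value to its credibility.
    multi_valued: True if the attribute was marked as a multi-value attribute.
    """
    if multi_valued:
        # Multi-value attribute: every value reaching the threshold is returned.
        return [v for v, c in creds.items() if c >= threshold]
    # Single-value attribute: only the most credible value is returned.
    return [max(creds, key=creds.get)]

print(select_results({"a": 1, "b": -1, "c": 0}, multi_valued=True))   # ['a', 'c']
print(select_results({"x": 0.9, "y": 0.1}, multi_valued=False))       # ['x']
```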

Advantageous effects: compared with the prior art, the invention has the following advantages:

Attribute values with high credibility are found by using automatically mined predicates and the relation between the data sources and the attribute values, without assuming that an attribute has only one correct value and without relying on crowdsourcing or a large amount of manual intervention. The method scores the credibility of the attribute values by means of mined custom predicates. For multi-answer attributes, the attribute values whose credibility is higher than the preset threshold are returned as results. For the remaining attributes, the credibility of the attribute values is further updated by combining the relation between data source credibility and attribute value credibility, and the attribute value with the highest credibility is returned as the result. This is of importance for improving the efficiency and accuracy of data analysis. With the technical scheme of the invention, engineers can readily implement the related software.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a schematic diagram of a process for updating the confidence level of attribute values in a data source according to the present invention.

Detailed Description

The present invention will be further described with reference to the accompanying drawings.

Fig. 1 shows a flow chart of the present invention, which mainly includes the following parts:

a) First, three kinds of predicates are defined:

1) Priority predicate: for attributes A_i and A_j, if p_score(A_i) < p_score(A_j), the priority predicate Prior(A_i, A_j) is defined, indicating that the priority of attribute A_i is higher than the priority of attribute A_j.

H(A_i) is calculated as

$$H(A_i) = -\sum_{x \in X} p(x) \log_2 p(x)$$

where X is the value range of attribute A_i and p(x) is the proportion of attribute value x among all attribute values (excluding null values).

2) State predicate: the state predicate is a first-order logic predicate stating that if records t_i and t_j satisfy the conditions P and φ, the credibility of t_i on the attribute concerned is higher than that of t_j.

In the above definition, each condition f_i(v₁, v₂) in φ can be instantiated as v₁ = v₂ or v₁ ≠ v₂. P can be instantiated as any of six predefined predicates, P₁(v₁, v₂) to P₆(v₁, v₂):

P₁ and P₂ apply to numeric attribute values: P₁(v₁, v₂) means v₁ is greater than v₂, and P₂(v₁, v₂) means v₁ is less than v₂;

P₃ and P₄ apply to character-type attribute values: P₃(v₁, v₂) means v₁ is longer than v₂, and P₄(v₁, v₂) means v₁ is shorter than v₂;

P₅ and P₆ also apply to character-type attribute values and express that a string contains more information, i.e. is more detailed; this is measured by using the Shannon entropy formula to compare the information content of the two strings: P₅(v₁, v₂) means v₁ is more detailed than v₂, and P₆(v₁, v₂) means v₁ is more concise than v₂.

3) Interaction predicate: the interaction predicate is a first-order logic predicate of the form Inter_δ(A₁, …, A_l), indicating that when a record satisfies the condition δ, the values of attributes A₁, …, A_l of that record are of poor quality.

In the above definition, each P_i' in δ can be instantiated as any of the predicates P₁ to P₆, or as one of the following four predicates, P₇(v₁, v₂) to P₁₀(v₁, v₂):

P₇ and P₈ apply to character-type attribute values: P₇(v₁, v₂) means v₁ contains v₂, and P₈(v₁, v₂) means v₁ does not contain v₂;

P₉ and P₁₀ apply to both character-type and numeric attribute values: P₉(v₁, v₂) means v₁ is equal to v₂, and P₁₀(v₁, v₂) means v₁ is not equal to v₂.
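The ten comparison predicates can be sketched as plain Python callables. The dictionary name `PREDICATES` and the helper `char_entropy` are illustrative. In particular, measuring "more detailed" by the Shannon entropy of a string's character distribution is an assumption: the text only says that the Shannon entropy formula is used to compare information content.

```python
import math
from collections import Counter

def char_entropy(s):
    """Shannon entropy of the character distribution of s (assumed proxy for 'detail')."""
    if not s:
        return 0.0
    total = len(s)
    return -sum((c / total) * math.log2(c / total) for c in Counter(s).values())

PREDICATES = {
    "P1": lambda v1, v2: v1 > v2,                              # numeric: v1 greater than v2
    "P2": lambda v1, v2: v1 < v2,                              # numeric: v1 less than v2
    "P3": lambda v1, v2: len(v1) > len(v2),                    # string: v1 longer than v2
    "P4": lambda v1, v2: len(v1) < len(v2),                    # string: v1 shorter than v2
    "P5": lambda v1, v2: char_entropy(v1) > char_entropy(v2),  # string: v1 more detailed
    "P6": lambda v1, v2: char_entropy(v1) < char_entropy(v2),  # string: v1 more concise
    "P7": lambda v1, v2: v2 in v1,                             # string: v1 contains v2
    "P8": lambda v1, v2: v2 not in v1,                         # string: v1 does not contain v2
    "P9": lambda v1, v2: v1 == v2,                             # v1 equal to v2
    "P10": lambda v1, v2: v1 != v2,                            # v1 not equal to v2
}

print(PREDICATES["P1"](142, 120))                    # True
print(PREDICATES["P7"]("data integration", "data"))  # True
```

Keeping the predicates in a lookup table lets a mining step enumerate candidate instantiations of P and P_i' by name.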

Predicate mining is then carried out on the data set. For the priority predicates, the p_score formula is used to calculate the priority of all attributes of the data set. The state predicates and interaction predicates are obtained automatically by a first-order inductive learning method according to the definitions of the first-order logic predicates. After the state predicates and interaction predicates are obtained, a domain expert may be asked to remove invalid predicates in order to further improve their usability, yielding the final usable predicates.

b) The credibility of all attribute values is first initialized to 0, and the parameter η (the influence factor) is manually set to a real number; the three kinds of predicates are then executed in the following order to deduce the attribute value credibility of all entities in the data set:

1) Applying the state predicates: the records in the data set are enumerated pairwise; if two records satisfy a state predicate, η is subtracted from the credibility of the attribute value of the record that the predicate ranks lower.

2) Applying the interaction predicates: all records in the data set are traversed; if a record satisfies an interaction predicate Inter_δ(A₁, …, A_l), η is subtracted from the credibility of the values of attributes A₁, …, A_l of that record.

3) Applying the priority predicates: a priority predicate relies on the current credibility of the attribute values, and that credibility should be up to date when it is used, so a priority predicate with higher priority, i.e. one whose two attributes have smaller p_score values, is executed first. The meaning of a priority predicate Prior(A_i, A_j) is that if two records t₁ and t₂ have the same credibility on attribute A_j, the credibility of their values on attribute A_i can be used to judge which of the two A_j values is better. Specifically, all records with the same credibility on A_j are sorted in ascending order of their credibility on A_i, and n-1 is added to the credibility of the A_j value of the record ranked n-th, so that A_j values with the same credibility are distinguished according to the higher-priority attribute A_i. Note also that for attributes that need to return multiple results, priority predicates need not be applied to values whose credibility is negative.

And after the credibility of all the attribute values is obtained, returning all the attribute values with the credibility being more than or equal to the preset threshold value as the result for the attribute needing to return a plurality of results, and continuing the following steps for the attribute needing to return a result.

c) The credibility of each data source is calculated as the average credibility of the records it provides, and the credibility of all data sources is normalized so that Σ_i λ_i = 1, as shown in FIG. 2. The credibility of each attribute value is then updated: the new credibility of an attribute value equals the sum of the credibility of the data sources providing that value multiplied by the original credibility of the value. For a null value, only the credibility of the data source providing that particular null value is used; the credibility of other data sources that also provide a null value is not included. The credibility of all attribute values is then normalized as well, so that for any attribute the credibilities of all its possible values sum to 1. These steps are repeated until the credibility of the data sources and the credibility of each attribute value converge.

d) Finally, for each attribute that only needs to return one result, the attribute value with the highest credibility under that attribute is taken as the result, and it is combined with the results from step b) to form the final result.

The embodiments of the present invention will be described below with reference to specific examples:

Let the credibility of a data source D_i be λ_i and the attribute set of the data source be {A₁, …, A_n}. Let t ∈ D_i be a record of the data source, where t[A_j] denotes the value of t for attribute A_j. Let d(t) denote the credibility of the record and d(t[A_j]) the credibility of the attribute value t[A_j]. The credibility of a record is equal to the sum of the credibility of all its attribute values, namely:

$$d(t) = \sum_{j=1}^{n} d(t[A_j])$$

The credibility of a data source is equal to the average of the credibility of all the records it contains, namely the sum of the credibility of all records divided by the number of records:

$$\lambda_i = \frac{1}{|D_i|} \sum_{t \in D_i} d(t)$$

Meanwhile, let D' be the set of data sources that provide the attribute value t[A_j] for attribute A_j. The credibility of the attribute value t[A_j] is the sum of the credibility of all data sources providing the value multiplied by its own original credibility:

$$d(t[A_j]) \leftarrow d(t[A_j]) \cdot \sum_{D_k \in D'} \lambda_k$$

Example: the data set to be cleaned is shown in the following table. It contains 5 records from 5 data sources, where t_i comes from data source D_i, and each record describes a researcher named Mary.

Cleaning data set table

First, the data set is briefly inspected and manually preprocessed to eliminate obviously unreasonable data, so that the subsequent data cleaning is more efficient and more effective.

For example, the Salary value of t₅ in the above data set is negative, which is obviously unreasonable, and the values of its Research Area, Affiliation and Publication attributes are meaningless, so t₅ can be deleted directly as noise and excluded from the subsequent cleaning.

The Publication attribute of t₄ has the value "-", which is also unreasonable; however, because the other attribute values of t₄ are still of reference value, the Publication value can simply be changed to "null".

After the above simple pre-processing, the resulting data set is as follows:

Step 1: predicate mining is carried out on the data set.

Mining priority predicates

For the priority predicates, the entropy and the proportion of null values are computed for all attributes.

The entropy is calculated as:

$$H(A_i) = -\sum_{x \in X} p(x) \log_2 p(x)$$

where p(x) is the proportion of attribute value x among all attribute values (null values excluded).

Taking Salary as an example, there are 3 attribute values: 142k, 120k and 88k. The entropy of the attribute Salary is obtained from their proportions: [formula not reproduced]

Similarly: [formula not reproduced]

p_n(Salary) = p_n(ResearchArea) = 0

It follows that: [formula not reproduced]

Three priority predicates can then be defined according to the less-than relationship: [formula not reproduced]

Mining state predicates:

The state predicates are obtained automatically by the first-order logic predicate mining algorithm First Order Inductive Learner (FOIL): [formula not reproduced]

Mining interaction predicates:

The interaction predicates are likewise obtained automatically by First Order Inductive Learner: [formula not reproduced]

Step 2: the attribute value credibility of all entities is deduced.

The credibility of all attribute values is initialized to 0, the influence factor η is set to 1, and the credibility threshold for attributes that need to return multiple results is set to 0. The predicates are then applied in a certain order (different execution orders may produce different results).

State predicates and interaction predicates act on the attribute values themselves, and the attribute values do not change, so the state predicates and interaction predicates are mutually independent and can be applied in any order.

Priority predicates, however, act on the credibility of the attribute values, so they must be applied after all state and interaction predicates.

In addition, a certain order must be observed among the priority predicates themselves. So that the credibility they rely on is up to date, a priority predicate with higher priority, namely one whose two attributes have a smaller p_score sum, must be executed first. For the three priority predicates of the cleaning data set, the p_score sums are 3.5, 3.5 and 4.11 respectively, and the priority predicates are executed in that order.

After the state predicates are executed, the credibility of all attribute values is as shown in Table 1. There are 4 records and they are compared pairwise, giving 16 comparisons in total. Taking t₁ and t₂ as an example, the two records satisfy the state predicate on Salary, so 1 is subtracted from the Salary credibility of t₂.

TABLE 1

	Salary	Research Area	Affiliation	Publication
t₁(D₁)	0	0	0	0
t₂(D₂)	-1	0	0	0
t₃(D₃)	-2	0	0	0
t₄(D₄)	-2	0	0	0

After the interaction predicates are executed, the credibility of all attribute values is as shown in Table 2. Here, according to the two interaction predicates, the credibility of every Affiliation and Publication value that is null is reduced by 1.

TABLE 2

	Salary	Research Area	Affiliation	Publication
t₁(D₁)	0	0	0	0
t₂(D₂)	-1	0	0	-1
t₃(D₃)	-2	0	0	0
t₄(D₄)	-2	0	-1	-1
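How Tables 1 and 2 arise can be sketched as follows. The record values and the concrete predicates below are assumptions: the cleaned data set and the mined predicates are not reproduced in this text, so placeholders were chosen that are consistent with the two tables (a higher Salary is treated as more credible, and null Affiliation or Publication values are treated as poor quality).

```python
from itertools import permutations

ETA = 1  # influence factor η, set to 1 in the worked example

# Placeholder records (assumed); None stands for a null value.
records = [
    {"Salary": 142, "Research Area": "r1", "Affiliation": "a1", "Publication": "p1"},
    {"Salary": 120, "Research Area": "r2", "Affiliation": "a2", "Publication": None},
    {"Salary": 88,  "Research Area": "r3", "Affiliation": "a3", "Publication": "p3"},
    {"Salary": 88,  "Research Area": "r4", "Affiliation": None, "Publication": None},
]
attrs = ["Salary", "Research Area", "Affiliation", "Publication"]
cred = [{a: 0 for a in attrs} for _ in records]

def apply_state(attr, holds):
    """For each ordered pair (t_i, t_j) satisfying the predicate, subtract η from t_j's value."""
    for i, j in permutations(range(len(records)), 2):
        if holds(records[i], records[j]):
            cred[j][attr] -= ETA

def apply_interaction(delta, affected):
    """Subtract η from the affected attributes of every record satisfying δ."""
    for k, rec in enumerate(records):
        if delta(rec):
            for a in affected:
                cred[k][a] -= ETA

# Hypothetical state predicate: t_i is more credible on Salary when its Salary is higher.
apply_state("Salary", lambda ti, tj: ti["Salary"] > tj["Salary"])
table1 = [[c[a] for a in attrs] for c in cred]

# Hypothetical interaction predicates: a null value is of poor quality.
apply_interaction(lambda r: r["Affiliation"] is None, ["Affiliation"])
apply_interaction(lambda r: r["Publication"] is None, ["Publication"])
table2 = [[c[a] for a in attrs] for c in cred]

print(table1)  # [[0, 0, 0, 0], [-1, 0, 0, 0], [-2, 0, 0, 0], [-2, 0, 0, 0]]
print(table2)  # [[0, 0, 0, 0], [-1, 0, 0, -1], [-2, 0, 0, 0], [-2, 0, -1, -1]]
```

The sketch skips self-comparisons, which can never satisfy a strict predicate, so it visits 12 ordered pairs rather than the 16 comparisons counted in the text.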

After the priority predicates are executed, the credibility of all attribute values is as shown in Table 3. Taking Research Area as an example, the Research Area column initially has the values {0, 0, 0, 0}. According to the priority predicate relating Salary and Research Area, the Research Area column can be updated from the Salary column {0, -1, -2, -2}: the records with the same credibility in the Research Area column are sorted in ascending order of the Salary column, 0, 1, 2 are added according to the ranking, and restoring the original record order gives {2, 1, 0, 0}. The remaining priority predicates are then executed in the same way.

TABLE 3

	Salary	Research Area	Affiliation	Publication
t₁(D₁)	0	2	2	1
t₂(D₂)	-1	1	1	-1
t₃(D₃)	-2	0	0	0
t₄(D₄)	-2	0	-1	-1
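The tie-breaking performed by a priority predicate can be sketched as below. Two points are assumptions read off the tables rather than stated in the text: records that are also tied on A_i share the same rank (t₃ and t₄ both receive +0), and Salary is used as the tie-breaking attribute for all three columns. The function name and the `skip_negative` flag are illustrative.

```python
def apply_priority(cred_i, cred_j, skip_negative=False):
    """Prior(A_i, A_j): break ties in A_j credibility using A_i credibility.

    cred_i, cred_j: per-record credibility lists for A_i and A_j.
    skip_negative: leave tie groups with negative credibility untouched
                   (the note for attributes that return multiple results).
    """
    updated = list(cred_j)
    groups = {}
    for idx, c in enumerate(cred_j):
        groups.setdefault(c, []).append(idx)
    for c, members in groups.items():
        if len(members) < 2 or (skip_negative and c < 0):
            continue
        # Ascending ranks of the A_i credibility inside the tie group; equal values share a rank.
        levels = sorted({cred_i[m] for m in members})
        for m in members:
            updated[m] = c + levels.index(cred_i[m])
    return updated

salary = [0, -1, -2, -2]                                           # Salary column of Table 2
print(apply_priority(salary, [0, 0, 0, 0]))                        # [2, 1, 0, 0]
print(apply_priority(salary, [0, 0, 0, -1]))                       # [2, 1, 0, -1]
print(apply_priority(salary, [0, -1, 0, -1], skip_negative=True))  # [1, -1, 0, -1]
```

The three printed lists match the Research Area, Affiliation and Publication columns of Table 3.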

All attributes are then marked. A researcher can only have one salary and one affiliation at a time, so Salary and Affiliation each return only one result; a researcher can, however, have more than one research area and more than one publication, so the Research Area and Publication attributes should return multiple results. For the multi-value attributes, all attribute values whose credibility is greater than or equal to the threshold 0 are returned at this point; that is, for Research Area the returned result is { Data integration, Data clarification & Google Knowledge management information retrieval }, and for Publication the returned result is { Data integration, adaptive tool for Data errors }.

Step 3: the credibility of the data sources is calculated.

The credibility of all attribute values is first mapped into (0, 1) and normalized; the results are shown in Table 4. The credibility of all data sources is then calculated as the average credibility of the records each source provides.

TABLE 4

	Salary	Research Area	Affiliation	Publication	λ
t₁(D₁)	0.496353	0.33723	0.369959	0.413275	0.404204
t₂(D₂)	0.26698	0.2799	0.307065	0.152035	0.251495
t₃(D₃)	0.118333	0.191435	0.210014	0.282655	0.200609
t₄(D₄)	0.118333	0.191435	0.112963	0.152035	0.143692
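The mapping formula is not reproduced in this text. As an observation rather than a statement of the patent's formula, the Table 4 values are numerically consistent with applying the logistic sigmoid 1/(1+e^(-x)) to the Table 3 scores and then normalizing each column to sum to 1. The sketch below assumes exactly that and also computes λ as the rescaled per-source row sum (each source contributes one record here).

```python
import math

# Table 3 scores; columns: Salary, Research Area, Affiliation, Publication.
table3 = [
    [0, 2, 2, 1],
    [-1, 1, 1, -1],
    [-2, 0, 0, 0],
    [-2, 0, -1, -1],
]

def sigmoid(x):
    """Logistic function mapping any real score into (0, 1) (assumed mapping)."""
    return 1.0 / (1.0 + math.exp(-x))

def normalize_columns(rows):
    """Apply the sigmoid, then rescale each column so that it sums to 1."""
    mapped = [[sigmoid(v) for v in row] for row in rows]
    col_sums = [sum(col) for col in zip(*mapped)]
    return [[v / s for v, s in zip(row, col_sums)] for row in mapped]

def source_credibility(rows):
    """λ_i: record credibility d(t) per source, rescaled so that the λ_i sum to 1."""
    totals = [sum(row) for row in rows]
    grand_total = sum(totals)
    return [t / grand_total for t in totals]

table4 = normalize_columns(table3)
lam = source_credibility(table4)
print([round(v, 6) for v in table4[0]])
print([round(v, 6) for v in lam])
```

Under these assumptions the first row comes out close to 0.496353, 0.33723, 0.369959, 0.413275 and λ close to 0.404204, 0.251495, 0.200609, 0.143692, in line with Table 4.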

Finally, the data source credibility formula and the attribute value credibility update formula are applied iteratively until the credibility of all attribute values converges; each time the credibility of a column is updated, the credibility of that column needs to be normalized. Taking the first update as an example, for the attribute value credibility of the Salary column:

{0.496353, 0.26698, 0.118333, 0.118333}

→ {0.496353×0.404204, 0.26698×0.251495, 0.118333×0.344301, 0.118333×0.344301}

→ {0.200628, 0.0671441, 0.0407422, 0.0407422}

The factor 0.344301 is the sum λ₃ + λ₄ = 0.200609 + 0.143692, since t₃ and t₄ provide the same Salary value.

Similarly, for the attribute value credibility of the Research Area column:

{0.33723, 0.2799, 0.191435, 0.191435} → {0.500009, 0.258216, 0.140871, 0.100903}

For the attribute value credibility of the Affiliation column:

{0.369959, 0.307065, 0.210014, 0.112963} → {0.404204, 0.251495, 0.200609, 0.143692}

For the attribute value credibility of the Publication column:

{0.413275, 0.152035, 0.282655, 0.152035} → {0.588542, 0.134713, 0.199777, 0.0769686}

Finally, the credibility of the data sources is updated:

λ = {0.5168, 0.209168, 0.164478, 0.109554}
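The first update can be checked with a short sketch. The value labels are placeholders, since the data set table is not reproduced; they only encode which records share a value (t₃ and t₄ share a Salary value, the Research Area values are assumed distinct, and t₂ and t₄ have a null Publication). The function names are illustrative.

```python
def update_column(cred, values, lam):
    """d(t[A_j]) <- d(t[A_j]) * (sum of λ over the sources providing the same value).

    A null value (None) is supported only by its own source.
    """
    out = []
    for i, (c, v) in enumerate(zip(cred, values)):
        if v is None:
            support = lam[i]
        else:
            support = sum(l for l, w in zip(lam, values) if w == v)
        out.append(c * support)
    return out

def normalize(xs):
    """Rescale a column so that its entries sum to 1."""
    s = sum(xs)
    return [x / s for x in xs]

lam = [0.404204, 0.251495, 0.200609, 0.143692]  # λ column of Table 4

salary_raw = update_column([0.496353, 0.26698, 0.118333, 0.118333],
                           ["142k", "120k", "88k", "88k"], lam)
research = normalize(update_column([0.33723, 0.2799, 0.191435, 0.191435],
                                   ["r1", "r2", "r3", "r4"], lam))
publication = normalize(update_column([0.413275, 0.152035, 0.282655, 0.152035],
                                      ["p1", None, "p3", None], lam))

print([round(x, 6) for x in salary_raw])
print([round(x, 6) for x in research])
print([round(x, 6) for x in publication])
```

The Salary products are printed before normalization, matching the intermediate line shown above; the other two columns are printed after normalization.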

the above process was repeated until convergence, and the final results are shown in table 5.

TABLE 5

	Salary	Research Area	Affiliation	Publication	λ
t₁(D₁)	1	1	1	1	1
t₂(D₂)	0	0	0	0	0
t₃(D₃)	0	0	0	0	0
t₄(D₄)	0	0	0	0	0
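The full iteration of steps (4) to (6) can be sketched as a loop that alternates between the two formulas until the values stop changing. The starting credibilities are taken from Table 4; the value labels are placeholders that only encode which records share a value, and the stopping tolerance is an assumption, since the text does not give a convergence criterion.

```python
def iterate(cred, values, tol=1e-9, max_iter=1000):
    """cred[i][j]: credibility of record i (one record per source) on attribute j.
    values[i][j]: the attribute value itself, None for null. Returns (cred, lam)."""
    n, m = len(cred), len(cred[0])
    for _ in range(max_iter):
        # Step (4): source credibility, rescaled so that the λ_i sum to 1.
        totals = [sum(row) for row in cred]
        lam = [t / sum(totals) for t in totals]
        # Step (5): multiply by the support of each value, then normalize each column.
        new = [[0.0] * m for _ in range(n)]
        for j in range(m):
            col = []
            for i in range(n):
                v = values[i][j]
                support = lam[i] if v is None else sum(
                    lam[k] for k in range(n) if values[k][j] == v)
                col.append(cred[i][j] * support)
            s = sum(col)
            for i in range(n):
                new[i][j] = col[i] / s
        # Step (6): stop once no credibility changes by more than tol.
        delta = max(abs(new[i][j] - cred[i][j]) for i in range(n) for j in range(m))
        cred = new
        if delta < tol:
            break
    totals = [sum(row) for row in cred]
    return cred, [t / sum(totals) for t in totals]

table4 = [
    [0.496353, 0.33723, 0.369959, 0.413275],
    [0.26698, 0.2799, 0.307065, 0.152035],
    [0.118333, 0.191435, 0.210014, 0.282655],
    [0.118333, 0.191435, 0.112963, 0.152035],
]
values = [  # placeholder labels (assumed)
    ["142k", "r1", "a1", "p1"],
    ["120k", "r2", "a2", None],
    ["88k", "r3", "a3", "p3"],
    ["88k", "r4", None, None],
]
final_cred, final_lam = iterate(table4, values)
print([round(c, 3) for c in final_cred[0]], round(final_lam[0], 3))
```

Because every update multiplies a value by the credibility of its supporting sources and then renormalizes, the leading record gains weight at each pass, which is why the process ends in the winner-take-all pattern of Table 5.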

Step 4: the result is obtained.

According to Table 5, the best attribute values of the Salary and Affiliation attributes can be selected, i.e. the attribute value with the highest credibility is the result. The Salary result is {142k} and the Affiliation result is {Amazon}.

The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims

1. A multi-source data set cleaning method based on predicates is characterized by comprising the following steps:

priority predicate: for attributes A_i and A_j, if p_score(A_i) < p_score(A_j), defining a priority predicate Prior(A_i, A_j), indicating that the priority of attribute A_i is higher than the priority of attribute A_j;

wherein H(A_i) represents the Shannon entropy of attribute A_i, and p_n(A_i) represents the proportion of null values among all the attribute values of attribute A_i;

the state predicates are as follows: [formula not reproduced]

wherein t_i denotes record i, and P represents a predefined condition on the attribute values of the two records;

then η is subtracted from the credibility of the attribute value;

the method for updating the credibility of each attribute value of the data by applying the interaction predicates comprises: traversing all records in the data set; if a record satisfies an interaction predicate Inter_δ(A₁, …, A_l), subtracting η from the credibility of the values of attributes A₁, …, A_l of that record;

(5) updating the credibility of each attribute value according to the formula [formula not reproduced], wherein λ_k represents the credibility of a data source and D' represents the set of data sources providing the attribute value for attribute A_j; returning to step (4) after updating;

2. The method for cleaning a multi-source data set based on a predicate of claim 1, wherein the state predicate and the interaction predicate are both obtained by a first-order logic predicate mining method.

3. The multi-source data set cleaning method based on predicates of claim 2, wherein before cleaning the data set, all attributes of all the data sets are manually marked, whether each attribute needs to return one result or a plurality of results is marked, if one attribute only needs to return one result, the attribute is marked as a single-value attribute, and the attribute value with the highest reliability under the attribute is taken as a final result during cleaning; if a plurality of results may exist in one attribute, marking the attribute as a multi-value attribute, and taking all attribute values with the credibility of the attribute larger than a preset threshold value as final results during cleaning.