CN108776697A - A predicate-based multi-source dataset cleaning method

Info

Publication number: CN108776697A
Application number: CN201810578708.3A
Authority: CN
Inventors: 谢子哲; 李论; 刘奇志
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2018-06-06
Filing date: 2018-06-06
Publication date: 2018-11-09
Anticipated expiration: 2038-06-06
Also published as: CN108776697B

Abstract

The present invention provides a predicate-based multi-source dataset cleaning method that effectively identifies the most reliable data items in homogeneous multi-source datasets, and relates to fields such as data cleaning and data fusion. The method comprises: 1) mining predicates automatically and filtering the mined predicates; 2) deriving the confidence of every attribute value of each entity in the dataset from the predicates; 3) establishing the relationship between attribute value confidence and data source confidence, and computing the data source confidence; 4) combining data source confidence and attribute value confidence to find the data items with the highest confidence. Given multiple data sources, the invention analyzes information that comes from different sources but describes the same content, filters out redundant, erroneous and outdated data, and keeps the data with the highest confidence. This lays a foundation for subsequent data analysis and is of great significance to the efficiency and accuracy of follow-up data processing.

Description

A predicate-based multi-source dataset cleaning method

Technical field

The present invention relates to fields such as data cleaning and data fusion, and in particular to a predicate-based multi-source dataset cleaning method.

Background art

In the information age, descriptions of the same event or object can be found in a large number of data sources. Because of timing errors, format errors, and differences in accuracy, completeness and semantic ambiguity, descriptions of the same entity from different data sources are often inconsistent. After data have been collected from different sources, resolving the inconsistencies among descriptions that belong to the same entity is essential for subsequent data analysis. A simple voting strategy, which selects the description supported by the most data sources, is not suitable for today's Web environment; instead, data source confidence, the confidence of the data themselves and some prior knowledge must be considered in designing more sophisticated cleaning strategies. Existing cleaning strategies mainly include the following:

Chinese patent application No. 201410387772 discloses a "bus road condition processing system and method based on traffic multi-source data fusion", which fuses traffic data describing bus road conditions from different data sources to obtain road condition information for display. Its input is specialized traffic data; it neither judges confidence by means of predicates nor computes data source confidence from the relationship between data and data sources.

Chinese patent application No. 201110369877 discloses "a multi-source data integration platform and its construction method", which manages different data; no consistency problem exists among those data.

United States patent US 8190546 discloses "Dependency between sources in truth discovery", which builds a probabilistic graphical model from the copying relationships between data sources to assess the confidence of data sources and data; it does not use predicates to assess data confidence.

Summary of the invention

Purpose of the invention: In multi-source data fusion, data describing the same entity are often inconsistent; in particular, it is difficult to determine initial values of data confidence, and it is unclear how to combine data source confidence with data confidence. To overcome these problems, the present invention provides a multi-source dataset cleaning method based on data source confidence and data confidence: data confidence is evaluated with predicates, data source confidence is then computed from data confidence, and finally the data with the highest confidence are found, achieving the goal of data cleaning.

Technical solution: To achieve the above technical effect, the present invention proposes a predicate-based multi-source dataset cleaning method comprising the following steps:

(1) Build the predicate model: define priority predicates, state predicates and interaction predicates, where:

A priority predicate is Prior(A_i, A_j), indicating that the priority of attribute A_i is higher than that of attribute A_j;

A state predicate Stat(A_k) is built from a condition P(t_i^{A_k}, t_j^{A_k}) and a condition φ(t_i, t_j), where t_i denotes record i, t_i^{A_k} denotes the value of attribute A_k in record i, P(t_i^{A_k}, t_j^{A_k}) denotes a predefined condition between t_i^{A_k} and t_j^{A_k}, and φ(t_i, t_j) denotes a predefined condition between t_i and t_j; Stat(A_k) indicates that when t_i and t_j satisfy conditions P and φ, the quality of t_i is higher than that of t_j;

An interaction predicate is Inter_δ(A₁, ..., A_l), indicating that when a record satisfies condition δ, the values of its attributes A₁, ..., A_l are of poor quality;

(2) Using the predicate model defined in step (1), mine predicates from the dataset to be cleaned, obtaining the priority predicates, state predicates and interaction predicates of the dataset;

(3) Derive the attribute value confidence of every record in the dataset from the mined predicates, comprising the steps:

(3-1) Initialize the confidence of all attribute values of all records in the dataset to 0, and set an impact factor η for the attribute values of the records, where η is a constant;

(3-2) Update the confidence of every attribute value of every record with the state predicates and the interaction predicates; the state predicates may be applied first and the interaction predicates second, or the other way round;

The step of updating attribute value confidence with state predicates is: enumerate all pairs of records t_i and t_j in the dataset; if t_i and t_j satisfy a state predicate Stat(A_k) on attribute A_k, subtract η from the confidence of the lower-quality attribute value t_j^{A_k};

The step of updating attribute value confidence with interaction predicates is: traverse all records in the dataset; if a record satisfies an interaction predicate Inter_δ(A₁, ..., A_l), subtract η from the confidence of the values of its attributes A₁, ..., A_l;

(3-3) After step (3-2) is completed, update the attribute value confidence of every record with the priority predicates, executing the priority predicates in order of priority from high to low;

The step of executing a priority predicate Prior(A_i, A_j) is: if several records have the same attribute value confidence on attribute A_j, sort them in ascending order of their A_i attribute value confidence, and add n-1 to the A_j attribute value confidence of the record ranked n-th;

(3-4) After the confidence of all attribute values has been obtained, for multi-valued attributes, return all attribute values whose confidence is greater than or equal to a preset threshold as the result; for attributes that need to return only one result, execute steps (4) to (6);

(4) Normalize the confidence of all attribute values; compute the confidence of all data sources in the dataset to be cleaned according to the formula λ_i = (1/|D_i|) ∑_{t∈D_i} d(t), where λ_i denotes the confidence of data source D_i, t denotes a record in D_i, and d(t) denotes the confidence of record t, which equals the sum of the confidence of all attribute values of that record;

(5) Update the confidence of each attribute value according to the formula c(t^{A_j}) ← c(t^{A_j}) · ∑_{D_k∈D′} λ_k, where c(t^{A_j}) denotes the confidence of attribute value t^{A_j} and D′ denotes the set of data sources that provide the attribute value t^{A_j} for attribute A_j; return to step (4) after the update;

(6) Repeat steps (4) and (5) until the confidence of all attribute values converges; for attributes that need to return only one result, the attribute value with the highest confidence under the attribute is the final result.

Further, the priority predicate is defined as follows: for attributes A_i and A_j, if p_score(A_i) < p_score(A_j), define the priority predicate Prior(A_i, A_j), indicating that the priority of attribute A_i is higher than that of attribute A_j (a smaller p_score means a higher priority). Here p_score(A_i) is computed from H(A_i), the Shannon entropy of attribute A_i, and p_n(A_i), the proportion of null values among all values of attribute A_i.

Further, the state predicates and interaction predicates are obtained by a first-order logic predicate mining method.

Further, before the dataset is cleaned, all attributes of all datasets are manually labeled to indicate whether each attribute needs to return one result or multiple results. If an attribute needs to return only one result, it is labeled a single-valued attribute, and during cleaning the attribute value with the highest confidence under it is the final result. If an attribute may have multiple results, it is labeled a multi-valued attribute, and during cleaning all attribute values under it whose confidence is greater than the preset threshold are the final result.

Beneficial effects: Compared with the prior art, the present invention has the following advantages:

The invention does not need to assume that an attribute has only one correct value, does not depend on crowdsourcing, and requires no extensive manual intervention; it uses automatically mined predicates and the relationship between the dataset and attribute values to find attribute values with high confidence. The invention scores the confidence of attribute values by mining user-defined predicates. For multi-answer attributes, it returns the attribute values whose confidence reaches the preset threshold as the result; for the remaining attributes, it further updates attribute value confidence using the relationship between data source confidence and attribute value confidence, and returns the attribute value with the highest confidence as the result. This is of great significance for improving the efficiency and accuracy of data analysis. With the technical solution of the invention, engineers can implement the related software relatively easily.

Description of the drawings

Fig. 1 is the flow chart of the present invention；

Fig. 2 is a schematic diagram of the computation flow for updating data source confidence and attribute value confidence in the present invention.

Detailed description of embodiments

The present invention is further described below with reference to the accompanying drawings.

Fig. 1 shows the flow chart of the present invention, which mainly comprises the following parts:

A) First, define three kinds of predicates:

1) Priority predicate: for attributes A_i and A_j, if p_score(A_i) < p_score(A_j), define a priority predicate Prior(A_i, A_j), indicating that the priority of attribute A_i is higher than that of attribute A_j (a smaller p_score means a higher priority). Here p_score(A_i) is computed from H(A_i), the Shannon entropy of attribute A_i, and p_n(A_i), the proportion of null values among all values of attribute A_i.

H(A_i) is computed as H(A_i) = -∑_{x∈X} p(x) log₂ p(x), where X is the value domain of attribute A_i and p(x) is the proportion of attribute value x among all attribute values (null values excluded).

2) State predicate: a state predicate is a first-order logic predicate Stat(A_k) stating that

if t_i and t_j satisfy conditions P and φ, then the quality of t_i is higher than that of t_j.

In the condition φ of the above definition, each term f_i(v₁, v₂) can be instantiated as v₁ = v₂ or v₁ ≠ v₂. P in the state predicate definition is instantiated by one of six predefined predicates, P₁(v₁, v₂) to P₆(v₁, v₂). P₁ and P₂ apply to numeric attribute values: P₁(v₁, v₂) means v₁ is greater than v₂, and P₂(v₁, v₂) means v₁ is less than v₂. P₃ and P₄ apply to string attribute values: P₃(v₁, v₂) means v₁ is longer than v₂, and P₄(v₁, v₂) means v₁ is shorter than v₂. P₅ and P₆ also apply to string attribute values: a more detailed string is one that carries more information, measured by comparing the information content of the two strings with the Shannon entropy formula; P₅(v₁, v₂) means v₁ is more detailed than v₂, and P₆(v₁, v₂) means v₁ is more concise than v₂.

3) Interaction predicate: an interaction predicate is a first-order logic predicate of the form Inter_δ(A₁, ..., A_l), indicating that when a record satisfies condition δ, the values of its attributes A₁, ..., A_l are of poor quality.

In the condition δ of the above definition, each term P_i′ can be instantiated by any of the predicates P₁ to P₆, or by one of the following four predicates: P₇(v₁, v₂), P₈(v₁, v₂), P₉(v₁, v₂), P₁₀(v₁, v₂). P₇ and P₈ apply to string attribute values: P₇(v₁, v₂) means v₁ contains v₂, and P₈(v₁, v₂) means v₁ does not contain v₂. P₉ and P₁₀ apply to both string and numeric attribute values: P₉(v₁, v₂) means v₁ equals v₂, and P₁₀(v₁, v₂) means v₁ does not equal v₂.

Predicates are then mined from the dataset. Priority predicates are obtained by computing the priority score p_score of all attributes of the dataset. State predicates and interaction predicates are obtained automatically by first-order inductive learning according to their first-order logic definitions. After the state and interaction predicates have been obtained, domain experts may be asked to remove invalid predicates, which further improves the usability of the predicates and yields the final set of usable predicates;
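The ten atomic comparison predicates P₁ to P₁₀ described above can be sketched as plain two-argument functions. This is only an illustrative sketch: the dictionary name `PREDICATES` and the helper `string_entropy` are hypothetical, and measuring "more detailed" by character-level Shannon entropy is an assumption, since the text does not state the granularity at which the entropy of a string is computed.

```python
import math
from collections import Counter


def string_entropy(s: str) -> float:
    """Character-level Shannon entropy of s in bits (granularity is an assumption)."""
    if not s:
        return 0.0
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in Counter(s).values())


# Hypothetical lookup table for the atomic predicates P1..P10.
PREDICATES = {
    "P1": lambda v1, v2: v1 > v2,              # numeric: v1 greater than v2
    "P2": lambda v1, v2: v1 < v2,              # numeric: v1 less than v2
    "P3": lambda v1, v2: len(v1) > len(v2),    # string: v1 longer than v2
    "P4": lambda v1, v2: len(v1) < len(v2),    # string: v1 shorter than v2
    "P5": lambda v1, v2: string_entropy(v1) > string_entropy(v2),  # v1 more detailed
    "P6": lambda v1, v2: string_entropy(v1) < string_entropy(v2),  # v1 more concise
    "P7": lambda v1, v2: v2 in v1,             # string: v1 contains v2
    "P8": lambda v1, v2: v2 not in v1,         # string: v1 does not contain v2
    "P9": lambda v1, v2: v1 == v2,             # equal
    "P10": lambda v1, v2: v1 != v2,            # not equal
}
```

Keeping the predicates in a table makes it straightforward for a mining procedure to try each candidate against pairs of values, although the mining algorithm itself is not shown here.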

B) First initialize the confidence of all attribute values to 0 and manually set the parameter η (the impact factor) to a real number. Then apply the three classes of predicates in the following order to derive the confidence of every attribute value of each entity in the dataset:

1) Apply the state predicates: enumerate all pairs of records in the dataset; if a pair t_i, t_j satisfies a state predicate Stat(A_k), subtract η from the confidence of the lower-quality value t_j^{A_k}.

2) Apply the interaction predicates: traverse all records in the dataset; if a record satisfies an interaction predicate Inter_δ(A₁, ..., A_l), subtract η from the confidence of its A₁, ..., A_l attribute values.

3) Apply the priority predicates: since the priority of an attribute is determined by its p_score, the priority predicates with higher priority, i.e. those whose two attributes have the smaller sum of p_score, must be executed first so that the current attribute value confidence is up to date. The function of a priority predicate Prior(A_i, A_j) is: if two records t₁ and t₂ have the same confidence on A_j, the confidence on A_i can be used to judge which of t₁^{A_j} and t₂^{A_j} is better. The method is: for all records with the same confidence on A_j, sort them in ascending order of their confidence on A_i and add n-1 to the A_j attribute value confidence of the record ranked n-th, so that A_j values with the same confidence are distinguished by the higher-priority attribute A_i. Note also that for attributes that need to return multiple results, attribute values with negative confidence do not need the priority predicate.

After the confidence of all attribute values has been obtained, for attributes that need to return multiple results, return all attribute values whose confidence is greater than or equal to the preset threshold as the result; for attributes that need to return one result, continue with the following steps.

C) Compute the data source confidence according to λ_i = (1/|D_i|) ∑_{t∈D_i} d(t), and normalize the confidence of all data sources so that ∑_i λ_i = 1, as shown in Fig. 2. Then update the confidence of each attribute value with c(t^{A_j}) ← c(t^{A_j}) · ∑_{D_k∈D′} λ_k, where c(t^{A_j}) denotes the confidence of attribute value t^{A_j} and D′ the set of data sources providing it: the confidence of each attribute value equals the sum of the confidence of the data sources that provide it, multiplied by its own previous confidence. Note that for a null value, the data source confidence used is only that of its own data source and does not include the confidence of other data sources that also provide null. Then likewise normalize the confidence of all attribute values so that, for any attribute, the confidence of all values it may take sums to 1. Repeat these steps until the confidence of the data sources and of every attribute value converges.

D) Finally, for attributes that need to return only one result, the attribute value with the highest confidence under the attribute is the final result; together with the results from B), these form the final result.
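The final selection rule for single-valued and multi-valued attributes can be sketched as follows. The function name `select_results` is hypothetical, and skipping null values when returning multi-valued results is an assumption not stated in the text.

```python
def select_results(values, confidences, multi_valued, threshold=0.0):
    """Pick the result values of one attribute from per-record confidences.

    values: attribute value of each record (None stands for null).
    confidences: confidence of each of those values.
    """
    if multi_valued:
        # Multi-valued attribute: keep every distinct non-null value whose
        # confidence is greater than or equal to the threshold.
        results = []
        for value, conf in zip(values, confidences):
            if value is not None and conf >= threshold and value not in results:
                results.append(value)
        return results
    # Single-valued attribute: keep only the value with the highest confidence.
    best = max(range(len(values)), key=lambda r: confidences[r])
    return [values[best]]


# Placeholder publication values with the Table 3 confidences of the embodiment.
print(select_results(["pub_1", None, "pub_3", None], [1, -1, 0, -1], True))
```

With threshold 0 the call returns the two non-null publication values, which matches the number of publications returned in the embodiment.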

Embodiments of the present invention are illustrated below with a concrete example:

Let the confidence of a data source D_i be λ_i, let its attribute set be {A₁, ..., A_n}, and let t ∈ D_i be a record of this data source, where t^{A_j} denotes the value of t on attribute A_j. Let d(t) denote the confidence of the record and c(t^{A_j}) the confidence of attribute value t^{A_j}. The confidence of a record equals the sum of the confidence of all its attribute values, i.e.: d(t) = ∑_{j=1}^{n} c(t^{A_j})

The confidence of a data source equals the average confidence of all records it contains, i.e. the sum of the confidence of all its records divided by the number of records: λ_i = (1/|D_i|) ∑_{t∈D_i} d(t)

Meanwhile, let D′ be the set of data sources that provide attribute value t^{A_j} for attribute A_j; then the confidence of t^{A_j} is the sum of the confidence of all data sources providing that value, multiplied by its own previous confidence: c(t^{A_j}) ← c(t^{A_j}) · ∑_{D_k∈D′} λ_k

Embodiment: The dataset to be cleaned is shown in the table below. It contains 5 records and 5 data sources, where t_i comes from data source D_i, and describes a researcher named Mary.

Table of the dataset to be cleaned

First, inspect the dataset briefly and perform a manual preprocessing step that removes obviously unreasonable data, so that the subsequent data cleaning is more efficient and more effective.

For example, in record t₅ of the dataset above the salary is negative, which is obviously unreasonable, and the values of the following three attributes Research Area, Affiliation and Publication in this record are not meaningful either. Record t₅ can therefore be regarded as noise and deleted directly, and it does not take part in the subsequent cleaning.

Next, consider record t₄: the value of its Publication attribute is "-", which is also unreasonable data. Because the values of the other attributes of t₄ are still useful, the value of Publication can simply be changed to "null".

After this simple preprocessing, the resulting dataset is as follows:

Step 1: mine predicates from the dataset.

Mining priority predicates

For priority predicates, compute the entropy and the proportion of null values of every attribute.

Entropy formula: H(A_i) = -∑_{x∈X} p(x) log₂ p(x)

where p(x) is the proportion of attribute value x among all attribute values (null values excluded).

Taking Salary as an example, there are 3 distinct attribute values: 142k, 120k and 88k.

Since the four remaining records contain no null Salary value, one of the three values occurs twice, so the proportions are 1/2, 1/4 and 1/4, and the entropy of attribute Salary is H(Salary) = -(1/2·log₂(1/2) + 2·(1/4)·log₂(1/4)) = 1.5.

Similarly:

p_n(Salary) = p_n(ResearchArea) = 0

This gives:

Three priority predicates can be defined from the resulting less-than relations:
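The two statistics that feed the priority score, the Shannon entropy H(A_i) over non-null values and the null ratio p_n(A_i), can be computed as in the sketch below. The function names are hypothetical, and the assignment of the Salary values to records (142k, 120k, 88k, 88k) is an assumption inferred from the confidence tables, since the data table itself is not reproduced in the text. The sketch deliberately stops short of p_score, whose formula is not recoverable here.

```python
import math
from collections import Counter


def shannon_entropy(column):
    """H(A) in bits over the non-null values of one attribute column."""
    values = [v for v in column if v is not None]
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())


def null_ratio(column):
    """p_n(A): proportion of null (None) entries among all values of the column."""
    return sum(v is None for v in column) / len(column)


# Assumed Salary column of the four preprocessed records.
salary = ["142k", "120k", "88k", "88k"]
print(shannon_entropy(salary))  # 1.5
print(null_ratio(salary))       # 0.0
```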

Mining state predicates:

Obtained automatically by the first-order logic predicate mining algorithm First Order Inductive Learner (FOIL):

Mining interaction predicates:

Likewise obtained automatically by the first-order logic predicate mining algorithm First Order Inductive Learner (FOIL):

Step 2: derive the confidence of the attribute values of each entity.

Initialize the confidence of all attribute values to 0 and set the impact factor η = 1; for attributes that need to return multiple results, set the confidence threshold to 0. Then apply the corresponding predicates in order (different predicate execution orders may produce different results).

State predicates and interaction predicates both act on attribute values, and attribute values do not change. State predicates are therefore independent of one another, as are interaction predicates, and state predicates and interaction predicates are independent of each other, so they can be invoked in any order.

Priority predicates, however, act on the confidence of attribute values, so they must be applied after all state predicates and interaction predicates.

In addition, a certain order must be observed among the priority predicates themselves. So that the current attribute value confidence is up to date, the priority predicates with higher priority must be executed first, i.e. those whose two attributes have the smaller sum of p_score. For the three priority predicates of the dataset being cleaned, the p_score sums are 3.5, 3.5 and 4.11 respectively, so the priority predicates are executed in that order.

After the state predicates have been executed, the confidence of all attribute values is as shown in Table 1. There are 4 records here, which are compared pairwise, 16 comparisons in total. Taking t₁ and t₂ as an example, they satisfy the state predicate on Salary, so the confidence of the Salary value of t₂ is reduced by 1.

Table 1

Record	Salary	Research Area	Affiliation	Publication
t₁(D₁)	0	0	0	0
t₂(D₂)	-1	0	0	0
t₃(D₃)	-2	0	0	0
t₄(D₄)	-2	0	0	0
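The Salary column of Table 1 can be reproduced with a small sketch of the state predicate update. The function name `apply_state_predicate` is hypothetical; the sketch assumes that the Salary state predicate uses P₁ (a larger salary is better), that the extra condition φ is always satisfied, and that the four records carry 142, 120, 88 and 88 (in thousands). None of these specifics is recoverable from the text itself.

```python
def apply_state_predicate(values, conf, better, phi=lambda i, j: True, eta=1.0):
    """Penalize the lower-quality value for every ordered pair satisfying the predicate.

    values: attribute values of the records on A_k.
    conf: current confidences of those values.
    better(v_i, v_j): the comparison predicate P.
    phi(i, j): the extra pairwise condition on records i and j.
    """
    updated = list(conf)
    n = len(values)
    # All n * n ordered pairs are enumerated (16 for 4 records); a strict
    # comparison such as P1 can never fire on a record paired with itself.
    for i in range(n):
        for j in range(n):
            if phi(i, j) and better(values[i], values[j]):
                updated[j] -= eta  # t_j is the lower-quality record
    return updated


salary = [142, 120, 88, 88]  # assumed values, in thousands
table1_salary = apply_state_predicate(salary, [0, 0, 0, 0], better=lambda a, b: a > b)
print(table1_salary)  # [0, -1.0, -2.0, -2.0]
```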

After the interaction predicates have been executed, the confidence of all attribute values is as shown in Table 2. Here, according to the interaction predicates, the confidence of every Affiliation and Publication value that is null is reduced by 1.

Table 2

Record	Salary	Research Area	Affiliation	Publication
t₁(D₁)	0	0	0	0
t₂(D₂)	-1	0	0	-1
t₃(D₃)	-2	0	0	0
t₄(D₄)	-2	0	-1	-1
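The interaction predicate update behind Table 2 can be sketched as below. The function name and the record contents are hypothetical placeholders; only the positions of the null values (Affiliation null in t₄, Publication null in t₂ and t₄) are taken from the tables. The sketch assumes one interaction predicate per attribute, each with the condition "this attribute is null".

```python
def apply_interaction_predicate(records, conf, delta, attrs, eta=1.0):
    """Subtract eta from the confidence of attrs for every record satisfying delta.

    records: list of dicts mapping attribute name to value (None for null).
    conf: list of dicts mapping attribute name to confidence.
    delta(record): the condition of the interaction predicate.
    """
    updated = [dict(c) for c in conf]
    for record, c in zip(records, updated):
        if delta(record):
            for attr in attrs:
                c[attr] -= eta
    return updated


# Placeholder records; only the null positions matter for this example.
records = [
    {"Affiliation": "aff_1", "Publication": "pub_1"},
    {"Affiliation": "aff_2", "Publication": None},
    {"Affiliation": "aff_3", "Publication": "pub_3"},
    {"Affiliation": None, "Publication": None},
]
conf = [{"Affiliation": 0, "Publication": 0} for _ in records]
for attr in ("Affiliation", "Publication"):
    conf = apply_interaction_predicate(
        records, conf, lambda r, a=attr: r[a] is None, [attr]
    )
print([c["Affiliation"] for c in conf])  # [0, 0, 0, -1.0]
print([c["Publication"] for c in conf])  # [0, -1.0, 0, -1.0]
```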

After the priority predicates have been executed, the confidence of all attribute values is as shown in Table 3. Taking Research Area as an example, the initial Research Area column is [0, 0, 0, 0]. According to the priority predicate Prior(Salary, Research Area), the Research Area column can be updated from the Salary column [0, -1, -2, -2]: the values with identical confidence in the Research Area column are reordered in ascending order of the Salary column, 0, 1 and 2 are added in that order, and after the original order is restored the result is [2, 1, 0, 0]. The remaining two priority predicates are then executed in the same way.

Table 3

Record	Salary	Research Area	Affiliation	Publication
t₁(D₁)	0	2	2	1
t₂(D₂)	-1	1	1	-1
t₃(D₃)	-2	0	0	0
t₄(D₄)	-2	0	-1	-1
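The priority predicate step can be sketched as follows. The function name `apply_priority_predicate` is hypothetical. Two readings of the example are built in as assumptions: records that tie on the higher-priority attribute share the same rank (which is what yields the increments 0, 1, 2 for four records), and for multi-valued attributes groups with negative confidence are skipped.

```python
from collections import defaultdict


def apply_priority_predicate(conf_i, conf_j, multi_valued=False):
    """Execute Prior(A_i, A_j): break ties in conf_j using conf_i.

    conf_i: confidences on the higher-priority attribute A_i.
    conf_j: confidences on the lower-priority attribute A_j.
    """
    # Group record indices by their current confidence on A_j.
    groups = defaultdict(list)
    for idx, c in enumerate(conf_j):
        groups[c].append(idx)

    updated = list(conf_j)
    for c, members in groups.items():
        if multi_valued and c < 0:
            continue  # negative values of multi-valued attributes are left alone
        # Ascending distinct confidences on A_i; position n-1 is the bonus.
        levels = sorted({conf_i[m] for m in members})
        for m in members:
            updated[m] += levels.index(conf_i[m])
    return updated


salary = [0, -1, -2, -2]
print(apply_priority_predicate(salary, [0, 0, 0, 0]))  # [2, 1, 0, 0]
```

Applying the same function with the Salary column to the Publication column of Table 2, `apply_priority_predicate(salary, [0, -1, 0, -1], multi_valued=True)`, gives [1, -1, 0, -1], which agrees with Table 3; that Salary is the higher-priority attribute of that predicate is an assumption, since the predicate itself is not legible in the text.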

Then all attributes are labeled. A researcher can have only one salary and one affiliation at a time, so Salary and Affiliation return only one result each, whereas a researcher may have several research fields and several works, so Research Area and Publication should return multiple result values. For multi-valued attributes, all attribute values with confidence greater than or equal to the threshold 0 are returned at this point; that is, for Research Area the result is { Data Integration, data cleaning Data cleaning&Google Knowledge management Information retrieve }, and for Publication the result is { Data integration, A diagnostic tool for data errors }.

Step 3: compute the data source confidence.

The confidence values of all attribute values are then mapped into (0, 1) by a mapping function and normalized; the result is shown in Table 4. The confidence of all data sources is computed according to λ_i = (1/|D_i|) ∑_{t∈D_i} d(t) and normalized so that ∑_i λ_i = 1.

Table 4

Record	Salary	Research Area	Affiliation	Publication	λ
t₁(D₁)	0.496353	0.33723	0.369959	0.413275	0.404204
t₂(D₂)	0.26698	0.2799	0.307065	0.152035	0.251495
t₃(D₃)	0.118333	0.191435	0.210014	0.282655	0.200609
t₄(D₄)	0.118333	0.191435	0.112963	0.152035	0.143692
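The values in Table 4 are numerically consistent with applying the logistic sigmoid 1/(1 + e^(-x)) to each Table 3 confidence and then normalizing every column to sum to 1. The sketch below works under that assumption; the mapping formula itself is not legible in the text, so the sigmoid is an inference from the numbers rather than a confirmed part of the method. Each data source contributes exactly one record here, so the average in the λ formula reduces to the record confidence.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real confidence into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def normalize(xs):
    """Scale a list of non-negative numbers so that they sum to 1."""
    total = sum(xs)
    return [x / total for x in xs]


# Table 3: confidences after all three predicate classes, one list per attribute.
table3 = {
    "Salary": [0, -1, -2, -2],
    "Research Area": [2, 1, 0, 0],
    "Affiliation": [2, 1, 0, -1],
    "Publication": [1, -1, 0, -1],
}

# Map into (0, 1) and normalize each attribute column.
table4 = {attr: normalize([sigmoid(c) for c in col]) for attr, col in table3.items()}

# d(t): sum of the attribute value confidences of each record.
record_conf = [sum(table4[attr][r] for attr in table4) for r in range(4)]

# One record per source, so lambda is the normalized record confidence.
lam = normalize(record_conf)

print([round(x, 6) for x in table4["Salary"]])
print([round(x, 6) for x in lam])
```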

Finally, the confidence of all attribute values is updated iteratively with the data source confidence formula and the attribute value update formula until the confidence of all attribute values converges. Note that each time the attribute value confidence of a column is updated, the confidence of that column must be normalized. Taking the first update as an example, for the attribute value confidence of the Salary column:

[0.496353, 0.26698, 0.118333, 0.118333]

→ [0.496353 × 0.404204, 0.26698 × 0.251495, 0.118333 × 0.344301, 0.118333 × 0.344301]

→ [0.200628, 0.0671441, 0.0407422, 0.0407422]

Similarly, for the attribute value confidence of the Research Area column:

[0.33723, 0.2799, 0.191435, 0.191435] → [0.500009, 0.258216, 0.140871, 0.100903]

For the attribute value confidence of the Affiliation column:

[0.369959, 0.307065, 0.210014, 0.112963] → [0.404204, 0.251495, 0.200609, 0.143692]

For the attribute value confidence of the Publication column:

[0.413275, 0.152035, 0.282655, 0.152035] → [0.588542, 0.134713, 0.199777, 0.0769686]

Finally, the data source confidence is updated:

λ = [0.5168, 0.209168, 0.164478, 0.109554]

The above process is repeated until convergence; the final result is shown in Table 5.
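The alternating update can be sketched as below, under stated assumptions: record r comes from data source r (one record per source, as in the embodiment), null is represented by `None` and is supported only by its own source, and all attribute values other than those named in the text are hypothetical placeholders. The function names are illustrative and not taken from the source.

```python
def normalize(xs):
    total = sum(xs)
    return [x / total for x in xs]


def update_column(conf, values, lam):
    """Multiply each value's confidence by the summed confidence of its supporting sources."""
    updated = []
    for r, (c, v) in enumerate(zip(conf, values)):
        if v is None:
            support = lam[r]  # a null value counts only its own source
        else:
            support = sum(lam[k] for k, w in enumerate(values) if w == v)
        updated.append(c * support)
    return updated


def source_confidence(columns):
    """Normalized per-source confidence; each source holds exactly one record."""
    n = len(next(iter(columns.values())))
    return normalize([sum(col[r] for col in columns.values()) for r in range(n)])


def iterate(columns, values, tol=1e-9, max_iter=1000):
    """Alternate source and value updates until the value confidences stop changing."""
    lam = source_confidence(columns)
    for _ in range(max_iter):
        new = {a: normalize(update_column(columns[a], values[a], lam)) for a in columns}
        change = max(abs(x - y) for a in columns for x, y in zip(new[a], columns[a]))
        columns, lam = new, source_confidence(new)
        if change < tol:
            break
    return columns, lam


columns = {  # Table 4
    "Salary": [0.496353, 0.26698, 0.118333, 0.118333],
    "Research Area": [0.33723, 0.2799, 0.191435, 0.191435],
    "Affiliation": [0.369959, 0.307065, 0.210014, 0.112963],
    "Publication": [0.413275, 0.152035, 0.282655, 0.152035],
}
values = {  # placeholders except 142k/120k/88k and Amazon
    "Salary": ["142k", "120k", "88k", "88k"],
    "Research Area": ["ra_1", "ra_2", "ra_3", "ra_4"],
    "Affiliation": ["Amazon", "aff_2", "aff_3", None],
    "Publication": ["pub_1", None, "pub_3", None],
}
lam0 = [0.404204, 0.251495, 0.200609, 0.143692]
first_salary = update_column(columns["Salary"], values["Salary"], lam0)
final_columns, final_lam = iterate(columns, values)
print([round(x, 6) for x in first_salary])
print([round(x, 3) for x in final_lam])
```

The first printed list is the unnormalized Salary update of the first iteration; after convergence nearly all of the confidence mass sits on D₁, which is consistent with Table 5.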

Table 5

Record	Salary	Research Area	Affiliation	Publication	λ
t₁(D₁)	1	1	1	1	1
t₂(D₂)	0	0	0	0	0
t₃(D₃)	0	0	0	0	0
t₄(D₄)	0	0	0	0	0

Step 4: obtain the result.

According to Table 5, the best attribute values of the Salary and Affiliation attributes can be selected, i.e. the attribute values with the highest confidence are the result. The result for Salary is {142k} and the result for Affiliation is {Amazon}.

The above is only a preferred embodiment of the present invention. It should be noted that a person of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the present invention.

Claims

1. A predicate-based multi-source dataset cleaning method, characterized by comprising the steps:

A state predicate Stat(A_k) is built from a condition P(t_i^{A_k}, t_j^{A_k}) and a condition φ(t_i, t_j), where t_i denotes record i, t_i^{A_k} denotes the value of attribute A_k in record i, P(t_i^{A_k}, t_j^{A_k}) denotes a predefined condition between t_i^{A_k} and t_j^{A_k}, and φ(t_i, t_j) denotes a predefined condition between t_i and t_j; Stat(A_k) indicates that when t_i and t_j satisfy conditions P and φ, the quality of t_i is higher than that of t_j;

An interaction predicate is Inter_δ(A₁, ..., A_l), indicating that when a record satisfies condition δ, the values of its attributes A₁, ..., A_l are of poor quality;

(2) Using the predicate model defined in step (1), mine predicates from the dataset to be cleaned, obtaining the priority predicates, state predicates and interaction predicates of the dataset;

(3-1) Initialize the confidence of all attribute values of all records in the dataset to 0, and set an impact factor η for the attribute values of the records, where η is a constant;

(3-2) Update the confidence of every attribute value of every record with the state predicates and the interaction predicates; the state predicates may be applied first and the interaction predicates second, or the other way round;

The step of updating attribute value confidence with state predicates is: enumerate all pairs of records t_i and t_j in the dataset; if t_i and t_j satisfy a state predicate Stat(A_k) on attribute A_k, subtract η from the confidence of the lower-quality attribute value t_j^{A_k};

The step of updating attribute value confidence with interaction predicates is: traverse all records in the dataset; if a record satisfies an interaction predicate Inter_δ(A₁, ..., A_l), subtract η from the confidence of the values of its attributes A₁, ..., A_l;

(3-3) After step (3-2) is completed, update the attribute value confidence of every record with the priority predicates, executing the priority predicates in order of priority from high to low;

The step of executing a priority predicate Prior(A_i, A_j) is: if several records have the same attribute value confidence on attribute A_j, sort them in ascending order of their A_i attribute value confidence, and add n-1 to the A_j attribute value confidence of the record ranked n-th;

(3-4) After the confidence of all attribute values has been obtained, for multi-valued attributes, return all attribute values whose confidence is greater than or equal to a preset threshold as the result; for attributes that need to return only one result, execute steps (4) to (6);

(4) Normalize the confidence of all attribute values; compute the confidence of all data sources in the dataset to be cleaned according to the formula λ_i = (1/|D_i|) ∑_{t∈D_i} d(t), where λ_i denotes the confidence of data source D_i, t denotes a record in D_i, and d(t) denotes the confidence of record t, which equals the sum of the confidence of all attribute values of that record;

(5) Update the confidence of each attribute value according to the formula c(t^{A_j}) ← c(t^{A_j}) · ∑_{D_k∈D′} λ_k, where c(t^{A_j}) denotes the confidence of attribute value t^{A_j} and D′ denotes the set of data sources that provide the attribute value t^{A_j} for attribute A_j; return to step (4) after the update;

(6) Repeat steps (4) and (5) until the confidence of all attribute values converges; for attributes that need to return only one result, the attribute value with the highest confidence under the attribute is the final result.

2. The predicate-based multi-source dataset cleaning method according to claim 1, characterized in that the priority predicate is defined as follows: for attributes A_i and A_j, if p_score(A_i) < p_score(A_j), define the priority predicate Prior(A_i, A_j), indicating that the priority of attribute A_i is higher than that of attribute A_j; here p_score(A_i) is computed from H(A_i), the Shannon entropy of attribute A_i, and p_n(A_i), the proportion of null values among all values of attribute A_i.

3. The predicate-based multi-source dataset cleaning method according to claim 2, characterized in that the state predicates and interaction predicates are obtained by a first-order logic predicate mining method.

4. The predicate-based multi-source dataset cleaning method according to claim 3, characterized in that, before the dataset is cleaned, all attributes of all datasets are manually labeled to indicate whether each attribute needs to return one result or multiple results; if an attribute needs to return only one result, it is labeled a single-valued attribute, and during cleaning the attribute value with the highest confidence under it is the final result; if an attribute may have multiple results, it is labeled a multi-valued attribute, and during cleaning all attribute values under it whose confidence is greater than the preset threshold are the final result.