CN108776697A - Predicate-based multi-source data set cleaning method

Info

Publication number
CN108776697A
Authority
CN
China
Prior art keywords
data
predicate
attribute
confidence level
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810578708.3A
Other languages
Chinese (zh)
Other versions
CN108776697B (en)
Inventor
谢子哲
李论
刘奇志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN201810578708.3A priority Critical patent/CN108776697B/en
Publication of CN108776697A publication Critical patent/CN108776697A/en

Abstract

The present invention provides a predicate-based method for cleaning multi-source data sets that effectively identifies the most reliable data items in homogeneous multi-source data sets. It relates to fields such as data cleaning and data fusion. The method comprises: 1) mining predicates automatically and filtering the mined predicates; 2) deriving the confidence of each attribute value of each entity in the data set from the predicates; 3) establishing the relationship between attribute-value confidence and data-source confidence, and computing the data-source confidence; 4) combining data-source confidence and attribute-value confidence to find the data items with the highest confidence. Given multiple data sources, the invention analyzes information that comes from different sources but describes the same content, filters out redundant, erroneous, and outdated data, and keeps the data with the highest confidence. This lays a foundation for subsequent data analysis and is significant for the efficiency and accuracy of subsequent data processing.

Description

Predicate-based multi-source data set cleaning method
Technical field
The present invention relates to fields such as data cleaning and data fusion, and in particular to a predicate-based method for cleaning multi-source data sets.
Background
In the information age, descriptions of the same event or object can be found in a large number of data sources. At the same time, because of timing errors, format errors, and problems of accuracy, completeness, and semantic ambiguity, different data sources describe the same entity inconsistently. After data has been collected from different sources, resolving the inconsistencies among data that describe the same entity is essential for subsequent data analysis. A simple voting strategy, which selects the description supported by the most data sources, is not suited to today's Web environment. A more sophisticated cleaning strategy has to take into account data-source confidence, the confidence of the data itself, and prior knowledge. Existing cleaning strategies mainly include the following:
Chinese patent application No. 201410387772 discloses a "bus road-condition processing system and method based on traffic multi-source data fusion". It fuses traffic data describing bus road conditions from different data sources to obtain road-condition information for display. Its input is domain-specific traffic data; it does not judge credibility with predicates, nor does it compute data-source confidence from the relationship between data and data sources.
Chinese patent application No. 201110369877 discloses "a multi-source data integration platform and its construction method". It manages different data, and no consistency problem exists among those data.
US Patent No. 8190546 discloses "Dependency between sources in truth discovery". It builds a probabilistic graphical model from the copying relationships between data sources to assess the confidence of data sources and data; it does not involve assessing data confidence with predicates.
Summary of the invention
Object of the invention: Multi-source data fusion suffers from inconsistency among data describing the same entity. In particular, it is difficult to determine initial data-confidence values in the multi-source consistency problem and to combine data-source confidence with data confidence. To address this, the invention provides a multi-source data set cleaning method based on data-source confidence and data confidence. It evaluates data confidence with predicates, computes data-source confidence from data confidence, and finally finds the data with the highest confidence, thereby cleaning the data.
Technical solution: To achieve the above technical effect, the invention proposes a predicate-based multi-source data set cleaning method comprising the following steps:
(1) Build the predicate model: define priority predicates, state predicates, and interaction predicates, where:
A priority predicate has the form $\mathrm{Prior}(A_i, A_j)$ and indicates that attribute $A_i$ has higher priority than attribute $A_j$;
A state predicate has the form $\mathrm{Stat}(A_k): P(t_i[A_k], t_j[A_k]) \wedge \varphi(t_i, t_j)$, where $t_i$ denotes record $i$, $t_i[A_k]$ denotes the value of attribute $A_k$ in record $i$, $P(t_i[A_k], t_j[A_k])$ denotes a predefined condition between $t_i[A_k]$ and $t_j[A_k]$, and $\varphi(t_i, t_j)$ denotes a predefined condition between $t_i$ and $t_j$; $\mathrm{Stat}(A_k)$ indicates that when $t_i$ and $t_j$ satisfy conditions $P$ and $\varphi$, the quality of $t_i$ is higher than that of $t_j$;
An interaction predicate has the form $\mathrm{Inter}_\delta(A_1, \dots, A_l)$ and indicates that when a record satisfies condition $\delta$, the values of its attributes $A_1, \dots, A_l$ are of poor quality;
(2) Using the predicate model defined in step (1), mine predicates from the data set to be cleaned to obtain the priority predicates, state predicates, and interaction predicates of the data set;
(3) Derive the confidence of the attribute values of every record in the data set from the mined predicates, comprising the steps:
(3-1) Initialize the confidence of all attribute values of all records in the data set to 0, and set an impact factor $\eta$ for the attribute values of every record, where $\eta$ is a constant;
(3-2) Update the confidence of each attribute value of every record with the state predicates and the interaction predicates. The update may apply the state predicates first and the interaction predicates second, or the interaction predicates first and the state predicates second;
Updating attribute-value confidence with a state predicate: enumerate the records of the data set pairwise as $t_i$ and $t_j$; if $t_i$ and $t_j$ satisfy the state predicate $\mathrm{Stat}(A_k)$ on attribute $A_k$, subtract $\eta$ from the confidence of the lower-quality value $t_j[A_k]$;
Updating attribute-value confidence with an interaction predicate: traverse all records of the data set; if a record satisfies some interaction predicate $\mathrm{Inter}_\delta(A_1, \dots, A_l)$, subtract $\eta$ from the confidence of the values of its attributes $A_1, \dots, A_l$;
(3-3) After step (3-2) is complete, update the attribute-value confidence of every record with the priority predicates, executing the priority predicates in order of priority from high to low;
Executing a priority predicate $\mathrm{Prior}(A_i, A_j)$: if several records have the same confidence for their values of attribute $A_j$, sort them in ascending order of the confidence of their $A_i$ values and, following the sorted order, add $n-1$ to the confidence of the $A_j$ value of the record ranked $n$-th;
(3-4) Once the confidence of all attribute values has been obtained, for multi-valued attributes return all attribute values whose confidence is greater than or equal to a preset threshold as the result; for attributes that return only one result, execute steps (4) to (6);
(4) Normalize the confidence of all attribute values, then compute the confidence of every data source in the data set to be cleaned according to $\lambda_i = \frac{1}{|D_i|} \sum_{t \in D_i} d(t)$, where $\lambda_i$ denotes the confidence of data source $D_i$, $t$ denotes a record in data source $D_i$, and $d(t)$ denotes the confidence of record $t$, which equals the sum of the confidences of all attribute values of that record;
(5) Update the confidence of each attribute value by multiplying its current confidence by the sum of the confidences of the data sources in $D'$, where $D'$ denotes the set of data sources that provide that value for attribute $A_j$; return to step (4) after the update;
(6) Repeat steps (4) and (5) until the confidence of all attribute values converges. For attributes that return only one result, the attribute value with the highest confidence under that attribute is the final result.
Further, the priority predicates are defined as follows: for attributes $A_i$ and $A_j$, if $p_{score}(A_i) < p_{score}(A_j)$, define the priority predicate $\mathrm{Prior}(A_i, A_j)$, indicating that $A_i$ has higher priority than $A_j$ (a smaller $p_{score}$ means a higher priority). Here $p_{score}(A_i)$ is computed from $H(A_i)$, the Shannon entropy of attribute $A_i$, and $p_n(A_i)$, the proportion of null values among all values of attribute $A_i$.
Further, the state predicates and interaction predicates are obtained by a first-order-logic predicate mining method.
Further, before the data set is cleaned, all attributes of all data sets are labeled manually to mark whether each attribute returns one result or several. If an attribute returns only one result, it is marked as a single-valued attribute, and during cleaning the value with the highest confidence under that attribute is the final result. If an attribute may have several results, it is marked as a multi-valued attribute, and during cleaning all values under that attribute whose confidence is greater than or equal to the preset threshold are the final result.
Advantageous effects: Compared with the prior art, the invention has the following advantages:
The method does not assume that an attribute has only one correct value, does not rely on crowdsourcing, and needs little manual intervention. It uses automatically mined predicates and the relationship between the data set and its attribute values to find highly credible attribute values. The invention scores the confidence of attribute values with mined, user-defined predicates. For multi-answer attributes, it returns the attribute values whose confidence meets the preset threshold. For the remaining attributes, it further updates attribute-value confidence using the relationship between data-source confidence and attribute-value confidence and returns the value with the highest confidence. This is significant for improving the efficiency and accuracy of data analysis. With the technical solution of the invention, engineers can implement the corresponding software relatively easily.
Description of the drawings
Fig. 1 is the flowchart of the invention;
Fig. 2 is a schematic diagram of the computation that updates data-source confidence and attribute-value confidence in the invention.
Detailed description
The invention is further described below with reference to the drawings.
Fig. 1 shows the flowchart of the invention, which mainly consists of the following parts:
A) First, define three kinds of predicates:
1) Priority predicate: for attributes $A_i$ and $A_j$, if $p_{score}(A_i) < p_{score}(A_j)$, define a priority predicate $\mathrm{Prior}(A_i, A_j)$, indicating that $A_i$ has higher priority than $A_j$ (a smaller $p_{score}$ means a higher priority). Here $p_{score}(A_i)$ is computed from $H(A_i)$, the Shannon entropy of attribute $A_i$, and $p_n(A_i)$, the proportion of null values among all values of attribute $A_i$.
$H(A_i)$ is computed as $H(A_i) = -\sum_{x \in X} p(x) \log_2 p(x)$, where $X$ is the value domain of attribute $A_i$ and $p(x)$ is the proportion of attribute value $x$ among all values of the attribute (null values excluded).
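The two quantities that feed the priority score can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, and the sample Salary column assumes four records holding 142k, 120k, 88k, and 88k, which is what the worked example later in the text implies.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """H(A) in bits, computed over the non-null values of one attribute."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    counts = Counter(non_null)
    total = len(non_null)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def null_ratio(values):
    """p_n(A): fraction of the attribute's values that are null."""
    return sum(v is None for v in values) / len(values) if values else 0.0

# Assumed column contents: three distinct values, one of them shared by two records.
salary = ["142k", "120k", "88k", "88k"]
print(shannon_entropy(salary))  # 1.5
print(null_ratio(salary))       # 0.0
```

How the two quantities combine into $p_{score}$ is not shown here, because that formula is not reproduced in the text.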
2) State predicate: a state predicate is a first-order-logic predicate of the form:
$\mathrm{Stat}(A_k): P(t_i[A_k], t_j[A_k]) \wedge \varphi(t_i, t_j)$, indicating that if $t_i$ and $t_j$ satisfy conditions $P$ and $\varphi$, the quality of $t_i$ is higher than that of $t_j$.
In the definition above, the condition $\varphi$ is composed of terms $f_i(v_1, v_2)$, each of which can be replaced by $v_1 = v_2$ or $v_1 \neq v_2$. The $P$ in the state-predicate definition is replaced by one of six predefined predicates: $P_1(v_1, v_2)$, $P_2(v_1, v_2)$, $P_3(v_1, v_2)$, $P_4(v_1, v_2)$, $P_5(v_1, v_2)$, and $P_6(v_1, v_2)$. $P_1$ and $P_2$ apply to numeric attribute values: $P_1(v_1, v_2)$ means $v_1$ is greater than $v_2$, and $P_2(v_1, v_2)$ means $v_1$ is less than $v_2$. $P_3$ and $P_4$ apply to string attribute values: $P_3(v_1, v_2)$ means $v_1$ is longer than $v_2$, and $P_4(v_1, v_2)$ means $v_1$ is shorter than $v_2$. $P_5$ and $P_6$ also apply to string attribute values; a string is more detailed if it contains more information, and the information content of two strings is compared with the Shannon entropy formula: $P_5(v_1, v_2)$ means $v_1$ is more detailed than $v_2$, and $P_6(v_1, v_2)$ means $v_1$ is simpler than $v_2$.
3) Interaction predicate: an interaction predicate is a first-order-logic predicate of the form $\mathrm{Inter}_\delta(A_1, \dots, A_l)$, indicating that if a record satisfies condition $\delta$, the values of its attributes $A_1, \dots, A_l$ are of poor quality.
In the definition above, the condition $\delta$ is built from predicates $P_i'$, each of which can be replaced by any of $P_1$ to $P_6$ or by one of the following four predicates: $P_7(v_1, v_2)$, $P_8(v_1, v_2)$, $P_9(v_1, v_2)$, and $P_{10}(v_1, v_2)$. $P_7$ and $P_8$ apply to string attribute values: $P_7(v_1, v_2)$ means $v_1$ contains $v_2$, and $P_8(v_1, v_2)$ means $v_1$ does not contain $v_2$. $P_9$ and $P_{10}$ apply to both string and numeric attribute values: $P_9(v_1, v_2)$ means $v_1$ equals $v_2$, and $P_{10}(v_1, v_2)$ means $v_1$ does not equal $v_2$.
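The ten base predicates described above can be sketched as plain comparison functions. This is only an illustrative sketch: the name `PREDICATES` is hypothetical, and measuring the "information content" of a string by its character-level Shannon entropy is an assumption, since the text does not say at which granularity the entropy is taken.

```python
import math
from collections import Counter

def char_entropy(s):
    """Character-level Shannon entropy of a string (assumed measure for P5/P6)."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values()) if n else 0.0

PREDICATES = {
    "P1": lambda v1, v2: v1 > v2,                              # numeric: greater
    "P2": lambda v1, v2: v1 < v2,                              # numeric: smaller
    "P3": lambda v1, v2: len(v1) > len(v2),                    # string: longer
    "P4": lambda v1, v2: len(v1) < len(v2),                    # string: shorter
    "P5": lambda v1, v2: char_entropy(v1) > char_entropy(v2),  # string: more detailed
    "P6": lambda v1, v2: char_entropy(v1) < char_entropy(v2),  # string: simpler
    "P7": lambda v1, v2: v2 in v1,                             # string: contains
    "P8": lambda v1, v2: v2 not in v1,                         # string: does not contain
    "P9": lambda v1, v2: v1 == v2,                             # equal
    "P10": lambda v1, v2: v1 != v2,                            # not equal
}

print(PREDICATES["P1"](142, 120))                  # True
print(PREDICATES["P7"]("data cleaning", "clean"))  # True
```

Keeping the predicates in a lookup table makes it easy for a mining step to enumerate candidate conditions by name.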
Predicate mining is then performed on the data set. Priority predicates are obtained by computing the priority of every attribute in the data set with the $p_{score}$ formula. State predicates and interaction predicates are obtained automatically from their first-order-logic definitions by first-order inductive learning. After the state and interaction predicates have been obtained, a domain expert may be asked to remove invalid predicates in order to further improve their usability and arrive at the final set of usable predicates;
B) First initialize the confidence of all attribute values to 0 and manually set the parameter $\eta$ (the impact factor) to a real number. Then apply the three classes of predicates in the following order to derive the confidence of each attribute value of each entity in the data set:
1) Apply the state predicates: enumerate the records of the data set pairwise; if two records $t_i$ and $t_j$ satisfy some state predicate $\mathrm{Stat}(A_k)$, subtract $\eta$ from the confidence of the lower-quality value $t_j[A_k]$.
2) Apply the interaction predicates: traverse all records of the data set; if a record satisfies some interaction predicate $\mathrm{Inter}_\delta(A_1, \dots, A_l)$, subtract $\eta$ from the confidence of the values of its attributes $A_1, \dots, A_l$.
3) Apply the priority predicates: the priority of an attribute is determined by its $p_{score}$. So that the confidence of the current attribute values is up to date, the priority predicates with higher priority are executed first, that is, those whose two attributes have the smaller sum of $p_{score}$. The purpose of a priority predicate $\mathrm{Prior}(A_i, A_j)$ is this: if two records $t_1$ and $t_2$ have the same confidence for their $A_j$ values, the confidence of their $A_i$ values can be used to judge which of $t_1[A_j]$ and $t_2[A_j]$ is better. For all records that have the same confidence on $A_j$, sort them in ascending order of their confidence on $A_i$ and, following the sorted order, add $n-1$ to the confidence of the $A_j$ value of the record ranked $n$-th. In this way the higher-priority attribute $A_i$ distinguishes $A_j$ values of equal confidence. Note that for attributes that return multiple results, values with negative confidence need no priority predicate.
Once the confidence of all attribute values has been obtained, for attributes that return multiple results, return all attribute values whose confidence is greater than or equal to the preset threshold as the result. For attributes that return a single result, continue with the following steps.
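The thresholding rule for multi-valued attributes can be sketched as follows. The function name, the sample values, and the choice to skip null values and duplicates are assumptions made for illustration only.

```python
def select_values(values, confidences, threshold=0.0):
    """Return the distinct non-null values whose confidence is >= threshold."""
    selected = []
    for value, conf in zip(values, confidences):
        if value is not None and conf >= threshold and value not in selected:
            selected.append(value)
    return selected

# Hypothetical multi-valued column: two real values and two nulls.
values = ["x", None, "y", None]
confidences = [1, -1, 0, -1]
print(select_values(values, confidences))  # ['x', 'y']
```

A value whose confidence equals the threshold is kept, matching the "greater than or equal to" wording of the text.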
C) Compute the data-source confidence as $\lambda_i = \frac{1}{|D_i|} \sum_{t \in D_i} d(t)$ and normalize the confidence of all data sources so that $\sum_i \lambda_i = 1$, as shown in Fig. 2. Then update the confidence of each attribute value: the new confidence equals the sum of the confidences of the data sources that provide this value multiplied by the value's own previous confidence. Note that for a null value, the data-source confidence used is only that of the record's own data source and does not include the confidence of other data sources that also provide null. Next, normalize the confidence of all attribute values as well, so that for any attribute the confidences of all values it may take sum to 1. Repeat these steps until the confidence of the data sources and of every attribute value converges.
D) Finally, for attributes that return only one result, the attribute value with the highest confidence under that attribute is the final result; together with the results from B), these form the final output.
An embodiment of the invention is described below with a concrete example:
Let the confidence of a data source $D_i$ be $\lambda_i$, let the attribute set be $\{A_1, \dots, A_n\}$, and let $t \in D_i$ be a record of this data source, where $t[A_j]$ denotes the value of $t$ for attribute $A_j$. Let $d(t)$ denote the confidence of the record. The confidence of a record equals the sum of the confidences of all its attribute values.
The confidence of a data source equals the average confidence of all the records it contains, that is, the sum of the confidences of all its records divided by the number of records: $\lambda_i = \frac{1}{|D_i|} \sum_{t \in D_i} d(t)$.
In addition, let $D'$ be the set of data sources that provide the value $t[A_j]$ for attribute $A_j$. The confidence of the value $t[A_j]$ is then the sum of the confidences of all data sources that provide the value, multiplied by the value's own previous confidence.
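The three relations stated in words above can be written compactly as follows. The symbol $c(\cdot)$ for the confidence of an attribute value is an assumed notation introduced only for this summary, because the original symbol is not reproduced in the text; the remaining symbols follow the surrounding definitions.

```latex
\begin{align}
d(t) &= \sum_{j=1}^{n} c\bigl(t[A_j]\bigr) \\
\lambda_i &= \frac{1}{|D_i|} \sum_{t \in D_i} d(t) \\
c\bigl(t[A_j]\bigr) &\leftarrow c\bigl(t[A_j]\bigr) \cdot \sum_{D_k \in D'} \lambda_k
\end{align}
```

The last line is an update rule rather than an equality: the right-hand side uses the value's confidence from the previous iteration.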
Embodiment: The data set to be cleaned is shown in the table below. It contains 5 records from 5 data sources, where $t_i$ comes from data source $D_i$, and it describes a researcher named Mary.
Data set to be cleaned
First, the data set is inspected briefly and preprocessed manually to remove obviously unreasonable data, so that the subsequent cleaning is more efficient and gives better results.
For example, in record $t_5$ of the data set above the salary is negative, which is clearly unreasonable, and the values of the three attributes Research Area, Affiliation, and Publication in the same record are not meaningful either. Record $t_5$ can therefore be treated as noise and deleted directly, and it takes no part in the subsequent cleaning.
Next, consider record $t_4$. The value of its Publication attribute is "-", which is also unreasonable data. Because the values of the other attributes of $t_4$ are still useful, the Publication value can simply be changed to "null".
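The manual preprocessing just described can be approximated by a small routine. This is a simplified sketch under stated assumptions: the function name and the sample records are hypothetical (the original data table is not reproduced in the text), salaries are taken to be numbers, and a record is dropped on a negative salary alone, whereas the text also considers whether its other values are meaningful.

```python
def preprocess(records):
    """Drop records with a negative salary and turn '-' placeholders into None."""
    cleaned = []
    for rec in records:
        salary = rec.get("Salary")
        if salary is not None and salary < 0:
            continue  # treat the whole record as noise
        cleaned.append({k: (None if v == "-" else v) for k, v in rec.items()})
    return cleaned

# Hypothetical records standing in for t4 and t5 of the example.
records = [
    {"id": "t4", "Salary": 88, "Publication": "-"},
    {"id": "t5", "Salary": -10, "Publication": "n/a"},
]
print(preprocess(records))
# [{'id': 't4', 'Salary': 88, 'Publication': None}]
```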
After the simple preprocessing above, the resulting data set is as follows:
Step 1: mine predicates from the data set.
Mining priority predicates
For priority predicates, compute the entropy and the proportion of null values of every attribute.
Entropy formula: $H(A_i) = -\sum_{x \in X} p(x) \log_2 p(x)$
where $p(x)$ is the proportion of attribute value $x$ among all values of the attribute (null values excluded);
Taking Salary as an example, there are 3 distinct attribute values: 142k, 120k, and 88k.
From the proportions of these values, the entropy of the attribute Salary can be obtained:
Similarly:
$p_n(\text{Salary}) = p_n(\text{ResearchArea}) = 0$
This gives:
Three priority predicates can be defined from the less-than relations above:
Mining state predicates:
These are obtained automatically by the first-order-logic predicate mining algorithm First Order Inductive Learner:
Mining interaction predicates:
These are likewise obtained automatically by the first-order-logic predicate mining algorithm First Order Inductive Learner:
Step 2: derive the confidence of each attribute value of each entity.
Initialize the confidence of all attribute values to 0 and set the impact factor $\eta = 1$. For attributes that return multiple results, set the confidence threshold to 0. Then apply the corresponding predicates in order (different predicate execution orders may produce different results).
State predicates and interaction predicates both act on attribute values, and the attribute values themselves do not change. State predicates are therefore independent of one another, interaction predicates are independent of one another, and state predicates are independent of interaction predicates, so they can be invoked in any order.
Priority predicates, however, act on the confidence of attribute values, so they must be applied after all state predicates and interaction predicates.
In addition, the priority predicates themselves must follow a certain order. So that the confidence of the current attribute values is up to date, the priority predicates with higher priority are executed first, that is, those whose two attributes have the smaller sum of priority scores. For the priority predicates mined from the data set above, the $p_{score}$ sums of the first two predicates are both 3.5 and that of the third is 4.11, so the priority predicates are executed in that order.
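The ordering rule for priority predicates can be sketched as below. The function name and the scores are hypothetical, and enumerating every attribute pair is an assumption: the text only states that a predicate $\mathrm{Prior}(A_i, A_j)$ exists when $p_{score}(A_i) < p_{score}(A_j)$ and that predicates with a smaller score sum run first.

```python
from itertools import permutations

def mine_priority_predicates(pscore):
    """Return (A_i, A_j) pairs with pscore[A_i] < pscore[A_j], smallest score sum first."""
    preds = [(a, b) for a, b in permutations(pscore, 2) if pscore[a] < pscore[b]]
    preds.sort(key=lambda p: pscore[p[0]] + pscore[p[1]])
    return preds

# Hypothetical priority scores for three attributes.
scores = {"A1": 1.0, "A2": 2.0, "A3": 3.5}
print(mine_priority_predicates(scores))
# [('A1', 'A2'), ('A1', 'A3'), ('A2', 'A3')]
```

Attributes with equal scores produce no predicate in either direction, since the comparison is strict.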
After the state predicates have been executed, the confidence of all attribute values is as shown in Table 1. There are 4 records, which are compared pairwise, giving 16 comparisons in total. Taking $t_1$ and $t_2$ as an example: they satisfy the state predicate on Salary with $t_1$ as the higher-quality record, so 1 is subtracted from the confidence of the Salary value of $t_2$.
Table 1
Record (source)   Salary   Research Area   Affiliation   Publication
t1 (D1)           0        0               0             0
t2 (D2)           -1       0               0             0
t3 (D3)           -2       0               0             0
t4 (D4)           -2       0               0             0
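The Salary column of Table 1 can be reproduced with a short sketch of the state-predicate update. The function name is hypothetical, and the predicate used here ("a larger salary is the higher-quality value") is an assumption inferred from Table 1, since the mined state predicate itself is not reproduced in the text.

```python
def apply_state_predicate(values, confidences, is_better, eta=1):
    """For every ordered pair (i, j), penalize value j when value i is better."""
    updated = list(confidences)
    n = len(values)
    for i in range(n):
        for j in range(n):
            if i != j and is_better(values[i], values[j]):
                updated[j] -= eta
    return updated

# Salary values in thousands for t1..t4 (t3 and t4 share the same value).
salary = [142, 120, 88, 88]
print(apply_state_predicate(salary, [0, 0, 0, 0], lambda a, b: a > b))
# [0, -1, -2, -2]
```

Two records with equal values never penalize each other, which is why t3 and t4 end with the same confidence.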
After the interaction predicates have been executed, the confidence of all attribute values is as shown in Table 2. Here, according to the interaction predicates on Affiliation and Publication, 1 is subtracted from the confidence of every Affiliation value and every Publication value that is null.
Table 2
Record (source)   Salary   Research Area   Affiliation   Publication
t1 (D1)           0        0               0             0
t2 (D2)           -1       0               0             -1
t3 (D3)           -2       0               0             0
t4 (D4)           -2       0               -1            -1
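The step from Table 1 to Table 2 can be sketched as a null-penalizing pass. The function name and the non-null placeholder strings are hypothetical; only the positions of the null values are taken from Table 2.

```python
def apply_null_interaction(records, confidences, attributes, eta=1):
    """Subtract eta from the confidence of every listed attribute whose value is null."""
    updated = [dict(row) for row in confidences]
    for rec, conf in zip(records, updated):
        for attr in attributes:
            if rec[attr] is None:
                conf[attr] -= eta
    return updated

# Placeholder values; None marks the nulls implied by Table 2.
records = [
    {"Affiliation": "aff1", "Publication": "pub1"},
    {"Affiliation": "aff2", "Publication": None},
    {"Affiliation": "aff3", "Publication": "pub3"},
    {"Affiliation": None, "Publication": None},
]
before = [{"Affiliation": 0, "Publication": 0} for _ in records]
after = apply_null_interaction(records, before, ["Affiliation", "Publication"])
print([row["Affiliation"] for row in after])  # [0, 0, 0, -1]
print([row["Publication"] for row in after])  # [0, -1, 0, -1]
```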
After the priority predicates have been executed, the confidence of all attribute values is as shown in Table 3. Taking Research Area as an example, the initial values of the Research Area column are [0, 0, 0, 0]. According to the priority predicate $\mathrm{Prior}(\text{Salary}, \text{Research Area})$, the Research Area column can be updated from the Salary column [0, -1, -2, -2]. The Research Area values with identical confidence are re-sorted in ascending order of the Salary column, and 0, 1, and 2 are added to them respectively; records with equal Salary confidence receive the same increment. Restoring the original order gives [2, 1, 0, 0]. The remaining two priority predicates are then executed in the same way.
Table 3
Record (source)   Salary   Research Area   Affiliation   Publication
t1 (D1)           0        2               2             1
t2 (D2)           -1       1               1             -1
t3 (D3)           -2       0               0             0
t4 (D4)           -2       0               -1            -1
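The step from Table 2 to Table 3 can be sketched as follows. Several points are assumptions: the function name is hypothetical, ties in the sorting attribute share one increment (as the Research Area example suggests), and only $\mathrm{Prior}(\text{Salary}, \text{Research Area})$ is named in the text, so the other two predicates used below were chosen simply because they reproduce Table 3.

```python
def apply_priority_predicate(conf, a_i, a_j, skip_negative=False):
    """Break ties in column a_j using the ascending rank of column a_i.

    conf maps attribute name -> list of confidences, one entry per record.
    skip_negative mirrors the rule that negative values of multi-valued
    attributes are left untouched.
    """
    src, dst = conf[a_i], conf[a_j]
    updated = list(dst)
    for level in set(dst):
        if skip_negative and level < 0:
            continue
        group = [k for k, c in enumerate(dst) if c == level]
        # Dense rank: equal a_i confidences share the same increment.
        ranks = {c: r for r, c in enumerate(sorted({src[k] for k in group}))}
        for k in group:
            updated[k] = dst[k] + ranks[src[k]]
    result = dict(conf)
    result[a_j] = updated
    return result

table2 = {
    "Salary": [0, -1, -2, -2],
    "Research Area": [0, 0, 0, 0],
    "Affiliation": [0, 0, 0, -1],
    "Publication": [0, -1, 0, -1],
}
t = apply_priority_predicate(table2, "Salary", "Research Area", skip_negative=True)
t = apply_priority_predicate(t, "Salary", "Publication", skip_negative=True)
t = apply_priority_predicate(t, "Research Area", "Affiliation")
print(t["Research Area"], t["Affiliation"], t["Publication"])
# [2, 1, 0, 0] [2, 1, 0, -1] [1, -1, 0, -1]
```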
All attributes are then labeled. At any one time a researcher can have only one salary and one affiliated institution, so Salary and Affiliation return only one result. A researcher can, however, have several research areas and several works, so the Research Area and Publication attributes should return multiple result values. For the multi-valued attributes, all attribute values with confidence greater than or equal to the threshold 0 are returned at this point. For Research Area the returned result is { Data Integration, data cleaning Data cleaning&Google Knowledge management Information retrieve }, and for Publication the returned result is { Data integration, A diagnostic tool for data errors }.
Step 3: compute the data-source confidence.
The confidence values of all attribute values are then mapped into (0, 1) and normalized per attribute; the result is shown in Table 4. The values in Table 4 correspond to applying the logistic function $\frac{1}{1+e^{-x}}$ before normalization. The confidence of every data source is computed according to $\lambda_i = \frac{1}{|D_i|} \sum_{t \in D_i} d(t)$ and normalized so that $\sum_i \lambda_i = 1$.
Table 4
Record (source)   Salary     Research Area   Affiliation   Publication   λ
t1 (D1)           0.496353   0.33723         0.369959      0.413275      0.404204
t2 (D2)           0.26698    0.2799          0.307065      0.152035      0.251495
t3 (D3)           0.118333   0.191435        0.210014      0.282655      0.200609
t4 (D4)           0.118333   0.191435        0.112963      0.152035      0.143692
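Table 4 can be reproduced from Table 3 with the sketch below. The function names are hypothetical, and the use of the logistic function is inferred from the numbers in Table 4 rather than quoted from the text; each record is assumed to come from its own data source, as in the example.

```python
import math

def squash_and_normalize(column):
    """Map confidences into (0, 1) with the logistic function, then normalize to sum 1."""
    squashed = [1.0 / (1.0 + math.exp(-c)) for c in column]
    total = sum(squashed)
    return [s / total for s in squashed]

def source_confidence(table):
    """Normalized source confidence when every source holds exactly one record."""
    n = len(next(iter(table.values())))
    d = [sum(col[k] for col in table.values()) for k in range(n)]  # d(t) per record
    total = sum(d)
    return [x / total for x in d]

table3 = {
    "Salary": [0, -1, -2, -2],
    "Research Area": [2, 1, 0, 0],
    "Affiliation": [2, 1, 0, -1],
    "Publication": [1, -1, 0, -1],
}
table4 = {attr: squash_and_normalize(col) for attr, col in table3.items()}
print([round(x, 6) for x in table4["Salary"]])
print([round(x, 6) for x in source_confidence(table4)])
```

Because every normalized column sums to 1, dividing each record's total by the number of attributes gives the same source confidences as the explicit normalization used here.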
Finally, the confidence of all attribute values is updated iteratively with the two rules above (the data-source confidence formula and the attribute-value update) until the confidence of all attribute values converges. Note that each time the attribute-value confidence of a column has been updated, the confidence of that column must be normalized. Taking the first update as an example, for the attribute-value confidence of the Salary column:
[0.496353, 0.26698, 0.118333, 0.118333]
→ [0.496353 × 0.404204, 0.26698 × 0.251495, 0.118333 × 0.344301, 0.118333 × 0.344301]
→ [0.200628, 0.0671441, 0.0407422, 0.0407422]
Records $t_3$ and $t_4$ provide the same Salary value, so both are multiplied by the summed confidence of $D_3$ and $D_4$, 0.200609 + 0.143692 = 0.344301.
Similarly, for the attribute-value confidence of the Research Area column:
[0.33723, 0.2799, 0.191435, 0.191435] → [0.500009, 0.258216, 0.140871, 0.100903]
For the attribute-value confidence of the Affiliation column:
[0.369959, 0.307065, 0.210014, 0.112963] → [0.404204, 0.251495, 0.200609, 0.143692]
For the attribute-value confidence of the Publication column:
[0.413275, 0.152035, 0.282655, 0.152035] → [0.588542, 0.134713, 0.199777, 0.0769686]
Finally, the data-source confidence is updated:
λ = [0.5168, 0.209168, 0.164478, 0.109554]
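One update of a single column can be sketched as follows and checked against the Salary, Research Area, and Publication figures above. The function name and the placeholder value strings are hypothetical; the only facts taken from the text are which records share a value and which values are null.

```python
def update_column(column, values, lam):
    """One attribute-value update: multiply by the supporting source confidence.

    Returns the raw products and their normalized form. A null value (None)
    is supported only by its own source; other values are supported by every
    source that provides the same value.
    """
    raw = []
    for k, value in enumerate(values):
        if value is None:
            support = lam[k]
        else:
            support = sum(l for v, l in zip(values, lam) if v == value)
        raw.append(column[k] * support)
    total = sum(raw)
    return raw, [r / total for r in raw]

lam = [0.404204, 0.251495, 0.200609, 0.143692]

raw_salary, _ = update_column(
    [0.496353, 0.26698, 0.118333, 0.118333], ["142k", "120k", "88k", "88k"], lam)
print([round(x, 6) for x in raw_salary])

_, research = update_column(
    [0.33723, 0.2799, 0.191435, 0.191435], ["ra1", "ra2", "ra3", "ra4"], lam)
print([round(x, 6) for x in research])

_, publication = update_column(
    [0.413275, 0.152035, 0.282655, 0.152035], ["pub1", None, "pub3", None], lam)
print([round(x, 6) for x in publication])
```

Treating nulls separately keeps two sources that both lack a value from reinforcing each other.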
The process above is repeated until convergence; the final result is shown in Table 5.
Table 5
Record (source)   Salary   Research Area   Affiliation   Publication   λ
t1 (D1)           1        1               1             1             1
t2 (D2)           0        0               0             0             0
t3 (D3)           0        0               0             0             0
t4 (D4)           0        0               0             0             0
Step 4: obtain the result.
According to Table 5, the best attribute value can be selected for the Salary and Affiliation attributes, that is, the attribute value with the greatest confidence is the result. The result for Salary is {142k}, and the result for Affiliation is {Amazon}.
The above is only a preferred embodiment of the invention. It should be noted that a person of ordinary skill in the art may make various improvements and modifications without departing from the principle of the invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the invention.

Claims (4)

1. a kind of multi-source data collection cleaning method based on predicate, which is characterized in that including step:
(1) predicate model is built:Define priority predicate, state predicate and interaction predicate;Wherein,
Priority predicate is Prior (Ai, Aj), indicate attribute AiPriority be higher than attribute AjPriority;
State predicate is:Wherein, tiIndicate sentence i,Table Show attribute A in sentence ikAttribute value,Indicate predefinedWithBetween the condition that meets, φ (ti, tj) indicate predefined tiWith tjBetween the condition that meets;Stat(Ak) indicate to work as tiAnd tjWhen meeting condition P and φ, tiMatter Amount is higher than tj
Interacting predicate is:Interδ(A1..., Al), it indicates when data meet condition δ, the attribute A of the data1..., Al's Attribute value is of poor quality;
(2) predicate excavation is carried out to data set to be cleaned by the predicate model that step (1) defines, obtained excellent in data set First grade predicate, state predicate and interaction predicate;
(3) the attribute value confidence level of each data, including step are concentrated according to obtained predicate derived data:
It is 0 that (3-1) initialization data, which concentrates all properties value confidence level of data, and is arranged for each attribute value of each data Impact factor η, η are a constant;
The confidence level of (3-2) use state predicate and interaction predicate update per data each attribute value, when update, first use state Predicate update is updated with interaction predicate again, or first with the update of interaction predicate, use state predicate updates again;
Use state predicate updates the data the step of confidence level of each attribute value and is:Two data t of data concentration are enumerated two-by-twoi And tjIf tiAnd tjIn attribute AkOn meet state predicate:Then will Attribute valueConfidence level subtract η;
It is with the step of predicate updates the data the confidence level of each attribute value is interacted:All data that ergodic data is concentrated, if A data meets some interaction predicate Interδ(A1..., Al), then by data attribute A1..., AlAttribute value it is credible Degree subtracts η;
(3-3) updates the attribute value confidence level per data after the completion of step (2), with priority predicate, when update, according to The sequence of priority from high to low execution priority predicate successively;
Execution priority predicate Prior (Ai, Aj) the step of be:If a plurality of data are in attribute AjOn attribute value confidence level phase Together, then by them according to AiAttribute value confidence level do ascending sort, according to the sequence after sequence, coming n-th data AjAttribute value confidence level on add n-1;
After (3-4) obtain the confidence level of all properties value, for multi-valued attribute, returns to all confidence levels and be more than or equal to default threshold The attribute value of value is as a result;For only needing the attribute of one result of return, step (4) to (6) is executed;
(4) confidence level of all properties value is normalized;According to formulaCalculate number to be cleaned According to the confidence level for concentrating all data sources;Wherein, λiIndicate data source DiConfidence level, t indicates data source DiIn a number According to d (t) indicates that the confidence level of data t, the confidence level of data t are equal to the sum of the data all properties value confidence level;
(5) according to formulaThe confidence level of each attribute value is updated, D ' expressions are for belonging to Property AjAttribute value is providedData source;Return to step (4) after update;
(6) step (4) to (5) is repeated, until the confidence level of all properties value restrains;For need to only return to result Attribute, it is final result to find out the highest attribute value of confidence level under the attribute.
2. The predicate-based multi-source data set cleaning method according to claim 1, characterized in that the priority predicate is defined as follows: for attributes Ai and Aj, if pscore(Ai) < pscore(Aj), define the priority predicate Prior(Ai, Aj), which indicates that the priority pscore(Ai) of attribute Ai is higher than the priority pscore(Aj) of attribute Aj; where (formula image missing from the source text) H(Ai) denotes the Shannon entropy of attribute Ai, and pn(Ai) denotes the proportion of null values among all attribute values of attribute Ai.
3. The predicate-based multi-source data set cleaning method according to claim 2, characterized in that the state predicates and interaction predicates are obtained by a first-order logic predicate mining method.
4. The predicate-based multi-source data set cleaning method according to claim 3, characterized in that, before the data set is cleaned, all attributes of all data sets are manually labeled to mark whether each attribute needs to return one result or multiple results; if an attribute needs to return only one result, it is marked as a single-valued attribute, and during cleaning the attribute value with the highest confidence under that attribute is taken as the final result; if an attribute may have multiple results, it is marked as a multi-valued attribute, and during cleaning all attribute values under that attribute whose confidence is greater than the preset threshold are taken as the final result.
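The quantities named in claim 2 and the result selection described in claim 4 can be illustrated with a short Python sketch. The formula that combines H(Ai) and pn(Ai) into pscore is not reproduced in this text, so the sketch computes only the two inputs and does not attempt pscore itself. Taking the entropy over non-null values only, using a base-2 logarithm, using `None` as the null marker, and all function names are assumptions made for illustration. The threshold comparison is inclusive, following step (3-4) of claim 1.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Optional, Sequence


def null_ratio(values: Sequence[Optional[Hashable]]) -> float:
    """pn(A): fraction of an attribute's values that are null (None here)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def shannon_entropy(values: Sequence[Optional[Hashable]]) -> float:
    """H(A): Shannon entropy, in bits, of the non-null values of an attribute.
    Ignoring nulls and using log base 2 are assumptions of this sketch."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    total = len(present)
    return -sum((k / total) * math.log2(k / total)
                for k in Counter(present).values())


def select_results(conf: Dict[Hashable, float],
                   multi_valued: bool,
                   threshold: float = 0.0) -> List[Hashable]:
    """Claim 4: a single-valued attribute keeps only its highest-confidence
    value; a multi-valued attribute keeps every value whose confidence
    reaches the preset threshold."""
    if not conf:
        return []
    if multi_valued:
        return [v for v, c in conf.items() if c >= threshold]
    return [max(conf, key=conf.get)]
```

Here `conf` maps each candidate value of one attribute to its final confidence, and the `multi_valued` flag stands in for the manual single-valued or multi-valued label that claim 4 requires before cleaning.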
CN201810578708.3A 2018-06-06 2018-06-06 Multi-source data set cleaning method based on predicates Active CN108776697B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810578708.3A CN108776697B (en) 2018-06-06 2018-06-06 Multi-source data set cleaning method based on predicates

Publications (2)

Publication Number Publication Date
CN108776697A true CN108776697A (en) 2018-11-09
CN108776697B CN108776697B (en) 2020-06-09

Family

ID=64024668

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810578708.3A Active CN108776697B (en) 2018-06-06 2018-06-06 Multi-source data set cleaning method based on predicates

Country Status (1)

Country Link
CN (1) CN108776697B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109582906A (en) * 2018-11-30 2019-04-05 北京锐安科技有限公司 Determination method, apparatus, equipment and the storage medium of data reliability

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1811772A (en) * 2005-01-25 2006-08-02 翁托普里塞有限公司 Integration platform for heterogeneous information sources
US20090327255A1 (en) * 2008-06-26 2009-12-31 Microsoft Corporation View matching of materialized xml views
CN105045807A (en) * 2015-06-04 2015-11-11 浙江力石科技股份有限公司 Data cleaning algorithm based on Internet trading information
CN105279232A (en) * 2015-09-22 2016-01-27 武汉开目信息技术有限责任公司 Method for showing screening and classification of data set in PDM (Product Data Management) system
CN105608228A (en) * 2016-01-29 2016-05-25 中国科学院计算机网络信息中心 High-efficiency distributed RDF data storage method

Also Published As

Publication number Publication date
CN108776697B (en) 2020-06-09

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant