CN108776697A - Multi-source data set cleaning method based on predicates - Google Patents
- Publication number: CN108776697A
- Application number: CN201810578708.3A
- Authority: CN (China)
- Prior art keywords: data, predicate, attribute, confidence level, value
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention provides a predicate-based method for cleaning multi-source data sets that effectively identifies the most reliable data items in homogeneous multi-source data sets, and relates to fields such as data cleaning and data fusion. The method comprises: 1) mining predicates automatically and filtering the mined predicates; 2) deriving, from the predicates, the confidence of each attribute value of every entity in the data set; 3) establishing the relationship between attribute value confidence and data source confidence, and calculating the data source confidence; 4) combining data source confidence and attribute value confidence to find the data items with the highest confidence. By analyzing information that comes from different data sources but describes the same content, the invention filters out redundant, erroneous and outdated data and keeps the data with the highest confidence. This lays a foundation for subsequent data analysis and matters for the efficiency and accuracy of downstream data processing.
Description
Technical field
The invention relates to fields such as data cleaning and data fusion, and in particular to a predicate-based method for cleaning multi-source data sets.
Background art
In the information age, descriptions of the same event or object can be found in a large number of data sources. Because of timing errors, format errors, and differences in accuracy, completeness and semantic ambiguity, different data sources describe the same entity inconsistently. After data has been collected from different sources, resolving the inconsistencies among records that describe the same entity is essential for subsequent data analysis. A simple voting strategy, which selects the description supported by the most data sources, is not suited to today's Web environment; the confidence of the data sources, the confidence of the data itself and some prior knowledge must be taken into account to design a more sophisticated cleaning strategy. Existing cleaning strategies mainly include the following:
Chinese patent application No. 201410387772 discloses a "bus road condition processing system and method based on multi-source traffic data fusion". It fuses traffic data from different sources that describe bus road conditions to obtain road condition information for display. Its input is specialized traffic data; it does not judge confidence with predicates, nor does it compute data source confidence from the relationship between data and data sources.
Chinese patent application No. 201110369877 discloses "a multi-source data integration platform and its construction method". It manages different kinds of data, and no consistency problem exists among those data.
US patent No. 8190546 discloses "Dependency between sources in truth discovery". It builds a probabilistic graphical model from the copying relationships between data sources to assess the confidence of data sources and data, and does not use predicates to assess the confidence of data.
Summary of the invention
Object of the invention: In current multi-source data fusion, records describing the same entity are inconsistent; in particular, it is hard to determine initial values for data confidence, and it is unclear how to combine data source confidence with data confidence. To overcome these problems, the invention provides a multi-source data set cleaning method based on data source confidence and data confidence. It evaluates data confidence with predicates, then computes data source confidence from the data confidence, and finally finds the data with the highest confidence, thereby achieving data cleaning.
Technical solution: To achieve the above technical effect, the invention proposes a predicate-based multi-source data set cleaning method, comprising the steps:
(1) Build the predicate model: define priority predicates, state predicates and interaction predicates, where:
A priority predicate is $\mathrm{Prior}(A_i, A_j)$, indicating that the priority of attribute $A_i$ is higher than that of attribute $A_j$.
A state predicate $\mathrm{Stat}(A_k)$ is defined over two records $t_i$ and $t_j$, where $t_i$ denotes record $i$, $t_i[A_k]$ denotes the value of attribute $A_k$ in record $i$, $P(t_i[A_k], t_j[A_k])$ denotes a predefined condition between $t_i[A_k]$ and $t_j[A_k]$, and $\phi(t_i, t_j)$ denotes a predefined condition between $t_i$ and $t_j$. $\mathrm{Stat}(A_k)$ indicates that when $t_i$ and $t_j$ satisfy conditions $P$ and $\phi$, the quality of $t_i$ is higher than that of $t_j$.
An interaction predicate is $\mathrm{Inter}_\delta(A_1, \ldots, A_l)$, indicating that when a record satisfies condition $\delta$, the values of its attributes $A_1, \ldots, A_l$ are of poor quality.
(2) Mine predicates from the data set to be cleaned using the predicate model defined in step (1), obtaining the priority predicates, state predicates and interaction predicates of the data set.
(3) Derive the attribute value confidence of each record in the data set from the mined predicates, comprising the steps:
(3-1) Initialize the confidence of every attribute value of every record in the data set to 0, and set an impact factor η for each attribute value of each record, where η is a constant.
(3-2) Update the confidence of each attribute value of each record with the state predicates and the interaction predicates. The state predicates may be applied before the interaction predicates, or the interaction predicates before the state predicates.
To update with a state predicate: enumerate the records of the data set in pairs $t_i$ and $t_j$; if $t_i$ and $t_j$ satisfy a state predicate on attribute $A_k$, subtract η from the confidence of the attribute value $t_j[A_k]$.
To update with an interaction predicate: traverse all records in the data set; if a record satisfies some interaction predicate $\mathrm{Inter}_\delta(A_1, \ldots, A_l)$, subtract η from the confidence of the values of its attributes $A_1, \ldots, A_l$.
(3-3) After step (3-2) is complete, update the attribute value confidence of each record with the priority predicates, executing them in order of priority from high to low.
To execute a priority predicate $\mathrm{Prior}(A_i, A_j)$: if several records have the same confidence for their values of attribute $A_j$, sort them in ascending order of the confidence of their $A_i$ values and, following that order, add $n-1$ to the confidence of the $A_j$ value of the record ranked $n$-th.
(3-4) Once the confidence of all attribute values has been obtained: for a multi-valued attribute, return all attribute values whose confidence is greater than or equal to a preset threshold as the result; for an attribute that needs only one result, execute steps (4) to (6).
(4) Normalize the confidence of all attribute values. Compute the confidence of every data source in the data set to be cleaned according to $\lambda_i = \frac{1}{|D_i|} \sum_{t \in D_i} d(t)$, where $\lambda_i$ is the confidence of data source $D_i$, $t$ is a record in $D_i$, $|D_i|$ is the number of records in $D_i$, and $d(t)$ is the confidence of record $t$, equal to the sum of the confidences of all its attribute values.
(5) Update the confidence of each attribute value according to $c(t[A_j]) \leftarrow c(t[A_j]) \cdot \sum_{D_k \in D'} \lambda_k$, where $c(t[A_j])$ is the confidence of the attribute value $t[A_j]$ and $D'$ is the set of data sources that provide the value $t[A_j]$ for attribute $A_j$. Return to step (4) after the update.
(6) Repeat steps (4) and (5) until the confidence of all attribute values converges. For an attribute that needs only one result, the attribute value with the highest confidence under that attribute is the final result.
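Steps (4) to (6) can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the patented implementation: it assumes one record per data source, positive starting confidences, and a hypothetical layout in which `values[i][j]` is the value that source i reports for attribute j and `conf[i][j]` is its confidence. The names `normalize_columns`, `iterate_confidence`, `tol` and `max_iter` are invented for the example.

```python
from collections import defaultdict

def normalize_columns(conf):
    """Scale each attribute column so that its confidences sum to 1."""
    n_attrs = len(conf[0])
    totals = [sum(row[j] for row in conf) for j in range(n_attrs)]
    return [[row[j] / totals[j] for j in range(n_attrs)] for row in conf]

def iterate_confidence(values, conf, tol=1e-9, max_iter=1000):
    """values[i][j]: value source i reports for attribute j (None = null).
    conf[i][j]: positive starting confidence of that value.
    Returns (source confidences used in the last update, value confidences)."""
    n_src, n_attrs = len(values), len(values[0])
    conf = normalize_columns(conf)
    lam = [1.0 / n_src] * n_src
    for _ in range(max_iter):
        # Step (4): with one record per source, d(t) is the row sum; dividing
        # by the grand total makes the source confidences sum to 1.
        d = [sum(row) for row in conf]
        lam = [x / sum(d) for x in d]
        # Step (5): multiply each confidence by the summed confidence of the
        # sources reporting the same value. A null counts only its own source.
        new = [[0.0] * n_attrs for _ in range(n_src)]
        for j in range(n_attrs):
            support = defaultdict(float)
            for i in range(n_src):
                if values[i][j] is not None:
                    support[values[i][j]] += lam[i]
            for i in range(n_src):
                v = values[i][j]
                new[i][j] = conf[i][j] * (lam[i] if v is None else support[v])
        new = normalize_columns(new)
        change = max(abs(new[i][j] - conf[i][j])
                     for i in range(n_src) for j in range(n_attrs))
        conf = new
        if change < tol:  # Step (6): stop once the confidences settle
            break
    return lam, conf
```

For a single-valued attribute j, the final answer would then be the value in the row with the largest `conf[i][j]`.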
Further, the priority predicate is defined as follows: for attributes $A_i$ and $A_j$, if $p_{score}(A_i) < p_{score}(A_j)$, define the priority predicate $\mathrm{Prior}(A_i, A_j)$, indicating that the priority of $A_i$ is higher than that of $A_j$; that is, a smaller $p_{score}$ means a higher priority. $p_{score}(A_i)$ is computed from $H(A_i)$, the Shannon entropy of attribute $A_i$, and $p_n(A_i)$, the proportion of null values among all values of attribute $A_i$.
Further, the state predicates and interaction predicates are obtained by a first-order logic predicate mining method.
Further, before the data set is cleaned, all attributes of all data sets are labeled manually to mark whether each attribute should return one result or several. If an attribute needs only one result, it is marked as a single-valued attribute, and during cleaning the attribute value with the highest confidence under it is the final result. If an attribute may have several results, it is marked as a multi-valued attribute, and during cleaning all attribute values under it whose confidence is greater than the preset threshold are the final results.
Beneficial effects: Compared with the prior art, the invention has the following advantages:
It does not assume that an attribute has only one correct value, does not rely on crowdsourcing, and needs little manual intervention; it uses automatically mined predicates and the relationship between data sets and attribute values to find attribute values with high confidence. The invention scores the confidence of attribute values by mining custom predicates. For multi-answer attributes, it returns the attribute values whose confidence reaches the preset threshold. For the remaining attributes, it further updates attribute value confidence using the relationship between data source confidence and attribute value confidence, and returns the attribute value with the highest confidence. This matters for improving the efficiency and accuracy of data analysis. With the technical solution of the invention, engineers can implement the corresponding software relatively easily.
Brief description of the drawings
Fig. 1 is the flow chart of the invention;
Fig. 2 is a schematic diagram of the computation that updates data source confidence and attribute value confidence in the invention.
Detailed description
The invention is further described below with reference to the drawings.
Fig. 1 shows the flow chart of the invention, which mainly comprises the following parts:
A) First define three kinds of predicates:
1) Priority predicate: for attributes $A_i$ and $A_j$, if $p_{score}(A_i) < p_{score}(A_j)$, define a priority predicate $\mathrm{Prior}(A_i, A_j)$, indicating that the priority of $A_i$ is higher than that of $A_j$. $p_{score}(A_i)$ is computed from $H(A_i)$, the Shannon entropy of attribute $A_i$, and $p_n(A_i)$, the proportion of null values among all values of $A_i$.
$H(A_i)$ is calculated as $H(A_i) = -\sum_{x \in X} p(x) \log_2 p(x)$, where $X$ is the value domain of attribute $A_i$ and $p(x)$ is the proportion of attribute value $x$ among all values of the attribute (null values excluded).
2) State predicate: a state predicate is a first-order logic predicate $\mathrm{Stat}(A_k)$, indicating that if $t_i$ and $t_j$ satisfy conditions $P$ and $\phi$, the quality of $t_i$ is higher than that of $t_j$.
In this definition, each $f_i(v_1, v_2)$ appearing in the condition can be replaced by $v_1 = v_2$ or $v_1 \neq v_2$. $P$ in the state predicate definition is replaced by one of six predefined predicates, $P_1(v_1, v_2)$ to $P_6(v_1, v_2)$. $P_1$ and $P_2$ apply to numeric attribute values: $P_1(v_1, v_2)$ means $v_1$ is greater than $v_2$, and $P_2(v_1, v_2)$ means $v_1$ is less than $v_2$. $P_3$ and $P_4$ apply to string attribute values: $P_3(v_1, v_2)$ means $v_1$ is longer than $v_2$, and $P_4(v_1, v_2)$ means $v_1$ is shorter than $v_2$. $P_5$ and $P_6$ also apply to string attribute values. A string is more detailed when it carries more information, which is measured by comparing the information content of the two strings with the Shannon entropy formula: $P_5(v_1, v_2)$ means $v_1$ is more detailed than $v_2$, and $P_6(v_1, v_2)$ means $v_1$ is more concise than $v_2$.
3) Interaction predicate: an interaction predicate is a first-order logic predicate of the form $\mathrm{Inter}_\delta(A_1, \ldots, A_l)$, indicating that if a record satisfies condition $\delta$, the values of its attributes $A_1, \ldots, A_l$ are of poor quality.
In this definition, each $P_i'$ in the condition $\delta$ can be replaced by any of the predicates $P_1$ to $P_6$, or by one of the following four predicates: $P_7(v_1, v_2)$, $P_8(v_1, v_2)$, $P_9(v_1, v_2)$, $P_{10}(v_1, v_2)$. $P_7$ and $P_8$ apply to string attribute values: $P_7(v_1, v_2)$ means $v_1$ contains $v_2$, and $P_8(v_1, v_2)$ means $v_1$ does not contain $v_2$. $P_9$ and $P_{10}$ apply to both string and numeric attribute values: $P_9(v_1, v_2)$ means $v_1$ equals $v_2$, and $P_{10}(v_1, v_2)$ means $v_1$ does not equal $v_2$.
Predicates are then mined from the data set. Priority predicates are obtained by computing the priority $p_{score}$ of every attribute in the data set. State predicates and interaction predicates are obtained automatically, according to their first-order logic definitions, by first-order inductive learning. After the state and interaction predicates have been obtained, domain experts can be asked to remove invalid predicates to further improve their usability, yielding the final set of usable predicates;
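The ten comparison predicates described above can be written as plain two-argument functions. The sketch below is illustrative only: the dictionary name `PREDICATES` and the helper `char_entropy_bits` are invented here, and measuring "more detailed" by the total character-level Shannon information of a string is an assumption, since the text only says that the Shannon entropy formula is used to compare information content.

```python
import math
from collections import Counter

def char_entropy_bits(s):
    """Total character-level Shannon information of a string, in bits.
    Assumed metric for P5/P6: string length times per-character entropy."""
    counts = Counter(s)
    n = len(s)
    return -sum(c * math.log2(c / n) for c in counts.values())

PREDICATES = {
    "P1": lambda v1, v2: v1 > v2,              # numeric: v1 is greater
    "P2": lambda v1, v2: v1 < v2,              # numeric: v1 is smaller
    "P3": lambda v1, v2: len(v1) > len(v2),    # string: v1 is longer
    "P4": lambda v1, v2: len(v1) < len(v2),    # string: v1 is shorter
    "P5": lambda v1, v2: char_entropy_bits(v1) > char_entropy_bits(v2),  # more detailed
    "P6": lambda v1, v2: char_entropy_bits(v1) < char_entropy_bits(v2),  # more concise
    "P7": lambda v1, v2: v2 in v1,             # string: v1 contains v2
    "P8": lambda v1, v2: v2 not in v1,         # string: v1 does not contain v2
    "P9": lambda v1, v2: v1 == v2,             # string or numeric: equal
    "P10": lambda v1, v2: v1 != v2,            # string or numeric: not equal
}
```

A mined state or interaction predicate could then refer to these by name, for example `PREDICATES["P1"](142, 120)` to test whether one salary is greater than another.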
B) First initialize the confidence of all attribute values to 0 and manually set the parameter η (the impact factor) to a real number. Then execute the three classes of predicates in the following order to derive the confidence of each attribute value of every entity in the data set:
1) State predicates: enumerate the records of the data set in pairs; if two records $t_i$ and $t_j$ satisfy some state predicate $\mathrm{Stat}(A_k)$, subtract η from the confidence of $t_j[A_k]$.
2) Interaction predicates: traverse all records in the data set; if a record satisfies some interaction predicate $\mathrm{Inter}_\delta(A_1, \ldots, A_l)$, subtract η from the confidence of its $A_1, \ldots, A_l$ attribute values.
3) Priority predicates: because the priority of an attribute is determined by $p_{score}$, the higher-priority predicates, i.e. those whose two attributes have the smaller sum of $p_{score}$, must be executed first so that the attribute value confidences in use are up to date. The function of a priority predicate $\mathrm{Prior}(A_i, A_j)$ is as follows: if two records $t_1$ and $t_2$ have the same confidence for their $A_j$ values, the confidence of their $A_i$ values can be used to judge which of $t_1[A_j]$ and $t_2[A_j]$ is better. Concretely, all records with the same confidence on $A_j$ are sorted in ascending order of their $A_i$ confidence and, following that order, $n-1$ is added to the $A_j$ confidence of the record ranked $n$-th, so that the higher-priority $A_i$ distinguishes $A_j$ values of equal confidence. Note that for attributes that return multiple results, the priority predicate need not be applied to values whose confidence is negative.
Once the confidence of all attribute values has been obtained, for attributes that return multiple results, all attribute values whose confidence is greater than or equal to the preset threshold are returned as the result; for attributes that return a single result, continue with the following steps.
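The state and interaction updates of part B) can be sketched as two small functions. Everything concrete here is hypothetical: the function names, the record layout (dictionaries keyed by attribute name), the placeholder attribute values, and the two example predicates ("a higher salary is better" and "a null value is poor"). The salaries assume that t3 and t4 report the same figure, which is consistent with their identical Salary confidence in the worked example further below.

```python
from itertools import permutations

def apply_state_predicate(records, conf, attr, holds, eta=1.0):
    """For every ordered pair (ti, tj): if holds(ti, tj) is true, ti is judged
    better than tj on `attr`, so tj's confidence on `attr` drops by eta."""
    for i, j in permutations(range(len(records)), 2):
        if holds(records[i], records[j]):
            conf[j][attr] -= eta

def apply_interaction_predicate(records, conf, attrs, delta, eta=1.0):
    """If a record satisfies condition delta, the listed attributes of that
    record are judged poor, so each of them loses eta."""
    for i, rec in enumerate(records):
        if delta(rec):
            for a in attrs:
                conf[i][a] -= eta

# Hypothetical records t1..t4; only the pattern of nulls and salaries matters.
records = [
    {"Salary": 142, "Affiliation": "org-1", "Publication": "pub-1"},
    {"Salary": 120, "Affiliation": "org-2", "Publication": None},
    {"Salary": 88, "Affiliation": "org-3", "Publication": "pub-3"},
    {"Salary": 88, "Affiliation": None, "Publication": None},
]
conf = [{a: 0.0 for a in rec} for rec in records]

apply_state_predicate(records, conf, "Salary",
                      lambda ti, tj: ti["Salary"] > tj["Salary"])
for attr in ("Affiliation", "Publication"):
    apply_interaction_predicate(records, conf, [attr],
                                lambda rec, a=attr: rec[a] is None)
```

Because both updates only read attribute values and subtract a constant, calling them in the opposite order gives the same confidences.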
C) Compute the data source confidence according to $\lambda_i = \frac{1}{|D_i|} \sum_{t \in D_i} d(t)$ and normalize the confidence of all data sources so that $\sum_i \lambda_i = 1$, as shown in Fig. 2. Then update the confidence of each attribute value with $c(t[A_j]) \leftarrow c(t[A_j]) \cdot \sum_{D_k \in D'} \lambda_k$: the confidence of each attribute value equals the sum of the confidences of the data sources that provide that value, multiplied by the value's own previous confidence. Note that for a null value, the data source confidence used is only that of its own data source, and does not include the confidence of other data sources that also provide a null value. The confidence of all attribute values is then normalized as well, so that for any attribute the confidences of all its possible values sum to 1. These steps are repeated until the data source confidences and the attribute value confidences converge.
D) Finally, for an attribute that needs only one result, the attribute value with the highest confidence under that attribute is the final result; together with the results from B), these form the final output.
An embodiment of the invention is described below with a concrete sample.
Let the confidence of a data source $D_i$ be $\lambda_i$ and let its attribute set be $\{A_1, \ldots, A_n\}$. Let $t \in D_i$ be a record of this data source, where $t[A_j]$ denotes the value of attribute $A_j$ in $t$. Let $d(t)$ denote the confidence of the record and $c(t[A_j])$ the confidence of the attribute value $t[A_j]$. The confidence of a record equals the sum of the confidences of all its attribute values:
$$d(t) = \sum_{j=1}^{n} c(t[A_j])$$
The confidence of a data source equals the average confidence of the records it contains, i.e. the sum of the record confidences divided by the number of records:
$$\lambda_i = \frac{1}{|D_i|} \sum_{t \in D_i} d(t)$$
Let $D'$ be the set of data sources that provide the value $t[A_j]$ for attribute $A_j$. The confidence of $t[A_j]$ is then the sum of the confidences of all data sources that provide the value, multiplied by its own previous confidence:
$$c(t[A_j]) \leftarrow c(t[A_j]) \cdot \sum_{D_k \in D'} \lambda_k$$
Embodiment: the data set to be cleaned is shown in the table below. It contains 5 records from 5 data sources, where $t_i$ comes from data source $D_i$, and describes a researcher named Mary.
Table of the data set to be cleaned
First, the data set is inspected briefly and preprocessed manually to remove obviously unreasonable data, so that the subsequent cleaning is more efficient and more effective.
For example, record $t_5$ in the data set above has a negative salary, which is clearly unreasonable, and the values of its next three attributes, Research Area, Affiliation and Publication, are not meaningful either. $t_5$ can therefore be treated as noise and deleted directly; it takes no part in the subsequent cleaning.
Next, consider record $t_4$: the value of its Publication attribute is "-", which is also unreasonable. Because the other attribute values of $t_4$ are still informative, the Publication value is simply changed to "null".
After this simple preprocessing, the data set is as follows:
Step 1: mine predicates from the data set.
Mining priority predicates:
For priority predicates, compute the entropy and the null-value ratio of every attribute.
Entropy formula: $H(A_i) = -\sum_{x \in X} p(x) \log_2 p(x)$
where $p(x)$ is the proportion of attribute value $x$ among all values of the attribute (null values excluded).
Take Salary as an example: it has 3 distinct attribute values, 142k, 120k and 88k.
From the proportions of these values, the entropy of the Salary attribute is obtained:
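The entropy and null-ratio statistics can be computed in a few lines. This is a sketch under an assumption: the concrete Salary column used below, with 88k appearing twice among four records, is not stated in the text and is inferred from the three distinct values listed and from t3 and t4 sharing the same Salary confidence later on. The function names are invented for the example.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """H(A) = -sum p(x) log2 p(x), with p(x) taken over the non-null values."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return -sum((c / len(present)) * math.log2(c / len(present))
                for c in counts.values())

def null_ratio(values):
    """p_n(A): share of null values among all values of the attribute."""
    return sum(v is None for v in values) / len(values)

salary = ["142k", "120k", "88k", "88k"]  # assumed column: 88k appears twice
print(shannon_entropy(salary))  # 1.5
print(null_ratio(salary))       # 0.0
```

How the two statistics are combined into $p_{score}$ is left out here, because the combining formula is not legible in this text.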
The other attributes are handled in the same way:
$p_n(\mathrm{Salary}) = p_n(\mathrm{ResearchArea}) = 0$
This yields:
Three priority predicates can be defined from the resulting less-than relations:
Mining state predicates:
These are obtained automatically with the first-order logic predicate mining algorithm First Order Inductive Learner:
Mining interaction predicates:
These are likewise obtained automatically with First Order Inductive Learner:
Step 2: derive the confidence of each attribute value of every entity.
Initialize the confidence of all attribute values to 0 and set the impact factor η = 1. For attributes that return multiple results, set the confidence threshold to 0. Then apply the predicates in order (different execution orders may produce different results).
State predicates and interaction predicates both act on attribute values, and the attribute values themselves do not change. State predicates are therefore independent of one another, as are interaction predicates, and state predicates are independent of interaction predicates; they can be invoked in any order.
Priority predicates, however, act on the confidence of attribute values, so they must be applied after all state predicates and interaction predicates.
In addition, priority predicates must follow a certain order among themselves. So that the attribute value confidences in use are up to date, the higher-priority predicates, i.e. those whose two attributes have the smaller sum of $p_{score}$, must be executed first. For the three priority predicates of this data set, the sums of $p_{score}$ are 3.5, 3.5 and 4.11 respectively, so the priority predicates are executed in that order.
After the state predicates have been executed, the confidence of all attribute values is as shown in Table 1. There are 4 records here, which are compared in pairs, 16 comparisons in total. Take $t_1$ and $t_2$ as an example: they satisfy the state predicate, so 1 is subtracted from the Salary confidence of $t_2$.
Table 1
Record (source) | Salary | Research Area | Affiliation | Publication |
---|---|---|---|---|
t1(D1) | 0 | 0 | 0 | 0 |
t2(D2) | -1 | 0 | 0 | 0 |
t3(D3) | -2 | 0 | 0 | 0 |
t4(D4) | -2 | 0 | 0 | 0 |
After the interaction predicates have been executed, the confidence of all attribute values is as shown in Table 2. Here, according to the interaction predicates, 1 is subtracted from the confidence of every Affiliation and Publication value that is null.
Table 2
Record (source) | Salary | Research Area | Affiliation | Publication |
---|---|---|---|---|
t1(D1) | 0 | 0 | 0 | 0 |
t2(D2) | -1 | 0 | 0 | -1 |
t3(D3) | -2 | 0 | 0 | 0 |
t4(D4) | -2 | 0 | -1 | -1 |
After the priority predicates have been executed, the confidence of all attribute values is as shown in Table 3. Take Research Area as an example: the initial values of the Research Area column are [0, 0, 0, 0]. According to the priority predicate $\mathrm{Prior}(\mathrm{Salary}, \mathrm{ResearchArea})$, the Research Area column can be updated from the Salary column [0, -1, -2, -2]. The entries of the Research Area column that have the same confidence are reordered by ascending Salary confidence; 0, 1 and 2 are then added in that order (records tied on Salary receive the same increment), and restoring the original order gives [2, 1, 0, 0]. The remaining priority predicates are then executed in the same way.
Table 3
Record (source) | Salary | Research Area | Affiliation | Publication |
---|---|---|---|---|
t1(D1) | 0 | 2 | 2 | 1 |
t2(D2) | -1 | 1 | 1 | -1 |
t3(D3) | -2 | 0 | 0 | 0 |
t4(D4) | -2 | 0 | -1 | -1 |
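The priority update that produces the Research Area column of Table 3 can be sketched as follows. The function name is invented, and one detail is an assumption: records that are tied on the higher-priority attribute share a rank. The text does not state this rule, but it is what reproduces the Research Area column from the Salary column.

```python
from collections import defaultdict

def apply_priority_predicate(conf_i, conf_j):
    """Prior(Ai, Aj): among records tied on their Aj confidence, rank them by
    ascending Ai confidence and add (rank - 1) to the Aj confidence.
    Records also tied on Ai share a rank (assumed; it reproduces Table 3)."""
    groups = defaultdict(list)
    for idx, c in enumerate(conf_j):
        groups[c].append(idx)
    updated = list(conf_j)
    for members in groups.values():
        if len(members) < 2:
            continue  # nothing to disambiguate
        levels = sorted({conf_i[m] for m in members})
        for m in members:
            updated[m] = conf_j[m] + levels.index(conf_i[m])
    return updated

salary = [0, -1, -2, -2]        # Salary column of Table 2
research_area = [0, 0, 0, 0]    # Research Area column of Table 2
print(apply_priority_predicate(salary, research_area))  # [2, 1, 0, 0]
```

The sketch does not implement the exception for multi-valued attributes, under which values with negative confidence are left untouched.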
All attributes are then labeled. A researcher can have only one salary and one affiliation at a time, so Salary and Affiliation return a single result each; a researcher can have several research areas and works, so the Research Area and Publication attributes return multiple values. For multi-valued attributes, all attribute values whose confidence is greater than or equal to the threshold 0 are returned at this point. For Research Area the result is {Data integration, data cleaning Data cleaning&Google Knowledge management Information retrieve}; for Publication the result is {Data integration, A diagnostic tool for data errors}.
Step 3: compute the data source confidence.
The confidence values of all attribute values are first mapped into the interval (0, 1) and normalized; the result is shown in Table 4. The confidence of every data source is then computed according to $\lambda_i = \frac{1}{|D_i|} \sum_{t \in D_i} d(t)$.
Table 4
Record (source) | Salary | Research Area | Affiliation | Publication | λ |
---|---|---|---|---|---|
t1(D1) | 0.496353 | 0.33723 | 0.369959 | 0.413275 | 0.404204 |
t2(D2) | 0.26698 | 0.2799 | 0.307065 | 0.152035 | 0.251495 |
t3(D3) | 0.118333 | 0.191435 | 0.210014 | 0.282655 | 0.200609 |
t4(D4) | 0.118333 | 0.191435 | 0.112963 | 0.152035 | 0.143692 |
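The step from Table 3 to Table 4 can be checked numerically. The mapping formula itself is not legible in this text; the sketch below assumes it is the logistic function $1/(1+e^{-x})$ followed by dividing each column by its sum, an assumption chosen because it agrees with the Table 4 entries to the printed precision. Variable names are illustrative.

```python
import math

raw = [  # Table 3: rows t1..t4; columns Salary, Research Area, Affiliation, Publication
    [0, 2, 2, 1],
    [-1, 1, 1, -1],
    [-2, 0, 0, 0],
    [-2, 0, -1, -1],
]

def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

n_attrs = len(raw[0])
squashed = [[sigmoid(x) for x in row] for row in raw]
col_sums = [sum(row[j] for row in squashed) for j in range(n_attrs)]
conf = [[row[j] / col_sums[j] for j in range(n_attrs)] for row in squashed]

# One record per source, so each source's confidence is its row sum,
# normalized so that the source confidences add up to 1.
row_sums = [sum(row) for row in conf]
lam = [s / sum(row_sums) for s in row_sums]

print([round(x, 6) for x in conf[0]])  # first row of the normalized table
print([round(x, 6) for x in lam])      # source confidences
```

Since every column sums to 1 after normalization, normalizing the source confidences here amounts to dividing each row sum by the number of attributes.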
Finally, the two formulas $\lambda_i = \frac{1}{|D_i|} \sum_{t \in D_i} d(t)$ and $c(t[A_j]) \leftarrow c(t[A_j]) \cdot \sum_{D_k \in D'} \lambda_k$ are applied iteratively to update the confidence of all attribute values until they converge. Note that each time a column of attribute values is updated, the confidences in that column must be normalized. Take the first update as an example. For the Salary column:
[0.496353, 0.26698, 0.118333, 0.118333]
→ [0.496353 × 0.404204, 0.26698 × 0.251495, 0.118333 × 0.344301, 0.118333 × 0.344301]
→ [0.200628, 0.0671441, 0.0407422, 0.0407422]
Similarly, for the Research Area column:
[0.33723, 0.2799, 0.191435, 0.191435] → [0.500009, 0.258216, 0.140871, 0.100903]
For the Affiliation column:
[0.369959, 0.307065, 0.210014, 0.112963] → [0.404204, 0.251495, 0.200609, 0.143692]
For the Publication column:
[0.413275, 0.152035, 0.282655, 0.152035] → [0.588542, 0.134713, 0.199777, 0.0769686]
Finally, the data source confidences are updated:
λ = [0.5168, 0.209168, 0.164478, 0.109554]
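The first update can be reproduced for two of the columns with a few lines of arithmetic. The inputs are the Table 4 figures; which records share a value is an assumption read off the example, namely that t3 and t4 report the same Salary (hence the shared factor 0.344301) while all four Research Area values differ. Variable names are illustrative.

```python
lam = [0.404204, 0.251495, 0.200609, 0.143692]    # Table 4, column λ
salary = [0.496353, 0.26698, 0.118333, 0.118333]  # Table 4, Salary
research = [0.33723, 0.2799, 0.191435, 0.191435]  # Table 4, Research Area

# Assumed: t3 and t4 report the same Salary, so both values are supported by
# D3 and D4 together; every other value is supported by its own source only.
salary_support = [lam[0], lam[1], lam[2] + lam[3], lam[2] + lam[3]]
salary_new = [c * s for c, s in zip(salary, salary_support)]

# Research Area: each value is multiplied by its own source's confidence,
# then the column is normalized to sum to 1.
research_raw = [c * s for c, s in zip(research, lam)]
research_new = [x / sum(research_raw) for x in research_raw]

print([round(x, 6) for x in salary_new])    # products before normalization
print([round(x, 6) for x in research_new])  # normalized column
```

The Salary figures correspond to the products listed in the text before column normalization; the Research Area figures are the normalized column.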
This process is repeated until convergence; the final result is shown in Table 5.
Table 5
Record (source) | Salary | Research Area | Affiliation | Publication | λ |
---|---|---|---|---|---|
t1(D1) | 1 | 1 | 1 | 1 | 1 |
t2(D2) | 0 | 0 | 0 | 0 | 0 |
t3(D3) | 0 | 0 | 0 | 0 | 0 |
t4(D4) | 0 | 0 | 0 | 0 | 0 |
Step 4: obtain the result.
From Table 5, the best attribute values for the Salary and Affiliation attributes can be selected, i.e. the attribute values with the highest confidence. The result for Salary is {142k} and the result for Affiliation is {Amazon}.
The above is only a preferred embodiment of the invention. It should be noted that a person of ordinary skill in the art may make various improvements and modifications without departing from the principle of the invention, and these improvements and modifications shall also be regarded as falling within the scope of protection of the invention.
Claims (4)
1. A predicate-based multi-source data set cleaning method, characterized by comprising the steps:
(1) Build the predicate model: define priority predicates, state predicates and interaction predicates, where:
A priority predicate is $\mathrm{Prior}(A_i, A_j)$, indicating that the priority of attribute $A_i$ is higher than that of attribute $A_j$;
A state predicate $\mathrm{Stat}(A_k)$ is defined over two records $t_i$ and $t_j$, where $t_i$ denotes record $i$, $t_i[A_k]$ denotes the value of attribute $A_k$ in record $i$, $P(t_i[A_k], t_j[A_k])$ denotes a predefined condition between $t_i[A_k]$ and $t_j[A_k]$, and $\phi(t_i, t_j)$ denotes a predefined condition between $t_i$ and $t_j$; $\mathrm{Stat}(A_k)$ indicates that when $t_i$ and $t_j$ satisfy conditions $P$ and $\phi$, the quality of $t_i$ is higher than that of $t_j$;
An interaction predicate is $\mathrm{Inter}_\delta(A_1, \ldots, A_l)$, indicating that when a record satisfies condition $\delta$, the values of its attributes $A_1, \ldots, A_l$ are of poor quality;
(2) Mine predicates from the data set to be cleaned using the predicate model defined in step (1), obtaining the priority predicates, state predicates and interaction predicates of the data set;
(3) Derive the attribute value confidence of each record in the data set from the mined predicates, comprising the steps:
(3-1) Initialize the confidence of every attribute value of every record in the data set to 0, and set an impact factor η for each attribute value of each record, where η is a constant;
(3-2) Update the confidence of each attribute value of each record with the state predicates and the interaction predicates; the state predicates may be applied before the interaction predicates, or the interaction predicates before the state predicates;
To update with a state predicate: enumerate the records of the data set in pairs $t_i$ and $t_j$; if $t_i$ and $t_j$ satisfy a state predicate on attribute $A_k$, subtract η from the confidence of the attribute value $t_j[A_k]$;
To update with an interaction predicate: traverse all records in the data set; if a record satisfies some interaction predicate $\mathrm{Inter}_\delta(A_1, \ldots, A_l)$, subtract η from the confidence of the values of its attributes $A_1, \ldots, A_l$;
(3-3) After step (3-2) is complete, update the attribute value confidence of each record with the priority predicates, executing them in order of priority from high to low;
To execute a priority predicate $\mathrm{Prior}(A_i, A_j)$: if several records have the same confidence for their values of attribute $A_j$, sort them in ascending order of the confidence of their $A_i$ values and, following that order, add $n-1$ to the confidence of the $A_j$ value of the record ranked $n$-th;
(3-4) Once the confidence of all attribute values has been obtained: for a multi-valued attribute, return all attribute values whose confidence is greater than or equal to a preset threshold as the result; for an attribute that needs only one result, execute steps (4) to (6);
(4) Normalize the confidence of all attribute values; compute the confidence of every data source in the data set to be cleaned according to $\lambda_i = \frac{1}{|D_i|} \sum_{t \in D_i} d(t)$, where $\lambda_i$ is the confidence of data source $D_i$, $t$ is a record in $D_i$, $|D_i|$ is the number of records in $D_i$, and $d(t)$ is the confidence of record $t$, equal to the sum of the confidences of all its attribute values;
(5) Update the confidence of each attribute value according to $c(t[A_j]) \leftarrow c(t[A_j]) \cdot \sum_{D_k \in D'} \lambda_k$, where $c(t[A_j])$ is the confidence of the attribute value $t[A_j]$ and $D'$ is the set of data sources that provide the value $t[A_j]$ for attribute $A_j$; return to step (4) after the update;
(6) Repeat steps (4) and (5) until the confidence of all attribute values converges; for an attribute that needs only one result, the attribute value with the highest confidence under that attribute is the final result.
2. a kind of multi-source data collection cleaning method based on predicate according to claim 1, which is characterized in that described preferential
Grade predicate definition method be:For attribute AiAnd AjIf meeting pscore(Ai) < pscore(Aj), then define priority predicate
Prior(Ai, Aj), indicate attribute AiPriority pscore(Ai) it is higher than attribute AjPriority pscore(Aj);Wherein, H (Ai) indicate attribute AiShannon entropy, pn(Ai) indicate attribute AiAll properties value
The ratio of middle null values.
3. The predicate-based multi-source data set cleaning method according to claim 2, characterized in that the state
predicates and the interaction predicates are obtained by a first-order-logic predicate mining method.
4. The predicate-based multi-source data set cleaning method according to claim 3, characterized in that, before the data
set is cleaned, all attributes of all data sets are labeled manually to mark whether each attribute needs to return one
result or multiple results; if an attribute needs to return only one result, it is marked as a single-valued attribute, and during cleaning
the attribute value with the highest confidence under that attribute is taken as the final result; if an attribute may have multiple results, it is marked as a
multi-valued attribute, and during cleaning all attribute values under that attribute whose confidence is greater than the preset threshold are taken as the final result.
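The selection rule for single-valued and multi-valued attributes can be illustrated with a short sketch. The function name `select_results`, its parameters and the example threshold are hypothetical. The comparison uses "greater than or equal to", following the wording of step (3-4) in claim 1; claim 4 itself says "greater than", so the boundary case is an assumption.

```python
def select_results(value_conf, single_valued, threshold):
    """Pick the final result(s) for one attribute of one entity.

    value_conf: dict mapping each candidate attribute value to its confidence.
    single_valued: True if the attribute was manually marked single-valued.
    threshold: preset confidence threshold, used for multi-valued attributes.
    """
    if not value_conf:
        return []
    if single_valued:
        # Single-valued attribute: keep only the most confident value.
        return [max(value_conf, key=value_conf.get)]
    # Multi-valued attribute: keep every value that reaches the threshold
    # (ASSUMPTION: inclusive comparison, as in step (3-4) of claim 1).
    return [value for value, conf in value_conf.items() if conf >= threshold]
```

For example, with confidences `{"a": 0.7, "b": 0.2, "c": 0.5}` and a threshold of 0.5, a multi-valued attribute yields `["a", "c"]`, while a single-valued attribute yields `["a"]`.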
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810578708.3A CN108776697B (en) | 2018-06-06 | 2018-06-06 | Multi-source data set cleaning method based on predicates |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810578708.3A CN108776697B (en) | 2018-06-06 | 2018-06-06 | Multi-source data set cleaning method based on predicates |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108776697A (en) | 2018-11-09 |
CN108776697B (en) | 2020-06-09 |
Family
ID=64024668
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810578708.3A Active CN108776697B (en) | 2018-06-06 | 2018-06-06 | Multi-source data set cleaning method based on predicates |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108776697B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109582906A (en) * | 2018-11-30 | 2019-04-05 | 北京锐安科技有限公司 | Determination method, apparatus, equipment and the storage medium of data reliability |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1811772A (en) * | 2005-01-25 | 2006-08-02 | 翁托普里塞有限公司 | Integration platform for heterogeneous information sources |
US20090327255A1 (en) * | 2008-06-26 | 2009-12-31 | Microsoft Corporation | View matching of materialized xml views |
CN105045807A (en) * | 2015-06-04 | 2015-11-11 | 浙江力石科技股份有限公司 | Data cleaning algorithm based on Internet trading information |
CN105279232A (en) * | 2015-09-22 | 2016-01-27 | 武汉开目信息技术有限责任公司 | Method for showing screening and classification of data set in PDM (Product Data Management) system |
CN105608228A (en) * | 2016-01-29 | 2016-05-25 | 中国科学院计算机网络信息中心 | High-efficiency distributed RDF data storage method |
Also Published As
Publication number | Publication date |
---|---|
CN108776697B (en) | 2020-06-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Karthikeyan et al. | A survey on association rule mining | |
Dror et al. | Replicability analysis for natural language processing: Testing significance with multiple datasets | |
Petry | Fuzzy databases: principles and applications | |
Korth et al. | System/U: a database system based on the universal relation assumption | |
Pernelle et al. | An automatic key discovery approach for data linking | |
Völker et al. | Automatic acquisition of class disjointness | |
San Martın et al. | SNQL: A social networks query and transformation language | |
Varga et al. | Conceptual design of document NoSQL database with formal concept analysis | |
CN107391542A (en) | A kind of open source software community expert recommendation method based on document knowledge collection of illustrative plates | |
Nguyen et al. | Efficient mining of class association rules with the itemset constraint | |
Anam et al. | Adapting a knowledge-based schema matching system for ontology mapping | |
CN103838804A (en) | Social network user interest association rule mining method based on community division | |
Hong et al. | Mining rules from an incomplete dataset with a high missing rate | |
Louhdi et al. | Transformation rules for building owl ontologies from relational databases | |
CN108776697A (en) | A kind of multi-source data collection cleaning method based on predicate | |
CN103294791A (en) | Extensible markup language pattern matching method | |
CN106547877B (en) | Data element Smart Logo analytic method based on 6W service logic model | |
Khan et al. | Logics for information systems and their dynamic extensions | |
Chen et al. | Join cardinality estimation by combining operator-level deep neural networks | |
Bogorny et al. | Semantic-based pruning of redundant and uninteresting frequent geographic patterns | |
Niepert et al. | Probabilistic-logical web data integration | |
Kern et al. | A framework for building logical schema and query decomposition in data warehouse federations | |
Xie et al. | Instance-driven ontology evolution mechanism towards enterprise data management | |
Liang et al. | Mining social ties beyond homophily | |
Jaleel et al. | Ontology construction from relational database |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||