CN108776697B - Multi-source data set cleaning method based on predicates - Google Patents


Info

Publication number
CN108776697B
Authority
CN
China
Prior art keywords
attribute
data
credibility
predicate
predicates
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810578708.3A
Other languages
Chinese (zh)
Other versions
CN108776697A (en)
Inventor
谢子哲
李论
刘奇志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN201810578708.3A priority Critical patent/CN108776697B/en
Publication of CN108776697A publication Critical patent/CN108776697A/en
Application granted granted Critical
Publication of CN108776697B publication Critical patent/CN108776697B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention provides a predicate-based multi-source data set cleaning method that can effectively identify the most reliable data items in multi-source data sets sharing the same structure, and relates to the fields of data cleaning, data fusion and the like. The method comprises the following steps: 1) mining predicates by an automatic method and filtering the mined predicates; 2) deriving the credibility of the attribute values of each entity in the data set according to the predicates; 3) establishing a relation between attribute value credibility and data source credibility, and calculating the data source credibility; 4) finding the data item with the highest credibility by combining the credibility of the data sources and the credibility of the attribute values. Given a plurality of data sources, the invention can analyze information that comes from different data sources but describes the same content, filter out redundant, erroneous and outdated data, and retain the data with the highest credibility. As a basis for subsequent data analysis and refinement, this is of great significance for the efficiency and accuracy of subsequent data processing.

Description

Multi-source data set cleaning method based on predicates
Technical Field
The invention relates to the fields of data cleaning, data fusion and the like, in particular to a multi-source data set cleaning method based on predicates.
Background
In the information age, data describing the same event or object can be found in a large number of data sources. Because of timing errors, format errors, and differences in accuracy, completeness, semantic ambiguity and the like, descriptions of the same entity from different data sources are often inconsistent. After data are collected from different data sources, resolving the inconsistency among the descriptions that belong to the same entity is of great importance for subsequent data analysis. A simple voting strategy, which selects the description supported by the most data sources, is not suitable for the current Web environment; more sophisticated cleaning strategies need to be designed that take into account the credibility of the data sources, the credibility of the data, and some prior knowledge. The existing cleaning strategies mainly include the following:
The application document of Chinese patent No. 201410387772 discloses a "system and method for processing bus traffic conditions based on traffic multi-source data fusion", which fuses traffic data describing bus traffic conditions from different data sources to obtain displayable traffic information. The input of that method is specific traffic data; credibility is not judged according to predicates, and the credibility of a data source is not calculated from the relation between the data and the data source.
The application document of Chinese patent No. 201110369877 discloses a multi-source data integration platform and a construction method thereof, which manages different data; those data do not have a consistency problem.
The application document of US 8190546 discloses "Dependency between sources in truth discovery", which evaluates the credibility of data sources and data by building a probabilistic graphical model of the copying relationships between data sources, and does not involve evaluating the credibility of data with predicates.
Disclosure of Invention
Purpose of the invention: at present, in the multi-source data consistency problem, the initial value of data credibility is difficult to determine, and it is unclear how to combine data source credibility with data credibility in multi-source data fusion. To solve these problems, the invention provides a multi-source data set cleaning method based on data source credibility and data credibility. The data are cleaned by using predicates to calculate the data credibility, then calculating the data source credibility from the data credibility, and finally finding the data with the highest credibility.
Technical scheme: to achieve the above technical effects, the invention provides a multi-source data set cleaning method based on predicates, which comprises the following steps:
(1) constructing a predicate model: defining a priority predicate, a state predicate and an interaction predicate, wherein:
the priority predicate is $\mathrm{Prior}(A_i, A_j)$, which indicates that the priority of attribute $A_i$ is higher than the priority of attribute $A_j$;
the state predicate is:
$$\mathrm{Stat}_{P,\phi}(A_k):\ \forall t_i, t_j\ \bigl(P(t_i[A_k], t_j[A_k]) \wedge \phi(t_i, t_j) \rightarrow t_i \succ_{A_k} t_j\bigr)$$
wherein $t_i$ denotes the $i$-th piece of data, $t_i[A_k]$ denotes the attribute value of attribute $A_k$ in $t_i$, $P(t_i[A_k], t_j[A_k])$ denotes a predefined condition on $t_i[A_k]$ and $t_j[A_k]$, and $\phi(t_i, t_j)$ denotes a predefined condition satisfied by $t_i$ and $t_j$; $\mathrm{Stat}(A_k)$ indicates that when $t_i$ and $t_j$ satisfy the conditions $P$ and $\phi$, the credibility of $t_i$ at attribute $A_k$ is higher than that of $t_j$;
the interaction predicate is $\mathrm{Inter}_{\delta}(A_1, \ldots, A_l)$, which indicates that when a piece of data satisfies the condition $\delta$, the attribute values of attributes $A_1, \ldots, A_l$ of that piece of data are of poor quality;
(2) carrying out predicate mining on the data set to be cleaned through the predicate model defined in the step (1) to obtain a priority predicate, a state predicate and an interaction predicate in the data set;
(3) deducing the attribute value credibility of each data in the data set according to the obtained predicate, comprising the following steps:
(3-1) initializing the credibility of all attribute values of the data in the data set to be 0, and setting an influence factor η as a constant for each attribute value of each piece of data;
(3-2) updating the credibility of each attribute value of each piece of data by using the state predicates and the interaction predicates, wherein during updating, the state predicates are firstly used for updating and then the interaction predicates are used for updating, or the interaction predicates are firstly used for updating and then the state predicates are used for updating;
the method for updating the credibility of each attribute value of the data by using the state predicates comprises: enumerating the data in the data set pairwise; if two pieces of data $t_i$ and $t_j$ satisfy, at attribute $A_k$, the state predicate
$$\mathrm{Stat}_{P,\phi}(A_k)$$
then η is subtracted from the credibility of the attribute value $t_j[A_k]$;
the method for updating the credibility of each attribute value of the data by using the interaction predicates comprises: traversing all data in the data set; if a piece of data satisfies an interaction predicate $\mathrm{Inter}_{\delta}(A_1, \ldots, A_l)$, then η is subtracted from the credibility of the attribute values of attributes $A_1, \ldots, A_l$ of that piece of data;
(3-3) after step (3-2) is finished, updating the attribute value credibility of each piece of data by using the priority predicates, the priority predicates being executed in order from high priority to low priority;
executing a priority predicate $\mathrm{Prior}(A_i, A_j)$ comprises: if multiple pieces of data have the same attribute value credibility at attribute $A_j$, sorting them in ascending order of the credibility of their attribute values at $A_i$, and adding $n-1$ to the credibility of the $A_j$ attribute value of the piece of data ranked $n$-th in that order;
(3-4) after the credibility of all the attribute values is obtained, returning all the attribute values with the credibility being more than or equal to the preset threshold value as results for the multi-value attribute; for the attribute which only needs to return one result, executing the steps (4) to (6);
(4) normalizing the credibility of all attribute values, and calculating the credibility of all data sources in the data set to be cleaned according to the formula
$$\lambda_i = \frac{1}{|D_i|} \sum_{t \in D_i} d(t)$$
wherein $\lambda_i$ denotes the credibility of data source $D_i$, $t$ denotes a piece of data in data source $D_i$, and $d(t)$ denotes the credibility of the data $t$, which is equal to the sum of the credibility of all attribute values of that data;
(5) updating the credibility of each attribute value according to the formula
$$c'(t[A_j]) = c(t[A_j]) \cdot \sum_{D_k \in D'} \lambda_k$$
wherein $c(t[A_j])$ denotes the credibility of the attribute value $t[A_j]$ of data $t$ at attribute $A_j$, and $D'$ denotes the set of data sources that provide the attribute value $t[A_j]$ for attribute $A_j$; returning to step (4) after updating;
(6) repeatedly executing the steps (4) to (5) until the credibility of all the attribute values is converged; and for the attribute which only needs to return one result, finding out the attribute value with the highest credibility under the attribute as the final result.
Further, the priority predicate is defined as follows: for attributes $A_i$ and $A_j$, if $p_{score}(A_i) < p_{score}(A_j)$, the priority predicate $\mathrm{Prior}(A_i, A_j)$ is defined, indicating that the priority of attribute $A_i$ is higher than the priority of attribute $A_j$;
[Formula image: definition of $p_{score}(A_i)$]
wherein $H(A_i)$ denotes the Shannon entropy of attribute $A_i$, and $p_n(A_i)$ denotes the proportion of null values among all attribute values of attribute $A_i$.
Further, the state predicate and the interaction predicate are obtained by a first-order logic predicate mining method.
Further, before cleaning the data set, manually marking all attributes of all the data sets, marking whether each attribute needs to return one result or a plurality of results, if one attribute only needs to return one result, marking the attribute as a single-value attribute, and taking the attribute value with the highest reliability under the attribute as a final result during cleaning; if a plurality of results may exist in one attribute, marking the attribute as a multi-value attribute, and taking all attribute values with the credibility of the attribute larger than a preset threshold value as final results during cleaning.
Advantageous effects: compared with the prior art, the invention has the following advantages:
Attribute values with high credibility are found by using automatically mined predicates and the relation between data sources and attribute values, without assuming that an attribute has only one correct value and without relying on crowdsourcing or extensive manual intervention. The method scores the credibility of attribute values with the mined custom predicates. For multi-answer attributes, the attribute values whose credibility is not lower than a preset threshold are returned as results. For the remaining attributes, the credibility of the attribute values is further updated by combining the relation between data source credibility and attribute value credibility, and the attribute value with the highest credibility is returned as the result. This is of great significance for improving the efficiency and accuracy of data analysis. With the technical scheme of the invention, engineers can readily implement the related software.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of a process for updating the confidence level of attribute values in a data source according to the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings.
Fig. 1 shows a flow chart of the present invention, which mainly includes the following parts:
a) First, three types of predicates are defined:
1) Priority predicate: for attributes $A_i$ and $A_j$, if $p_{score}(A_i) < p_{score}(A_j)$, the priority predicate $\mathrm{Prior}(A_i, A_j)$ is defined, indicating that the priority of attribute $A_i$ is higher than the priority of attribute $A_j$;
[Formula image: definition of $p_{score}(A_i)$]
wherein $H(A_i)$ denotes the Shannon entropy of attribute $A_i$, and $p_n(A_i)$ denotes the proportion of null values among all attribute values of attribute $A_i$.
$H(A_i)$ is calculated as
$$H(A_i) = -\sum_{x \in X} p(x) \log_2 p(x)$$
where $X$ is the value range of attribute $A_i$ and $p(x)$ denotes the proportion of attribute value $x$ among all attribute values (excluding null values).
2) State predicate: the state predicate is a first-order logic predicate of the form
$$\mathrm{Stat}_{P,\phi}(A_k):\ \forall t_i, t_j\ \bigl(P(t_i[A_k], t_j[A_k]) \wedge \phi(t_i, t_j) \rightarrow t_i \succ_{A_k} t_j\bigr)$$
which indicates that if $t_i$ and $t_j$ satisfy the conditions $P$ and $\phi$, then the credibility of $t_i$ at attribute $A_k$ is higher than that of $t_j$.
The condition $\phi$ in the above definition is given by
[Formula image: definition of $\phi(t_i, t_j)$ in terms of conditions $f_i$]
where each $f_i(v_1, v_2)$ can be instantiated as $v_1 = v_2$ or $v_1 \neq v_2$.
$P$ in the state predicate definition can be instantiated by one of 6 predefined predicates, $P_1(v_1, v_2)$ to $P_6(v_1, v_2)$:
$P_1$ and $P_2$ apply to numeric attribute values: $P_1(v_1, v_2)$ means $v_1$ is larger than $v_2$, and $P_2(v_1, v_2)$ means $v_1$ is smaller than $v_2$;
$P_3$ and $P_4$ apply to character-type attribute values: $P_3(v_1, v_2)$ means $v_1$ is longer than $v_2$, and $P_4(v_1, v_2)$ means $v_1$ is shorter than $v_2$;
$P_5$ and $P_6$ apply to character-type attribute values and compare how detailed the strings are, that is, how much information they contain, measured by comparing the information content of the two strings with the Shannon entropy formula: $P_5(v_1, v_2)$ means $v_1$ is more detailed than $v_2$, and $P_6(v_1, v_2)$ means $v_1$ is more brief than $v_2$.
3) Interaction predicate: the interaction predicate is a first-order logic predicate of the form $\mathrm{Inter}_{\delta}(A_1, \ldots, A_l)$, which indicates that when a piece of data satisfies the condition $\delta$, the attribute values of $A_1, \ldots, A_l$ of that piece of data are of poor quality.
In the above definition,
[Formula image: definition of the condition $\delta$ in terms of predicates $P_i'$]
wherein each $P_i'$ can be instantiated by any of the predicates $P_1$ to $P_6$, and also by the following 4 predicates, $P_7(v_1, v_2)$ to $P_{10}(v_1, v_2)$:
$P_7$ and $P_8$ apply to character-type attribute values: $P_7(v_1, v_2)$ means $v_1$ contains $v_2$, and $P_8(v_1, v_2)$ means $v_1$ does not contain $v_2$;
$P_9$ and $P_{10}$ apply to both character-type and numeric attribute values: $P_9(v_1, v_2)$ means $v_1$ is equal to $v_2$, and $P_{10}(v_1, v_2)$ means $v_1$ is not equal to $v_2$.
Predicate mining is then carried out on the data set. For the priority predicates, the priority of every attribute of the data set is calculated with the formula
[Formula image: definition of $p_{score}(A_i)$]
and the priority predicates follow from the resulting priorities. The state predicates and interaction predicates are obtained automatically by a first-order inductive learning method according to the definitions of the first-order logic predicates. After the state predicates and interaction predicates are obtained, a domain expert may be asked to remove invalid predicates in order to further improve their usability, which yields the final usable predicates;
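The ten comparison predicates $P_1$ to $P_{10}$ described above can be sketched as plain Python functions. This is only an illustrative sketch: the dictionary name `PREDICATES` and the helper `char_entropy` are hypothetical, and measuring "information content" for $P_5$ and $P_6$ by character-level Shannon entropy is an assumption, since the text only says that the Shannon entropy formula is used.

```python
import math
from collections import Counter


def char_entropy(s: str) -> float:
    """Character-level Shannon entropy of a string in bits (0.0 if empty).

    ASSUMPTION: the source does not say how the entropy of a string is
    computed; character frequencies are used here for illustration.
    """
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


# Each predicate takes two attribute values (v1, v2) and returns a bool.
PREDICATES = {
    "P1": lambda v1, v2: v1 > v2,                              # numeric: v1 larger
    "P2": lambda v1, v2: v1 < v2,                              # numeric: v1 smaller
    "P3": lambda v1, v2: len(v1) > len(v2),                    # string: v1 longer
    "P4": lambda v1, v2: len(v1) < len(v2),                    # string: v1 shorter
    "P5": lambda v1, v2: char_entropy(v1) > char_entropy(v2),  # v1 more detailed
    "P6": lambda v1, v2: char_entropy(v1) < char_entropy(v2),  # v1 more brief
    "P7": lambda v1, v2: v2 in v1,                             # v1 contains v2
    "P8": lambda v1, v2: v2 not in v1,                         # v1 does not contain v2
    "P9": lambda v1, v2: v1 == v2,                             # equal
    "P10": lambda v1, v2: v1 != v2,                            # not equal
}
```

A mined state or interaction predicate would then reference these building blocks by name, for example `PREDICATES["P1"](142, 120)` for a numeric comparison.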
b) First, the credibility of all attribute values is initialized to 0 and the parameter η (the influence factor) is manually set to a real number; the three types of predicates are then executed in the following order to derive the credibility of the attribute values of all entities in the data set:
1) Applying state predicates: the data in the data set are enumerated pairwise; if two pieces of data $t_i$ and $t_j$ satisfy a state predicate
$$\mathrm{Stat}_{P,\phi}(A_k)$$
then η is subtracted from the credibility of the attribute value of $t_j$ at attribute $A_k$.
2) Applying interaction predicates: all data in the data set are traversed; if a piece of data satisfies an interaction predicate $\mathrm{Inter}_{\delta}(A_1, \ldots, A_l)$, then η is subtracted from the credibility of the attribute values of $A_1, \ldots, A_l$ of that piece of data.
3) Applying priority predicates: since the effect of a priority predicate depends on the current credibility of the attribute values, the priority predicates with high priority, that is, those whose two attributes have the smaller $p_{score}$ values, are executed first so that the credibilities in use are up to date. The meaning of the priority predicate $\mathrm{Prior}(A_i, A_j)$ is that if two pieces of data $t_1$ and $t_2$ have equal credibility for their attribute values at $A_j$, then the credibility of their attribute values at $A_i$ can be used to judge which of the two $A_j$ values is better. The method is as follows: all pieces of data whose $A_j$ credibility is the same are sorted in ascending order of the credibility of their $A_i$ attribute values, and $n-1$ is added to the credibility of the $A_j$ attribute value of the piece of data ranked $n$-th in that order. In this way, $A_j$ values with the same credibility are distinguished according to the higher-priority attribute $A_i$. Note also that for attributes that require multiple results to be returned, priority predicates need not be applied to attribute values whose credibility is negative.
And after the credibility of all the attribute values is obtained, returning all the attribute values with the credibility being more than or equal to the preset threshold value as the result for the attribute needing to return a plurality of results, and continuing the following steps for the attribute needing to return a result.
c) The credibility of each data source is calculated according to
$$\lambda_i = \frac{1}{|D_i|} \sum_{t \in D_i} d(t)$$
and the credibilities of all data sources are normalized so that $\sum_i \lambda_i = 1$, as shown in Fig. 2. The credibility of each attribute value is then updated using
$$c'(t[A_j]) = c(t[A_j]) \cdot \sum_{D_k \in D'} \lambda_k$$
where $c(t[A_j])$ is the credibility of the value of attribute $A_j$ in data $t$ and $D'$ is the set of data sources providing that value. In other words, the updated credibility of an attribute value equals the sum of the credibilities of the data sources providing that value multiplied by the original credibility of the value. For a null value, only the credibility of the data source that the null value itself comes from is used; the credibilities of other data sources that also provide a null value are not included. The credibilities of all attribute values are then normalized as well, so that for any one attribute the credibilities of all its candidate values sum to 1. These steps are repeated until the credibility of the data sources and the credibility of every attribute value converge.
d) And finally, for the attribute which only needs to return one result, finding out the attribute value with the highest reliability under the attribute as a final result, and combining the result in the step b) as the final result.
The embodiments of the present invention will be described below with reference to specific examples:
Let the credibility of a data source $D_i$ be $\lambda_i$, and let the attribute set of the data sources be $\{A_1, \ldots, A_n\}$. Let $t \in D_i$ be a piece of data of the data source, where $t[A_j]$ denotes the attribute value of $t$ for attribute $A_j$. Let $d(t)$ denote the credibility of the piece of data, and let $c(t[A_j])$ denote the credibility of the attribute value $t[A_j]$. The credibility of a piece of data is equal to the sum of the credibilities of all its attribute values, namely:
$$d(t) = \sum_{j=1}^{n} c(t[A_j])$$
The credibility of a data source is equal to the average credibility of all the data it contains, namely the sum of the credibilities of all its data divided by the number of data:
$$\lambda_i = \frac{1}{|D_i|} \sum_{t \in D_i} d(t)$$
Meanwhile, let $D'$ be the set of data sources that provide the attribute value $t[A_j]$ for attribute $A_j$. The updated credibility of the attribute value $t[A_j]$ is the sum of the credibilities of all data sources providing the value multiplied by its own original credibility:
$$c'(t[A_j]) = c(t[A_j]) \cdot \sum_{D_k \in D'} \lambda_k$$
Example: the data set to be cleaned is shown in the following table. It contains 5 pieces of data from 5 data sources, where $t_i$ comes from data source $D_i$ and describes a researcher named Mary.
Data set to be cleaned
[Table image: the original data set with records t1 to t5]
First, the data set is inspected briefly and preprocessed manually to eliminate obviously unreasonable data, so that the subsequent data cleaning is more efficient and more effective.
For example, in the data set above the Salary value of $t_5$ is negative, which is obviously unreasonable, and the values of its three attributes Research Area, Affiliation and Publication carry no useful information, so $t_5$ can be deleted directly as noise and excluded from the subsequent cleaning operation.
Next, consider $t_4$: its Publication attribute has the value "-", which is also unreasonable. However, because the values of the other attributes of $t_4$ are still of reference value, the value of Publication can simply be changed to "null".
After this simple preprocessing, the resulting data set is as follows:
[Table image: the preprocessed data set with records t1 to t4]
First, predicate mining is carried out on the data set.
Mining priority predicates
For the priority predicates, the entropy of every attribute and its proportion of null values are computed.
The entropy is calculated as
$$H(A_i) = -\sum_{x \in X} p(x) \log_2 p(x)$$
where $p(x)$ denotes the proportion of attribute value $x$ among all attribute values (null values are not included).
Taking Salary as an example, there are 3 attribute values: 142k, 120k and 88k, where
$$p(142\text{k}) = \tfrac{1}{4}, \quad p(120\text{k}) = \tfrac{1}{4}, \quad p(88\text{k}) = \tfrac{1}{2}$$
so the entropy of the attribute Salary is:
$$H(\text{Salary}) = -\left(\tfrac{1}{4}\log_2\tfrac{1}{4} + \tfrac{1}{4}\log_2\tfrac{1}{4} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1.5$$
Similarly:
[Formula images: entropies of the Research Area, Affiliation and Publication attributes]
$p_n(\text{Salary}) = p_n(\text{Research Area}) = 0$
[Formula images: null-value proportions of the Affiliation and Publication attributes]
From these values the following is obtained:
[Formula images: $p_{score}$ values of the attributes]
Three priority predicates can be defined according to the less-than relation:
[Formula images: the three priority predicates]
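The entropy and null-ratio statistics used for priority mining can be sketched as follows. The column contents are assumptions: the data table is not reproduced in this text, so the Salary column is taken to be 142k, 120k, 88k, 88k (three distinct values over four records) and the Publication column uses hypothetical labels with two nulls. The function `p_score` is also an assumption, because the defining formula is missing; the form $H/(1-p_n)$ is used here only because, under the assumed columns, it is consistent with the $p_{score}$ sums 3.5, 3.5 and 4.11 quoted later in the example.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (bits) of the non-null values of one attribute column."""
    non_null = [v for v in values if v is not None]
    n = len(non_null)
    return -sum((c / n) * math.log2(c / n) for c in Counter(non_null).values())


def null_ratio(values):
    """Proportion of null values among all values of the attribute."""
    return sum(v is None for v in values) / len(values)


def p_score(values):
    """ASSUMED form of p_score: entropy divided by the non-null proportion.

    The real formula is an unrecovered image; a smaller score means a
    higher priority, as stated in the text.
    """
    return shannon_entropy(values) / (1.0 - null_ratio(values))


# Assumed column contents (the original table is not available).
salary = ["142k", "120k", "88k", "88k"]
publication = ["pub-1", None, "pub-3", None]  # hypothetical labels

h_salary = shannon_entropy(salary)            # 1.5 bits, as in the example
score_publication = p_score(publication)      # 1.0 / 0.5 under the assumption
```

Under these assumptions Salary receives the smallest score and therefore the highest priority, which matches its role as the higher-priority attribute in the worked example.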
Mining state predicates:
These are obtained automatically by the first-order logic predicate mining algorithm First Order Inductive Learner (FOIL):
[Formula image: the mined state predicate]
Mining interaction predicates:
These are likewise obtained automatically by the First Order Inductive Learner (FOIL):
[Formula images: the two mined interaction predicates]
Second, the credibility of the attribute values of all entities is derived.
The credibility of all attribute values is initialized to 0, the influence factor η is set to 1, and the credibility threshold for attributes that need to return multiple results is set to 0. The corresponding predicates are then applied in a fixed order (different predicate execution orders may produce different results).
The state predicates and the interaction predicates both act on the attribute values themselves, which do not change, so state predicates are independent of one another, as are interaction predicates, and they can be invoked in any order.
The priority predicates, however, operate on the credibility of the attribute values, so they must be applied after all state predicates and interaction predicates.
In addition, a fixed order must be observed among the priority predicates themselves. To keep the credibility of the current attribute values up to date, the priority predicate with the highest priority, namely the one whose two attributes have the smallest sum of $p_{score}$ values, needs to be executed first. For the priority predicates of the data set being cleaned, the $p_{score}$ sum of
[Formula image: first priority predicate]
is 3.5, the $p_{score}$ sum of
[Formula image: second priority predicate]
is also 3.5, and the $p_{score}$ sum of
[Formula image: third priority predicate]
is 4.11, so the priority predicates should be executed in the order
[Formula image: execution order of the three priority predicates]
After the state predicates are executed, the credibility of all attribute values is as shown in Table 1. There are 4 pieces of data, which are compared pairwise, for 16 comparisons in total. Taking $t_1$ and $t_2$ as an example, according to the state predicate
[Formula image: the mined state predicate]
and because
[Formula image: the condition satisfied by $t_1$ and $t_2$]
1 is subtracted from the credibility of the Salary attribute value of $t_2$.
TABLE 1
         Salary  Research Area  Affiliation  Publication
t1 (D1)  0       0              0            0
t2 (D2)  -1      0              0            0
t3 (D3)  -2      0              0            0
t4 (D4)  -2      0              0            0
After the interaction predicates are executed, the credibility of all attribute values is as shown in Table 2. Here, according to the interaction predicates
[Formula images: the two mined interaction predicates]
the credibility of every Affiliation and Publication attribute value that is null is reduced by 1.
TABLE 2
         Salary  Research Area  Affiliation  Publication
t1 (D1)  0       0              0            0
t2 (D2)  -1      0              0            -1
t3 (D3)  -2      0              0            0
t4 (D4)  -2      0              -1           -1
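The state and interaction updates that lead to Tables 1 and 2 can be sketched as below. Everything about the records is an assumption for illustration: the original data table and the mined predicates are not reproduced in this text, so the sketch uses hypothetical attribute values (only 142k, 120k, 88k and Amazon appear in the text), assumes the mined state predicate is "a larger Salary is more credible" (predicate $P_1$ on Salary), and assumes the two interaction predicates flag null Affiliation and null Publication values. The Research Area column is omitted because no predicate touches it at this stage.

```python
ETA = 1  # influence factor eta, set to 1 as in the worked example

# Hypothetical stand-ins for the four preprocessed records t1..t4.
records = [
    {"Salary": 142, "Affiliation": "Amazon", "Publication": "pub-1"},
    {"Salary": 120, "Affiliation": "aff-2", "Publication": None},
    {"Salary": 88, "Affiliation": "aff-3", "Publication": "pub-3"},
    {"Salary": 88, "Affiliation": None, "Publication": None},
]

# Step (3-1): every attribute value starts with credibility 0.
cred = [{attr: 0 for attr in rec} for rec in records]


def apply_state(records, cred, attr, p):
    """Enumerate all ordered pairs (t_i, t_j); when p(t_i[attr], t_j[attr])
    holds, t_i is judged more credible and eta is subtracted from t_j[attr]."""
    for ti in records:
        for j, tj in enumerate(records):
            if p(ti[attr], tj[attr]):
                cred[j][attr] -= ETA


def apply_interaction(records, cred, delta, attrs):
    """Subtract eta from the listed attributes of every record satisfying delta."""
    for k, t in enumerate(records):
        if delta(t):
            for a in attrs:
                cred[k][a] -= ETA


# Assumed mined predicates (the originals are unrecovered images).
apply_state(records, cred, "Salary", lambda v1, v2: v1 > v2)
apply_interaction(records, cred, lambda t: t["Affiliation"] is None, ["Affiliation"])
apply_interaction(records, cred, lambda t: t["Publication"] is None, ["Publication"])

salary_cred = [c["Salary"] for c in cred]
affiliation_cred = [c["Affiliation"] for c in cred]
publication_cred = [c["Publication"] for c in cred]
```

With these assumed inputs the three columns come out as in Table 2; because the state and interaction predicates read only the attribute values, swapping the order of the three calls gives the same result.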
After the priority predicates are executed, the credibility of all attribute values is as shown in Table 3. Taking Research Area as an example, the Research Area column initially has the credibilities {0, 0, 0, 0}. Under the priority predicate
[Formula image: priority predicate with Research Area as the lower-priority attribute]
the Research Area column can be updated according to the Salary column {0, -1, -2, -2}. The entries with equal credibility in the Research Area column are sorted in ascending order of the Salary column, 0, 1 and 2 are added according to the sorted order (entries tied in the Salary column receive the same increment), and restoring the original order gives {2, 1, 0, 0}. The priority predicates
[Formula images: the remaining two priority predicates]
are then executed in the same way.
TABLE 3
         Salary  Research Area  Affiliation  Publication
t1 (D1)  0       2              2            1
t2 (D2)  -1      1              1            -1
t3 (D3)  -2      0              0            0
t4 (D4)  -2      0              -1           -1
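The priority-predicate step that turns Table 2 into Table 3 can be sketched as follows. Two points are assumptions inferred from the tables rather than stated in the text: records that tie on the higher-priority attribute share the same rank (otherwise Research Area would not come out as {2, 1, 0, 0}), and Salary is taken as the higher-priority attribute in all three predicates because the predicates themselves are unrecovered images. The function name `apply_prior` is hypothetical.

```python
def apply_prior(cred_i, cred_j, multi_valued=False):
    """Apply Prior(A_i, A_j): break ties in the A_j credibilities using A_i.

    cred_i, cred_j: credibility lists for attributes A_i and A_j, one entry
    per record. Records whose A_j credibility is equal are ranked in ascending
    order of their A_i credibility; the record at rank n gets n - 1 added.
    ASSUMPTION: equal A_i credibilities share a rank (dense ranking).
    For multi-valued attributes, negative credibilities are left untouched.
    """
    result = list(cred_j)
    groups = {}
    for idx, c in enumerate(cred_j):
        groups.setdefault(c, []).append(idx)
    for c, members in groups.items():
        if len(members) < 2:
            continue  # nothing to disambiguate
        if multi_valued and c < 0:
            continue  # rule stated in the text for multi-result attributes
        levels = sorted({cred_i[m] for m in members})
        for m in members:
            result[m] = c + levels.index(cred_i[m])
    return result


salary = [0, -1, -2, -2]  # Salary column of Table 2
research_area = apply_prior(salary, [0, 0, 0, 0], multi_valued=True)
affiliation = apply_prior(salary, [0, 0, 0, -1])
publication = apply_prior(salary, [0, -1, 0, -1], multi_valued=True)
```

Each call works on a snapshot of the lower-priority column, so the grouping is not disturbed by increments made earlier in the same call.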
All attributes are then labelled. A researcher can have only one salary and one affiliation at a time, so Salary and Affiliation each return only one result; a researcher can, however, have more than one research area and more than one publication, so the Research Area and Publication attributes should return multiple result values. For the multi-value attributes, all attribute values whose credibility is greater than or equal to the threshold 0 are returned at this point: for Research Area the returned result is { Data integration, Data clarification & Google Knowledge management information retrieval }, and for Publication the returned result is { Data integration, adaptive tool for Data errors }.
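The selection rule for single-value and multi-value attributes can be sketched as a small helper. The function name `select_results` and the placeholder labels are hypothetical, and skipping null values in the multi-value case is an assumption (in the example the null Publication values have negative credibility and are excluded by the threshold anyway). For a single-value attribute the credibilities passed in would be those obtained after the iterative update of steps (4) to (6).

```python
def select_results(values, cred, multi_valued, threshold=0):
    """Return the result list for one attribute.

    multi_valued=True : every distinct non-null value whose credibility is
                        greater than or equal to the threshold.
    multi_valued=False: the single value with the highest credibility.
    """
    if multi_valued:
        results = []
        for v, c in zip(values, cred):
            if v is not None and c >= threshold and v not in results:
                results.append(v)
        return results
    best = max(range(len(values)), key=lambda i: cred[i])
    return [values[best]]


# Publication column of Table 3 with hypothetical value labels.
pubs = select_results(["pub-1", None, "pub-3", None], [1, -1, 0, -1], True)
# Salary column with the converged credibilities of Table 5.
top_salary = select_results(["142k", "120k", "88k", "88k"], [1, 0, 0, 0], False)
```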
Third, the credibility of the data sources is calculated.
The credibility of every attribute value is first mapped into (0, 1) by
$$f(x) = \frac{1}{1 + e^{-x}}$$
and normalized per attribute; the results are shown in Table 4. The credibility of every data source is then calculated according to
$$\lambda_i = \frac{1}{|D_i|} \sum_{t \in D_i} d(t)$$
and normalized.
TABLE 4
         Salary    Research Area  Affiliation  Publication  λ
t1 (D1)  0.496353  0.33723        0.369959     0.413275     0.404204
t2 (D2)  0.26698   0.2799         0.307065     0.152035     0.251495
t3 (D3)  0.118333  0.191435       0.210014     0.282655     0.200609
t4 (D4)  0.118333  0.191435       0.112963     0.152035     0.143692
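The step from Table 3 to Table 4 can be reproduced with the short sketch below. The mapping function is an image in the original; the logistic sigmoid is assumed here because, followed by per-column normalization, it reproduces the Table 4 figures to the printed precision. Each data source holds exactly one record in this example, so the average in the source-credibility formula reduces to the record's own credibility $d(t)$, which is then normalized so that the $\lambda_i$ sum to 1.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real credibility into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def normalise_column(col):
    """Map raw credibilities through the sigmoid and rescale to sum to 1."""
    mapped = [sigmoid(c) for c in col]
    total = sum(mapped)
    return [m / total for m in mapped]


# Raw credibilities after all predicates (Table 3), one list per attribute.
table3 = {
    "Salary": [0, -1, -2, -2],
    "Research Area": [2, 1, 0, 0],
    "Affiliation": [2, 1, 0, -1],
    "Publication": [1, -1, 0, -1],
}

table4 = {attr: normalise_column(col) for attr, col in table3.items()}

# d(t): sum of the attribute-value credibilities of each record.
d = [sum(table4[attr][i] for attr in table4) for i in range(4)]
# One record per source, so lambda_i is d(t_i), normalised to sum to 1.
lam = [x / sum(d) for x in d]
```

Because every column already sums to 1, the normalized source credibilities are simply the row averages of Table 4.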
Finally, the credibility of all attribute values is updated by iterating the two formulas
$$\lambda_i = \frac{1}{|D_i|} \sum_{t \in D_i} d(t)$$
and
$$c'(t[A_j]) = c(t[A_j]) \cdot \sum_{D_k \in D'} \lambda_k$$
until the credibility of all attribute values converges. Each time the credibilities of a column are updated, the column must be normalized. Taking the first update as an example, for the attribute value credibilities of the Salary column:
{0.496353, 0.26698, 0.118333, 0.118333}
→ {0.496353×0.404204, 0.26698×0.251495, 0.118333×0.344301, 0.118333×0.344301}
→ {0.200628, 0.0671441, 0.0407422, 0.0407422}
→ {0.5744, 0.1922, 0.1167, 0.1167} (after normalization, rounded to four decimal places)
Here 0.344301 = 0.200609 + 0.143692, because $t_3$ and $t_4$ provide the same Salary value.
Similarly, for the attribute value credibilities of the Research Area column:
{0.33723, 0.2799, 0.191435, 0.191435} → {0.500009, 0.258216, 0.140871, 0.100903}
For the attribute value credibilities of the Affiliation column:
{0.369959, 0.307065, 0.210014, 0.112963} → {0.404204, 0.251495, 0.200609, 0.143692}
For the attribute value credibilities of the Publication column:
{0.413275, 0.152035, 0.282655, 0.152035} → {0.588542, 0.134713, 0.199777, 0.0769686}
Finally, the credibility of the data sources is updated:
λ = {0.5168, 0.209168, 0.164478, 0.109554}
The above process is repeated until convergence; the final results are shown in Table 5.
TABLE 5
         Salary  Research Area  Affiliation  Publication  λ
t1 (D1)  1       1              1            1            1
t2 (D2)  0       0              0            0            0
t3 (D3)  0       0              0            0            0
t4 (D4)  0       0              0            0            0
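The iteration between source credibility and attribute value credibility can be sketched as below. The attribute values are hypothetical stand-ins (only 142k, 120k, 88k and Amazon appear in the text), chosen so that $t_3$ and $t_4$ share a Salary value and the nulls match the tables; the function names, tolerance and iteration cap are likewise assumptions. Null values are supported only by their own source, as described for the update formula.

```python
def update_column(values, cred, lam):
    """One update of a column: multiply each credibility by the summed
    credibility of the sources providing the same value, then normalise.
    A null value (None) is supported only by its own source."""
    support = {}
    for v, l in zip(values, lam):
        if v is not None:
            support[v] = support.get(v, 0.0) + l
    raw = [c * (l if v is None else support[v])
           for v, c, l in zip(values, cred, lam)]
    total = sum(raw)
    return [r / total for r in raw]


def source_credibility(columns):
    """lambda_i for one-record sources: row sums d(t), normalised to sum to 1."""
    d = [sum(col[i] for col in columns) for i in range(len(columns[0]))]
    total = sum(d)
    return [x / total for x in d]


def iterate(values, cred, tol=1e-9, max_iter=1000):
    """Alternate the two updates until no credibility moves by more than tol."""
    cred = {a: list(c) for a, c in cred.items()}
    lam = source_credibility(list(cred.values()))
    for _ in range(max_iter):
        new = {a: update_column(values[a], cred[a], lam) for a in cred}
        delta = max(abs(x - y) for a in cred for x, y in zip(new[a], cred[a]))
        cred, lam = new, source_credibility(list(new.values()))
        if delta < tol:
            break
    return cred, lam


values = {  # hypothetical labels except 142/120/88 and "Amazon"
    "Salary": [142, 120, 88, 88],
    "Research Area": ["ra-1", "ra-2", "ra-3", "ra-4"],
    "Affiliation": ["Amazon", "aff-2", "aff-3", None],
    "Publication": ["pub-1", None, "pub-3", None],
}
table4 = {
    "Salary": [0.496353, 0.26698, 0.118333, 0.118333],
    "Research Area": [0.33723, 0.2799, 0.191435, 0.191435],
    "Affiliation": [0.369959, 0.307065, 0.210014, 0.112963],
    "Publication": [0.413275, 0.152035, 0.282655, 0.152035],
}

lam0 = source_credibility(list(table4.values()))
salary_step = update_column(values["Salary"], table4["Salary"], lam0)
final_cred, final_lam = iterate(values, table4)
```

In this sketch the first Salary update agrees with the worked figures, and the iteration drives all credibility towards $t_1$ and $D_1$, in line with Table 5.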
Fourth, the result is obtained.
According to Table 5, the best attribute value for the Salary and Affiliation attributes, that is, the attribute value with the highest credibility, is selected as the result. The Salary result is {142k} and the Affiliation result is {Amazon}.
The above description covers only the preferred embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and adaptations without departing from the principles of the invention, and these are also intended to fall within the scope of protection of the invention.

Claims (3)

1. A multi-source data set cleaning method based on predicates is characterized by comprising the following steps:
(1) constructing a predicate model: defining a priority predicate, a state predicate and an interaction predicate, wherein:
the priority predicate: for attributes $A_i$ and $A_j$, if $p_{score}(A_i) < p_{score}(A_j)$, the priority predicate $\mathrm{Prior}(A_i, A_j)$ is defined, indicating that the priority of attribute $A_i$ is higher than the priority of attribute $A_j$;
[Formula image: definition of $p_{score}(A_i)$]
wherein $H(A_i)$ denotes the Shannon entropy of attribute $A_i$, and $p_n(A_i)$ denotes the proportion of null values among all attribute values of attribute $A_i$;
the state predicate is:
$$\mathrm{Stat}_{P,\phi}(A_k):\ \forall t_i, t_j\ \bigl(P(t_i[A_k], t_j[A_k]) \wedge \phi(t_i, t_j) \rightarrow t_i \succ_{A_k} t_j\bigr)$$
wherein $t_i$ denotes the $i$-th piece of data, $t_i[A_k]$ denotes the attribute value of attribute $A_k$ in $t_i$, $P(t_i[A_k], t_j[A_k])$ denotes a predefined condition on $t_i[A_k]$ and $t_j[A_k]$, and $\phi(t_i, t_j)$ denotes a predefined condition satisfied by $t_i$ and $t_j$; $\mathrm{Stat}(A_k)$ indicates that when $t_i$ and $t_j$ satisfy the conditions $P$ and $\phi$, the credibility of $t_i$ at attribute $A_k$ is higher than that of $t_j$;
the interaction predicate is $\mathrm{Inter}_{\delta}(A_1, \ldots, A_l)$, which indicates that when a piece of data satisfies the condition $\delta$, the attribute values of attributes $A_1, \ldots, A_l$ of that piece of data are of poor quality;
(2) carrying out predicate mining on the data set to be cleaned through the predicate model defined in the step (1) to obtain a priority predicate, a state predicate and an interaction predicate in the data set;
(3) deducing the attribute value credibility of each data in the data set according to the obtained predicate, comprising the following steps:
(3-1) initializing the credibility of all attribute values of the data in the data set to be 0, and setting an influence factor η as a constant for each attribute value of each piece of data;
(3-2) updating the credibility of each attribute value of each piece of data by using the state predicates and the interaction predicates, wherein during updating, the state predicates are firstly used for updating and then the interaction predicates are used for updating, or the interaction predicates are firstly used for updating and then the state predicates are used for updating;
the method for updating the credibility of each attribute value of the data by using the state predicates comprises: enumerating all pairs of data $t_i$ and $t_j$ in the data set; if $t_i$ and $t_j$ satisfy, on attribute $A_k$, the state predicate
Figure FDA0002454550110000017
then η is subtracted from the credibility of the attribute value [Figure FDA0002454550110000018];
the method for updating the credibility of each attribute value of the data by using the interaction predicates comprises: traversing all data in the data set; if a piece of data satisfies an interaction predicate $Inter_\delta(A_1, \dots, A_l)$, then η is subtracted from the credibility of the attribute values of attributes $A_1, \dots, A_l$ of that piece of data;
(3-3) after the step (2) is finished, updating the attribute value credibility of each piece of data by using the priority predicates, and executing the priority predicates in sequence from high priority to low priority during updating;
executing a priority predicate $Prior(A_i, A_j)$ comprises: if multiple pieces of data have the same attribute value credibility on attribute $A_j$, sorting them in ascending order of the credibility of their attribute values on $A_i$, and adding $n-1$ to the credibility of the $A_j$ attribute value of the piece of data ranked $n$-th;
(3-4) after the credibility of all the attribute values is obtained, returning all the attribute values with the credibility being more than or equal to the preset threshold value as results for the multi-value attribute; for the attribute which only needs to return one result, executing the steps (4) to (6);
(4) normalizing the credibility of all attribute values, and calculating the credibility of all data sources in the data set to be cleaned according to the formula
Figure FDA0002454550110000021
wherein $\lambda_i$ denotes the credibility of data source $D_i$, $t$ denotes a piece of data in data source $D_i$, and $D(t)$ denotes the credibility of data $t$, which equals the sum of the credibility of all attribute values of that piece of data;
(5) updating the credibility of each attribute value according to the formula
Figure FDA0002454550110000022 $\lambda_k$
wherein $D'$ denotes the data sources that provide the attribute value [Figure FDA0002454550110000023] for attribute $A_j$; and returning to step (4) after the update;
(6) repeatedly executing steps (4) to (5) until the credibility of all attribute values converges; and, for each attribute that needs to return only one result, finding the attribute value with the highest credibility under that attribute as the final result.
2. The method for cleaning a multi-source data set based on a predicate of claim 1, wherein the state predicate and the interaction predicate are both obtained by a first-order logic predicate mining method.
3. The multi-source data set cleaning method based on predicates of claim 2, wherein before the data set is cleaned, all attributes of all data sets are manually marked to indicate whether each attribute needs to return one result or a plurality of results; if an attribute needs to return only one result, the attribute is marked as a single-value attribute, and the attribute value with the highest credibility under the attribute is taken as the final result during cleaning; if an attribute may have a plurality of results, the attribute is marked as a multi-value attribute, and all attribute values under the attribute whose credibility is greater than a preset threshold value are taken as final results during cleaning.
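Steps (3-1) to (3-3) of claim 1 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the predicates are passed in as hypothetical Python callables, the example data and the attribute `Year` are invented, and because the state predicate formula is only available as an image, the sketch assumes that the lower-ranked piece of data ($t_j$) is the one whose credibility is reduced by η.

```python
from itertools import permutations

ETA = 1.0  # influence factor η; the claim only requires it to be a constant


def apply_state_predicates(records, cred, state_preds):
    """state_preds: list of (attr, holds); holds(ti, tj) is True when the
    predicate ranks ti higher than tj on attr."""
    for i, j in permutations(range(len(records)), 2):
        for attr, holds in state_preds:
            if holds(records[i], records[j]):
                cred[j][attr] -= ETA  # assumption: tj is the penalized side


def apply_interaction_predicates(records, cred, inter_preds):
    """inter_preds: list of (attrs, delta); delta(t) is the condition δ."""
    for idx, rec in enumerate(records):
        for attrs, delta in inter_preds:
            if delta(rec):
                for attr in attrs:
                    cred[idx][attr] -= ETA


def apply_priority_predicate(cred, a_i, a_j):
    """Prior(a_i, a_j): break credibility ties on a_j using a_i."""
    groups = {}
    for idx, c in enumerate(cred):
        groups.setdefault(c[a_j], []).append(idx)
    for tied in groups.values():
        if len(tied) < 2:
            continue
        tied.sort(key=lambda idx: cred[idx][a_i])  # ascending credibility on a_i
        for n, idx in enumerate(tied, start=1):
            cred[idx][a_j] += n - 1  # the n-th ranked item gains n-1


# Hypothetical example data (not from the source).
records = [
    {"Salary": "142k", "Affiliation": "Amazon", "Year": 2018},
    {"Salary": "100k", "Affiliation": "", "Year": 2015},
    {"Salary": "100k", "Affiliation": "IBM", "Year": 2015},
]
# Step (3-1): every attribute value starts with credibility 0.
cred = [{"Salary": 0.0, "Affiliation": 0.0} for _ in records]

# Step (3-2): hypothetical predicates. A more recent record ranks higher
# on Salary; an empty Affiliation marks that value as poor quality.
apply_state_predicates(
    records, cred, [("Salary", lambda ti, tj: ti["Year"] > tj["Year"])]
)
apply_interaction_predicates(
    records, cred, [(("Affiliation",), lambda t: t["Affiliation"] == "")]
)

# Step (3-3): Prior(Salary, Affiliation) breaks ties on Affiliation.
apply_priority_predicate(cred, "Salary", "Affiliation")

for rec, c in zip(records, cred):
    print(rec["Salary"], rec["Affiliation"] or "(empty)", c)
```

In this example the first and third items tie on Affiliation after step (3-2), and the priority predicate separates them by their Salary credibility. Items that also tie on $A_i$ keep their input order here because Python's sort is stable; the claim does not specify how such ties should be handled.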
CN201810578708.3A 2018-06-06 2018-06-06 Multi-source data set cleaning method based on predicates Active CN108776697B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810578708.3A CN108776697B (en) 2018-06-06 2018-06-06 Multi-source data set cleaning method based on predicates

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810578708.3A CN108776697B (en) 2018-06-06 2018-06-06 Multi-source data set cleaning method based on predicates

Publications (2)

Publication Number Publication Date
CN108776697A CN108776697A (en) 2018-11-09
CN108776697B true CN108776697B (en) 2020-06-09

Family

ID=64024668

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810578708.3A Active CN108776697B (en) 2018-06-06 2018-06-06 Multi-source data set cleaning method based on predicates

Country Status (1)

Country Link
CN (1) CN108776697B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109582906B (en) * 2018-11-30 2021-06-15 北京锐安科技有限公司 Method, device, equipment and storage medium for determining data reliability

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1811772A (en) * 2005-01-25 2006-08-02 翁托普里塞有限公司 Integration platform for heterogeneous information sources
CN105045807A (en) * 2015-06-04 2015-11-11 浙江力石科技股份有限公司 Data cleaning algorithm based on Internet trading information
CN105279232A (en) * 2015-09-22 2016-01-27 武汉开目信息技术有限责任公司 Method for showing screening and classification of data set in PDM (Product Data Management) system
CN105608228A (en) * 2016-01-29 2016-05-25 中国科学院计算机网络信息中心 High-efficiency distributed RDF data storage method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8560523B2 (en) * 2008-06-26 2013-10-15 Microsoft Corporation View matching of materialized XML views

Also Published As

Publication number Publication date
CN108776697A (en) 2018-11-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant