CN108776697B - Multi-source data set cleaning method based on predicates - Google Patents
Abstract
The invention provides a predicate-based multi-source data set cleaning method that identifies the most reliable data items in homogeneous multi-source data sets; it relates to the fields of data cleaning, data fusion and the like. The method comprises the following steps: 1) mine predicates automatically and filter the mined predicates; 2) infer the credibility of each entity's attribute values in the data set from the predicates; 3) establish a relation between attribute-value credibility and data-source credibility, and compute the data-source credibility; 4) combine the data-source credibility and the attribute-value credibility to find the most credible data items. Given multiple data sources, the invention analyzes information that comes from different sources but describes the same content, filters out redundant, erroneous, and outdated data, and keeps the most credible data, which is important for the efficiency and accuracy of subsequent data analysis.
Description
Technical Field
The invention relates to the fields of data cleaning, data fusion and the like, in particular to a multi-source data set cleaning method based on predicates.
Background
In the information age, descriptions of the same event or object can be found in a large number of data sources; owing to timing errors, format errors, and differences in accuracy, completeness, and semantics, the descriptions of the same entity from different sources are often inconsistent. After data are collected from different sources, resolving the inconsistencies among the descriptions of the same entity is crucial for subsequent data analysis. A simple voting strategy (selecting the description supported by the most data sources) is not suitable for the current Web environment; more sophisticated cleaning strategies must consider the credibility of the data sources, the credibility of the data, and some prior knowledge. Existing cleaning strategies mainly include the following:
Chinese patent application No. 201410387772 discloses a "system and method for processing bus traffic conditions based on traffic multi-source data fusion", which fuses traffic data describing bus conditions from different data sources into displayable traffic information. Its input is specific traffic data; it neither judges credibility with predicates nor computes data-source credibility from the relation between the data and the data sources.
Chinese patent application No. 201110369877 discloses a multi-source data integration platform and a construction method thereof, which manages heterogeneous data; the data involved have no consistency problem.
US patent No. 8190546, "Dependency between sources in truth discovery", evaluates the credibility of data sources and data by building a probabilistic graphical model of the copy relationships between data sources; it does not evaluate data credibility with predicates.
Disclosure of Invention
Purpose of the invention: the invention provides a multi-source data set cleaning method based on data-source credibility and data credibility. It addresses two open problems in multi-source data consistency: the initial credibility of data is hard to determine, and it is unclear how to combine data-source credibility with data credibility during multi-source data fusion. The method cleans data by defining predicates to compute data credibility, then computing data-source credibility from the data credibility, and finally selecting the most credible data.
Technical scheme: to achieve the above effects, the invention provides a predicate-based multi-source data set cleaning method comprising the following steps:
(1) constructing a predicate model: defining a priority predicate, a state predicate and an interaction predicate, wherein:
the priority predicate is Prior(A_i, A_j), denoting that the priority of attribute A_i is higher than the priority of attribute A_j;
the state predicate is Stat(A_k): P(t_i[A_k], t_j[A_k]) ∧ φ(t_i, t_j), where t_i denotes data item i, t_i[A_k] denotes the value of attribute A_k in data item i, P(t_i[A_k], t_j[A_k]) is a predefined condition on t_i[A_k] and t_j[A_k], and φ(t_i, t_j) is a predefined condition on t_i and t_j; Stat(A_k) denotes that when t_i and t_j satisfy the conditions P and φ, the credibility of t_i[A_k] is higher than that of t_j[A_k];
the interaction predicate is Inter_δ(A_1, …, A_l), denoting that when a data item satisfies the condition δ, the values of its attributes A_1, …, A_l are of poor quality;
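As an illustrative aid (not part of the patent text), the three predicate types can be sketched as plain data structures; all class and field names here are our own:

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Priority predicate Prior(A_i, A_j): attribute `higher` outranks `lower`.
@dataclass(frozen=True)
class PriorityPredicate:
    higher: str  # A_i
    lower: str   # A_j

# State predicate Stat(A_k): if P(t_i[A_k], t_j[A_k]) and phi(t_i, t_j) hold,
# t_i's value of A_k is more credible than t_j's.
@dataclass(frozen=True)
class StatePredicate:
    attr: str
    P: Callable[[object, object], bool]   # condition on the two attribute values
    phi: Callable[[dict, dict], bool]     # condition on the two data items

# Interaction predicate Inter_delta(A_1, ..., A_l): if delta(t) holds,
# the values of attributes A_1..A_l in t are of poor quality.
@dataclass(frozen=True)
class InteractionPredicate:
    attrs: Tuple[str, ...]
    delta: Callable[[dict], bool]
```

A state predicate such as "for the numeric Salary attribute, the larger value is more credible" would then be `StatePredicate("Salary", lambda a, b: a > b, lambda ti, tj: True)`.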
(2) mining predicates from the data set to be cleaned with the predicate model defined in step (1), obtaining the priority, state and interaction predicates of the data set;
(3) inferring the attribute-value credibility of each data item in the data set from the mined predicates, comprising the following steps:
(3-1) initializing the credibility of every attribute value of every data item in the data set to 0, and setting an influence factor η to a constant;
(3-2) updating the credibility of each attribute value of each data item with the state predicates and the interaction predicates; during the update, either all state predicates are applied first and then the interaction predicates, or all interaction predicates are applied first and then the state predicates;
updating with a state predicate: enumerate pairs of data items t_i and t_j in the data set; if t_i and t_j satisfy the state predicate Stat(A_k) on attribute A_k, subtract η from the credibility of the attribute value t_j[A_k];
updating with an interaction predicate: traverse all data items in the data set; if a data item satisfies an interaction predicate Inter_δ(A_1, …, A_l), subtract η from the credibility of its attribute values on A_1, …, A_l;
(3-3) after step (3-2) is finished, updating the attribute-value credibility of each data item with the priority predicates, the priority predicates being executed in order from high priority to low priority;
executing a priority predicate Prior(A_i, A_j) comprises: if several data items have the same credibility on their attribute values of A_j, sorting them in ascending order of the credibility of their attribute values of A_i, and adding n−1 to the credibility of the A_j attribute value of the data item ranked n-th;
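The tie-breaking rule of step (3-3) can be sketched as follows (a minimal illustration of our own; the dict-based layout is an assumption, and tied A_i credibilities share a rank, which matches the worked example later in the document):

```python
from collections import defaultdict

def apply_priority(conf, items, a_i, a_j):
    """Prior(a_i, a_j): among items whose credibility on a_j ties, rank them
    by their credibility on a_i (ascending) and add rank bonuses to a_j.
    conf maps (item, attribute) -> credibility score and is updated in place."""
    groups = defaultdict(list)
    for t in items:
        groups[conf[(t, a_j)]].append(t)  # group items tied on a_j
    for tied in groups.values():
        if len(tied) < 2:
            continue
        ranks = sorted({conf[(t, a_i)] for t in tied})  # dense ascending ranks
        for t in tied:
            conf[(t, a_j)] += ranks.index(conf[(t, a_i)])  # n-th item gets n-1
    return conf
```

With Salary credibilities {0, −1, −2, −2} and all Research Area credibilities 0, this yields the {2, 1, 0, 0} column shown in Table 3 of the worked example.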
(3-4) after the credibility of all attribute values has been obtained, for each multi-valued attribute, returning all attribute values whose credibility is greater than or equal to a preset threshold as the result; for each attribute that returns only one result, executing steps (4) to (6);
(4) normalizing the credibility of all attribute values, then calculating the credibility of every data source in the data set according to λ_i = (Σ_{t∈D_i} d(t)) / |D_i|, where λ_i denotes the credibility of data source D_i, t denotes a data item of D_i, and d(t) denotes the credibility of t, equal to the sum of the credibility of all attribute values of t;
(5) updating the credibility of each attribute value according to d(t[A_j]) = (Σ_{D_k∈D′} λ_k) · d(t[A_j]), where D′ denotes the set of data sources providing the value t[A_j] for attribute A_j; returning to step (4) after updating;
(6) repeating steps (4) to (5) until the credibility of all attribute values converges; for each attribute that returns only one result, taking the attribute value with the highest credibility under that attribute as the final result.
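Steps (4) to (6) can be sketched as the following fixed-point iteration (a simplified reading of our own: the data layout is assumed, and the special handling of null values described later in the embodiment is omitted):

```python
def clean_iterate(data, conf, n_iter=50):
    """data: {source: {attribute: value}}; conf: {(source, attribute): credibility}.
    Alternates source-credibility and attribute-value-credibility updates."""
    attrs = {a for row in data.values() for a in row}
    lam = {}
    for _ in range(n_iter):
        # step (4): lambda_i = summed item credibility, normalized to sum 1
        lam = {s: sum(conf[(s, a)] for a in data[s]) for s in data}
        z = sum(lam.values())
        lam = {s: v / z for s, v in lam.items()}
        # step (5): d(t[A_j]) <- (sum of lambda over sources giving the same
        # value) * d(t[A_j]), then renormalize each attribute column
        new = {}
        for s, row in data.items():
            for a, v in row.items():
                support = sum(lam[s2] for s2, r2 in data.items() if r2.get(a) == v)
                new[(s, a)] = support * conf[(s, a)]
        for a in attrs:
            col = sum(new[(s, a)] for s in data if a in data[s])
            for s in data:
                if a in data[s]:
                    new[(s, a)] /= col
        conf = new
    return conf, lam
```

Starting from normalized credibilities, the more credible source's values drive its λ up, which in turn reinforces those values until the scores converge.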
Further, the priority predicate is defined as follows: for attributes A_i and A_j, if p_score(A_i) < p_score(A_j), the priority predicate Prior(A_i, A_j) is defined, denoting that the priority of attribute A_i is higher than that of attribute A_j; p_score(A_i) = H(A_i) + p_n(A_i), where H(A_i) denotes the Shannon entropy of attribute A_i and p_n(A_i) denotes the proportion of null values among all attribute values of A_i.
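A small sketch of the p_score computation (the combination H + p_n is reconstructed from the worked example later in the document; the sample frequencies in the usage below are an assumption chosen so that exactly one of the three Salary values repeats):

```python
import math
from collections import Counter

def p_score(values):
    """p_score(A) = Shannon entropy of the non-null values of attribute A
    plus the proportion of null values; a lower p_score means higher priority."""
    nonnull = [v for v in values if v is not None]
    n = len(nonnull)
    entropy = -sum((c / n) * math.log2(c / n) for c in Counter(nonnull).values())
    p_null = (len(values) - n) / len(values)
    return entropy + p_null
```

Four Salary values with one duplicate give H = 1.5 and p_n = 0, matching the entropy value used in the worked example.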
Further, the state predicate and the interaction predicate are obtained by a first-order logic predicate mining method.
Further, before the data set is cleaned, all attributes of all the data sets are labeled manually to indicate whether each attribute returns one result or several results. An attribute that returns only one result is marked as a single-valued attribute, and the attribute value with the highest credibility under that attribute is taken as the final result during cleaning; an attribute that may have several results is marked as a multi-valued attribute, and all attribute values whose credibility is greater than a preset threshold are taken as the final results during cleaning.
Beneficial effects: compared with the prior art, the invention has the following advantages:
The method finds highly credible attribute values by means of automatically mined predicates and the relation between the data sets and the attribute values, without assuming that an attribute has only one correct value and without relying on crowdsourcing or heavy manual intervention. It scores attribute-value credibility with the mined predicates, returns the values whose credibility exceeds a preset threshold for multi-answer attributes, and, for the remaining attributes, further updates the credibility by combining data-source credibility with attribute-value credibility and takes the most credible value as the result, which improves both the efficiency and the accuracy of data analysis. Engineers can readily implement software based on this scheme.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of a process for updating the confidence level of attribute values in a data source according to the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings.
Fig. 1 shows a flow chart of the present invention, which mainly includes the following parts:
a) First, three kinds of predicates are defined:
1) Priority predicate: for attributes A_i and A_j, if p_score(A_i) < p_score(A_j), the priority predicate Prior(A_i, A_j) is defined, denoting that the priority of attribute A_i is higher than that of attribute A_j; p_score(A_i) = H(A_i) + p_n(A_i), where H(A_i) denotes the Shannon entropy of attribute A_i and p_n(A_i) denotes the proportion of null values among all attribute values of A_i.
H(A_i) is computed as H(A_i) = −Σ_{x∈X} p(x)·log₂ p(x), where X is the range of values of attribute A_i and p(x) denotes the proportion of the attribute value x among all attribute values (null values excluded).
2) State predicate: the state predicate is a first-order logic predicate of the form Stat(A_k): P(t_i[A_k], t_j[A_k]) ∧ φ(t_i, t_j), denoting that if t_i and t_j satisfy the conditions P and φ, the credibility of t_i[A_k] is higher than that of t_j[A_k].
In the above definition, the condition φ(t_i, t_j) is composed of atomic conditions f_i(v_1, v_2), each of which can be instantiated as v_1 = v_2 or v_1 ≠ v_2. The condition P in the state-predicate definition can be instantiated by six predefined predicates P_1(v_1, v_2) to P_6(v_1, v_2). P_1 and P_2 apply to numeric attribute values: P_1(v_1, v_2) denotes that v_1 is larger than v_2, and P_2(v_1, v_2) denotes that v_1 is smaller than v_2. P_3 and P_4 apply to string attribute values: P_3(v_1, v_2) denotes that v_1 is longer than v_2, and P_4(v_1, v_2) denotes that v_1 is shorter than v_2. P_5 and P_6 also apply to string attribute values and capture which string contains more detailed information, measured by comparing the information content of the two strings with the Shannon entropy formula: P_5(v_1, v_2) denotes that v_1 is more detailed than v_2, and P_6(v_1, v_2) denotes that v_1 is more concise than v_2.
3) Interaction predicate: the interaction predicate is a first-order logic predicate of the form Inter_δ(A_1, …, A_l), denoting that when a data item satisfies the condition δ, the values of its attributes A_1, …, A_l are of poor quality.
In the above definition, δ is composed of predicates P′_i, where each P′_i can be instantiated by any of P_1 to P_6, or by the following four additional predicates: P_7(v_1, v_2), P_8(v_1, v_2), P_9(v_1, v_2), P_10(v_1, v_2). P_7 and P_8 apply to string attribute values: P_7(v_1, v_2) denotes that v_1 contains v_2, and P_8(v_1, v_2) denotes that v_1 does not contain v_2. P_9 and P_10 apply to both string and numeric attribute values: P_9(v_1, v_2) denotes that v_1 equals v_2, and P_10(v_1, v_2) denotes that v_1 does not equal v_2.
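The ten base predicates can be sketched as comparison functions; this rendering is our own, and the "more detailed" test of P_5/P_6 is implemented here as character-distribution entropy, one plausible reading of the patent's Shannon-entropy measure:

```python
import math
from collections import Counter

def str_entropy(s):
    # Shannon entropy of the character distribution: a proxy for how much
    # information (detail) a string carries
    n = len(s)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())

P = {
    1: lambda v1, v2: v1 > v2,                            # numeric: v1 larger
    2: lambda v1, v2: v1 < v2,                            # numeric: v1 smaller
    3: lambda v1, v2: len(v1) > len(v2),                  # string: v1 longer
    4: lambda v1, v2: len(v1) < len(v2),                  # string: v1 shorter
    5: lambda v1, v2: str_entropy(v1) > str_entropy(v2),  # v1 more detailed
    6: lambda v1, v2: str_entropy(v1) < str_entropy(v2),  # v1 more concise
    7: lambda v1, v2: v2 in v1,                           # string: v1 contains v2
    8: lambda v1, v2: v2 not in v1,                       # v1 does not contain v2
    9: lambda v1, v2: v1 == v2,                           # equal (string or numeric)
    10: lambda v1, v2: v1 != v2,                          # not equal
}
```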
Predicate mining is then performed on the data set. For the priority predicates, the priorities of all attributes are obtained by computing p_score(A_i) = H(A_i) + p_n(A_i) for every attribute. The state predicates and the interaction predicates are mined automatically by a first-order inductive learning method according to the definition of first-order logic predicates. After they are obtained, a domain expert may be asked to remove invalid predicates, further improving usability and yielding the final usable predicate set.
b) First, the credibility of all attribute values is initialized to 0 and the influence factor η is manually set to a real number; then the three kinds of predicates are executed in the following order to infer the credibility of the attribute values of all entities in the data set:
1) Applying the state predicates: enumerate pairs of data items in the data set; if a pair t_i, t_j satisfies a state predicate Stat(A_k), subtract η from the credibility of t_j[A_k].
2) Applying the interaction predicates: traverse all data items in the data set; if a data item satisfies an interaction predicate Inter_δ(A_1, …, A_l), subtract η from the credibility of its attribute values on A_1, …, A_l.
3) Applying the priority predicates: because a priority predicate uses the credibility of one attribute to update the credibility of another, the priority predicates with higher priority, i.e., those whose two attributes have the smaller sum of p_score, are executed first. The effect of a priority predicate Prior(A_i, A_j) is: if two data items t_1 and t_2 have the same credibility on A_j, the better of t_1[A_j] and t_2[A_j] can be judged through the credibility of attribute A_i. Concretely, all data items with the same credibility on A_j are sorted in ascending order of the credibility of their A_i values, and n−1 is added to the credibility of the A_j value of the item ranked n-th; in this way the A_j values with equal credibility are distinguished by the higher-priority attribute A_i. Note also that for attributes that return several results, priority predicates need not be applied to values whose credibility is negative.
After the credibility of all attribute values has been obtained, for each attribute that returns several results, all attribute values whose credibility is greater than or equal to the preset threshold are returned as the result; for each attribute that returns a single result, the following steps are continued.
c) The credibility of the data sources is computed according to λ_i = (Σ_{t∈D_i} d(t)) / |D_i| and normalized so that Σ_i λ_i = 1, as shown in FIG. 2. The credibility of each attribute value is then updated with d(t[A_j]) = (Σ_{D_k∈D′} λ_k) · d(t[A_j]): the new credibility of an attribute value equals the sum of the credibility of the data sources providing that value, multiplied by its original credibility; for a null value, only the credibility of the data source of that particular null value is counted, not that of other sources that also provide null. The credibility of all attribute values is then normalized again so that, for any attribute, the credibility of all its possible values sums to 1. These steps are repeated until the credibility of the data sources and of all attribute values converges.
d) Finally, for each attribute that returns only one result, the attribute value with the highest credibility under that attribute is taken as the final result, and it is combined with the results of step b) to form the overall final result.
The embodiments of the present invention will be described below with reference to specific examples:
Let the credibility of a data source D_i be λ_i, its attribute set be {A_1, …, A_n}, and let t ∈ D_i be a data item of the source, where t[A_j] denotes the value of t on attribute A_j. Let d(t) denote the credibility of the data item and d(t[A_j]) the credibility of the attribute value t[A_j]. The credibility of a data item equals the sum of the credibility of all its attribute values, namely: d(t) = Σ_j d(t[A_j]).
The credibility of a data source equals the average credibility of the data items it contains, i.e., the sum of their credibility divided by their number: λ_i = (Σ_{t∈D_i} d(t)) / |D_i|.
Meanwhile, let D′ be the set of data sources providing the value t[A_j] for attribute A_j; the credibility of the attribute value t[A_j] is the sum of the credibility of all data sources providing that value, multiplied by its own original credibility: d(t[A_j]) = (Σ_{D_k∈D′} λ_k) · d(t[A_j]).
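The first two definitions above translate directly into code (a minimal sketch of our own; the dict layout is assumed):

```python
def item_credibility(item_conf):
    """d(t): sum of the credibilities of all attribute values of data item t.
    item_conf maps attribute name -> credibility of t's value on it."""
    return sum(item_conf.values())

def source_credibility(source_items):
    """lambda_i: average credibility of the data items of source D_i."""
    return sum(item_credibility(c) for c in source_items) / len(source_items)
```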
example (b): the cleansing data set is shown in the following table, with a total of 5 data and 5 data sources, where tiFrom a data source DiData of a scientific researcher named Mary is described.
Cleaning data set table
First, the data set is inspected and manually preprocessed to eliminate some obviously unreasonable data, which makes the subsequent cleaning more efficient and more effective.
For example, in the data set above the Salary value of t_5 is negative, which is obviously unreasonable, and the values of its three attributes Research Area, Affiliation and Publication are not meaningful, so t_5 can be deleted directly as noise and excluded from the subsequent cleaning.
As for t_4, its Publication attribute has the value "-", which is also unreasonable; but since the other attribute values of t_4 still have reference value, the Publication value is simply changed to null.
After the above simple pre-processing, the resulting data set is as follows:
First, predicate mining is performed on the data set.
Mining priority predicates: for the priority predicates, the entropy and the null-value proportion of every attribute are counted.
The entropy is computed as H(A_i) = −Σ_{x∈X} p(x)·log₂ p(x), where p(x) denotes the proportion of the attribute value x among all attribute values (null values excluded).
Taking Salary as an example, there are 3 distinct attribute values: 142k, 120k and 88k.
Since one of these three values occurs twice among the four data items, the entropy of the attribute Salary is H(Salary) = −(2/4)·log₂(2/4) − (1/4)·log₂(1/4) − (1/4)·log₂(1/4) = 1.5.
Similarly, the entropies of the remaining attributes are computed; for the null-value proportions, p_n(Salary) = p_n(Research Area) = 0. From these, the p_score of every attribute is obtained.
Three priority predicates can then be defined from the less-than relations among the p_score values: Prior(Salary, Research Area), Prior(Salary, Affiliation), and Prior(Salary, Publication).
Mining state predicates: these are obtained automatically by the first-order logic predicate mining algorithm FOIL (First Order Inductive Learner).
Mining interaction predicates: these are likewise obtained automatically by FOIL.
Second, the credibility of the attribute values of all entities is inferred.
The credibility of all attribute values is initialized to 0, the influence factor η is set to 1, and the credibility threshold of the attributes that return several results is set to 0; the predicates are then applied in a certain order (different predicate execution orders may produce different results).
The state predicates and the interaction predicates both act directly on attribute values, and the attribute values themselves do not change; therefore the state predicates are mutually independent, as are the interaction predicates, and each group can be invoked in any order.
However, the priority predicates operate on the credibility of the attribute values, so they must be applied after all state predicates and interaction predicates.
In addition, the priority predicates themselves must also be executed in a certain order. To keep the credibility of the current attribute values up to date, the priority predicates with higher priority, i.e., those whose two attributes have the smaller sum of p_score, are executed first. For the priority predicates of the cleaned data set, the p_score sum of Prior(Salary, Research Area) is 3.5, the p_score sum of Prior(Salary, Affiliation) is also 3.5, and the p_score sum of Prior(Salary, Publication) is 4.11, so the predicates should be executed in the order Prior(Salary, Research Area), Prior(Salary, Affiliation), Prior(Salary, Publication).
After the state predicates are executed, the credibility of all attribute values is as shown in Table 1. With 4 data items compared pairwise, 16 comparisons are performed in total. Taking t_1 and t_2 as an example: they satisfy the mined state predicate on Salary, which judges t_1[Salary] more credible than t_2[Salary], so the credibility of t_2[Salary] is decreased by 1.
TABLE 1
| | Salary | Research Area | Affiliation | Publication |
| --- | --- | --- | --- | --- |
| t1 (D1) | 0 | 0 | 0 | 0 |
| t2 (D2) | -1 | 0 | 0 | 0 |
| t3 (D3) | -2 | 0 | 0 | 0 |
| t4 (D4) | -2 | 0 | 0 | 0 |
After the interaction predicates are executed, the credibility of all attribute values is as shown in Table 2. According to the mined interaction predicates, the credibility of every Affiliation and Publication attribute value equal to null is reduced by 1.
TABLE 2
| | Salary | Research Area | Affiliation | Publication |
| --- | --- | --- | --- | --- |
| t1 (D1) | 0 | 0 | 0 | 0 |
| t2 (D2) | -1 | 0 | 0 | -1 |
| t3 (D3) | -2 | 0 | 0 | 0 |
| t4 (D4) | -2 | 0 | -1 | -1 |
After the priority predicates are executed, the credibility of all attribute values is as shown in Table 3. Taking Research Area as an example, the initial credibility of the Research Area column is {0, 0, 0, 0}; by the priority predicate Prior(Salary, Research Area), the column is updated according to the Salary column {0, −1, −2, −2}. The values with the same credibility in the Research Area column are re-ordered in ascending order of the Salary column and receive bonuses of 0, 1 and 2 according to their ranks, yielding {2, 1, 0, 0}. The priority predicates Prior(Salary, Affiliation) and Prior(Salary, Publication) are then executed in the same way.
TABLE 3
| | Salary | Research Area | Affiliation | Publication |
| --- | --- | --- | --- | --- |
| t1 (D1) | 0 | 2 | 2 | 1 |
| t2 (D2) | -1 | 1 | 1 | -1 |
| t3 (D3) | -2 | 0 | 0 | 0 |
| t4 (D4) | -2 | 0 | -1 | -1 |
All attributes are then marked: a researcher can only have one salary and one affiliation at a time, so Salary and Affiliation each return only one result; but a researcher may have several research areas and publications, so the Research Area and Publication attributes may return several result values. For these multi-valued attributes, all attribute values with credibility greater than or equal to the threshold 0 are returned: for Research Area the result is {Data integration, Data clarification & Google Knowledge management information retrieval}, and for Publication the result is {Data integration, adaptive tool for Data errors}.
Third, the credibility of the data sources is computed.
The credibility d of every attribute value is first mapped into (0, 1) by the sigmoid function d′ = 1/(1 + e^(−d)) and each attribute column is normalized; the result is shown in Table 4. The credibility of all data sources is then calculated according to λ_i = (Σ_{t∈D_i} d(t)) / |D_i| and normalized.
TABLE 4
| | Salary | Research Area | Affiliation | Publication | λ |
| --- | --- | --- | --- | --- | --- |
| t1 (D1) | 0.496353 | 0.33723 | 0.369959 | 0.413275 | 0.404204 |
| t2 (D2) | 0.26698 | 0.2799 | 0.307065 | 0.152035 | 0.251495 |
| t3 (D3) | 0.118333 | 0.191435 | 0.210014 | 0.282655 | 0.200609 |
| t4 (D4) | 0.118333 | 0.191435 | 0.112963 | 0.152035 | 0.143692 |
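The (0, 1) mapping can be reproduced with a sigmoid followed by a per-column normalization; this reconstruction of ours matches the Table 4 figures for the Salary column {0, −1, −2, −2}:

```python
import math

def normalize_column(scores):
    """Map each raw credibility d through the sigmoid 1 / (1 + e^(-d)),
    then normalize the column so that it sums to 1."""
    mapped = [1.0 / (1.0 + math.exp(-d)) for d in scores]
    total = sum(mapped)
    return [m / total for m in mapped]
```

`normalize_column([0, -1, -2, -2])` returns approximately [0.496353, 0.266980, 0.118333, 0.118333], the Salary column of Table 4.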
Finally, the two formulas λ_i = (Σ_{t∈D_i} d(t)) / |D_i| and d(t[A_j]) = (Σ_{D_k∈D′} λ_k) · d(t[A_j]) are iterated to update the credibility of all attribute values until convergence; each time a column of attribute-value credibilities has been updated, the column is normalized again. Taking the first update as an example, for the attribute-value credibility of the Salary column (t_3 and t_4 provide the same Salary value, so each is multiplied by λ_3 + λ_4 = 0.344301):
{0.496353, 0.26698, 0.118333, 0.118333}
→ {0.496353×0.404204, 0.26698×0.251495, 0.118333×0.344301, 0.118333×0.344301}
→ {0.200628, 0.0671441, 0.0407422, 0.0407422}
Similarly, for the attribute-value credibility of the Research Area column:
{0.33723, 0.2799, 0.191435, 0.191435} → {0.500009, 0.258216, 0.140871, 0.100903}
Attribute-value credibility of the Affiliation column:
{0.369959, 0.307065, 0.210014, 0.112963} → {0.404204, 0.251495, 0.200609, 0.143692}
Attribute-value credibility of the Publication column:
{0.413275, 0.152035, 0.282655, 0.152035} → {0.588542, 0.134713, 0.199777, 0.0769686}
Finally, the credibility of the data sources is updated:
λ = {0.5168, 0.209168, 0.164478, 0.109554}
The above process is repeated until convergence; the final results are shown in Table 5.
TABLE 5
| | Salary | Research Area | Affiliation | Publication | λ |
| --- | --- | --- | --- | --- | --- |
| t1 (D1) | 1 | 1 | 1 | 1 | 1 |
| t2 (D2) | 0 | 0 | 0 | 0 | 0 |
| t3 (D3) | 0 | 0 | 0 | 0 | 0 |
| t4 (D4) | 0 | 0 | 0 | 0 | 0 |
Fourth, the results are obtained.
According to Table 5, the best attribute value of the Salary and Affiliation attributes, i.e., the value with the highest credibility, can be selected as the result: the Salary result is {142k} and the Affiliation result is {Amazon}.
The above description covers only preferred embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and adaptations without departing from the principles of the invention, and such modifications and adaptations also fall within the scope of the invention.
Claims (3)
1. A multi-source data set cleaning method based on predicates is characterized by comprising the following steps:
(1) constructing a predicate model: defining a priority predicate, a state predicate and an interaction predicate; wherein the content of the first and second substances,
and (3) priority predicates: for attribute AiAnd AjIf p isscore(Ai)<pscore(Aj) Then define a priority predicate Prior (A)i,Aj) Represents an attribute AiPriority of pscore(Ai) Higher than attribute AjPriority of pscore(Aj);Wherein, H (A)i) Represents attribute AiShannon entropy of pn(Ai) Represents attribute AiThe null value ratio among all the attribute values of (1);
the state predicates are as follows:wherein, tiThe expression is given to the sentence i,representing an attribute A in a statement ikThe value of the attribute of (a) is,representing predefinedAndwith respect to the condition, phi (t)i,tj) Representing a predefined tiAnd tjThe condition satisfied in (a); stat (A)k) When t is showniAnd tjWhen the conditions P and phi are satisfied, tiIs higher than tj;
The interaction predicates are as follows: interδ(A1,…,Al) Indicates that when the data satisfies the condition delta, the attribute A of the piece of data1,…,AlThe quality of the attribute value of (2) is poor;
(2) carrying out predicate mining on the data set to be cleaned through the predicate model defined in the step (1) to obtain a priority predicate, a state predicate and an interaction predicate in the data set;
(3) deducing the attribute value credibility of each data in the data set according to the obtained predicate, comprising the following steps:
(3-1) initializing the credibility of all attribute values of the data in the data set to be 0, and setting an influence factor η as a constant for each attribute value of each piece of data;
(3-2) updating the credibility of each attribute value of each piece of data by using the state predicates and the interaction predicates, wherein during updating, the state predicates are firstly used for updating and then the interaction predicates are used for updating, or the interaction predicates are firstly used for updating and then the state predicates are used for updating;
the method for updating the credibility of each attribute value of the data by using the state predicates comprises the following steps: enumerating two data t in the dataset two by twoiAnd tjIf t isiAnd tjAt attribute AkAnd (3) satisfying the state predicates:then the attribute value is addedη is subtracted;
the method for updating the credibility of each attribute value of the data by applying the interaction predicates comprises the following steps: traversing all data in the data set, if one data satisfies a certain interaction predicate Interδ(A1,…,Al) Then the piece of data attribute A is added1,…,Alη is subtracted from the confidence level of the attribute value of (a);
(3-3) after the step (2) is finished, updating the attribute value credibility of each piece of data by using the priority predicates, and executing the priority predicates in sequence from high priority to low priority during updating;
execution priority predicate Primary (A)i,Aj) Comprises the following steps: if multiple pieces of data are in attribute AjIf the confidence of the attribute values of (A) is the same, they are set to AiThe attribute values of (1) are sorted in ascending order according to the reliability of the attribute values, and the attribute values are sorted in the order of the sort in the A of the data of the nth bitjAdding n-1 to the credibility of the attribute value of (1);
(3-4) after the credibility of all the attribute values is obtained, returning all the attribute values with the credibility being more than or equal to the preset threshold value as results for the multi-value attribute; for the attribute which only needs to return one result, executing the steps (4) to (6);
(4) normalizing the credibility of all attribute values; according to the formulaCalculating the credibility of all data sources in the data set to be cleaned; wherein λ isiRepresenting a data source DiT denotes the data source DiD (t) represents the credibility of the data t, and the credibility of the data t is equal to the sum of the credibility of all attribute values of the data;
(5) updating the credibility of each attribute value according to the credibilities λk of the data sources in D′, wherein D′ represents the set of data sources that provide the corresponding attribute value for attribute Aj; returning to step (4) after the update;
(6) repeating steps (4) to (5) until the credibility of all attribute values converges; for each attribute that requires only one result, the attribute value with the highest credibility under that attribute is taken as the final result.
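The predicate-based updates in steps (3-2) and (3-3) can be sketched as follows. This is a minimal Python illustration, not the patented implementation: the exact predicate forms appear only as images in the source, so the reward-on-satisfaction / penalty-otherwise reading of the state predicate, the value η = 1, and every function and parameter name are assumptions.

```python
from collections import defaultdict
from itertools import combinations

ETA = 1.0  # adjustment step η for credibility updates (value assumed)

def apply_state_predicates(items, cred, state_preds, eta=ETA):
    """Pairwise update: state_preds maps an attribute to pred(u, v) -> bool.
    A satisfying pair rewards both attribute values; otherwise both are
    penalized (assumed reading of the claim)."""
    for i, j in combinations(range(len(items)), 2):
        for attr, pred in state_preds.items():
            if pred(items[i][attr], items[j][attr]):
                cred[(i, attr)] += eta
                cred[(j, attr)] += eta
            else:
                cred[(i, attr)] -= eta
                cred[(j, attr)] -= eta

def apply_interaction_predicates(items, cred, inter_preds, eta=ETA):
    """inter_preds: list of (pred(item) -> bool, attrs). A satisfied
    interaction predicate subtracts η from the item's credibility on
    each of the listed attributes A1,…,Al."""
    for idx, item in enumerate(items):
        for pred, attrs in inter_preds:
            if pred(item):
                for attr in attrs:
                    cred[(idx, attr)] -= eta

def apply_priority_predicate(items, cred, ai, aj):
    """Primary(Ai, Aj): among items tied on Aj credibility, rank them in
    ascending order of Ai credibility and add n-1 to the n-th item's Aj
    credibility."""
    groups = defaultdict(list)
    for idx in range(len(items)):
        groups[cred[(idx, aj)]].append(idx)
    for tied in groups.values():
        if len(tied) > 1:
            tied.sort(key=lambda idx: cred[(idx, ai)])  # ascending by Ai
            for n, idx in enumerate(tied, start=1):
                cred[(idx, aj)] += n - 1
```

Priority predicates would be applied last, in descending priority order, exactly as step (3-3) prescribes.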
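Steps (4) to (6) form a fixed-point iteration between source credibility and attribute-value credibility, in the spirit of classic truth-discovery methods. The formulas themselves are given only as images in the source, so the sketch below fixes one plausible instantiation: a source's trust is the sum of its items' normalized credibilities, and a value's credibility is the summed trust of the sources providing it. All names and the data layout are hypothetical.

```python
from collections import defaultdict

def clean(sources, n_iter=100, eps=1e-6):
    """sources: {source_id: {(entity, attr): value}} -- hypothetical layout.
    Returns the most credible value per (entity, attr)."""
    # claims: which sources assert each (entity, attr, value) triple
    claims = defaultdict(set)
    for s, facts in sources.items():
        for (e, a), v in facts.items():
            claims[(e, a, v)].add(s)
    if not claims:
        return {}
    cred = {c: 1.0 for c in claims}  # step (3) would seed this via predicates
    for _ in range(n_iter):
        # (4) normalize attribute-value credibilities, then score each source
        total = sum(cred.values()) or 1.0
        norm = {c: cred[c] / total for c in claims}
        trust = {s: sum(norm[(e, a, v)] for (e, a), v in facts.items())
                 for s, facts in sources.items()}
        # (5) a value's credibility is the summed trust of its providers
        new_cred = {c: sum(trust[s] for s in claims[c]) for c in claims}
        # (6) stop once every credibility has converged
        delta = max(abs(new_cred[c] - cred[c]) for c in claims)
        cred = new_cred
        if delta < eps:
            break
    # single-value attributes: keep the most credible value
    best = {}
    for (e, a, v), c in cred.items():
        if (e, a) not in best or c > best[(e, a)][1]:
            best[(e, a)] = (v, c)
    return {k: v for k, (v, _) in best.items()}
```

On a toy input where two of three sources agree on an entity's name, the agreed value accumulates the larger summed trust and is returned, illustrating how conflicting multi-source values are resolved.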
2. The predicate-based multi-source data set cleaning method according to claim 1, wherein the state predicates and the interaction predicates are both obtained by a first-order logic predicate mining method.
3. The predicate-based multi-source data set cleaning method according to claim 2, wherein, before the data set is cleaned, all attributes of all the data sets are manually marked to indicate whether each attribute should return one result or several results; if an attribute requires only one result, the attribute is marked as a single-value attribute, and the attribute value with the highest credibility under that attribute is taken as the final result during cleaning; if an attribute may have several results, the attribute is marked as a multi-value attribute, and all attribute values under that attribute whose credibility is greater than the preset threshold are taken as final results during cleaning.
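The single-value / multi-value rule of claim 3 amounts to a small post-processing step over the final credibilities. A minimal sketch, with hypothetical names and data layout (the threshold value is likewise an assumption):

```python
def select_results(cred, attr_kind, threshold=0.5):
    """cred: {(entity, attr, value): credibility}.
    attr_kind: {attr: "single" or "multi"} -- the manual marking of claim 3.
    Hypothetical helper; names and layout are illustrative."""
    out = {}
    for (entity, attr, value), c in cred.items():
        key = (entity, attr)
        if attr_kind.get(attr) == "multi":
            # multi-value attribute: keep every value clearing the threshold
            out.setdefault(key, [])
            if c >= threshold:
                out[key].append(value)
        else:
            # single-value attribute: keep only the most credible value
            if key not in out or c > out[key][1]:
                out[key] = (value, c)
    return {k: (v if isinstance(v, list) else v[0]) for k, v in out.items()}
```

For a multi-value attribute such as a film's genres, every value clearing the threshold survives; a single-value attribute such as a release year keeps only its top-scoring value.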
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810578708.3A CN108776697B (en) | 2018-06-06 | 2018-06-06 | Multi-source data set cleaning method based on predicates |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108776697A CN108776697A (en) | 2018-11-09 |
CN108776697B true CN108776697B (en) | 2020-06-09 |
Family
ID=64024668
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810578708.3A Active CN108776697B (en) | 2018-06-06 | 2018-06-06 | Multi-source data set cleaning method based on predicates |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108776697B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109582906B (en) * | 2018-11-30 | 2021-06-15 | 北京锐安科技有限公司 | Method, device, equipment and storage medium for determining data reliability |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1811772A (en) * | 2005-01-25 | 2006-08-02 | 翁托普里塞有限公司 | Integration platform for heterogeneous information sources |
CN105045807A (en) * | 2015-06-04 | 2015-11-11 | 浙江力石科技股份有限公司 | Data cleaning algorithm based on Internet trading information |
CN105279232A (en) * | 2015-09-22 | 2016-01-27 | 武汉开目信息技术有限责任公司 | Method for showing screening and classification of data set in PDM (Product Data Management) system |
CN105608228A (en) * | 2016-01-29 | 2016-05-25 | 中国科学院计算机网络信息中心 | High-efficiency distributed RDF data storage method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8560523B2 (en) * | 2008-06-26 | 2013-10-15 | Microsoft Corporation | View matching of materialized XML views |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||