CN114281809A - Multi-source heterogeneous data cleaning method and device - Google Patents

Multi-source heterogeneous data cleaning method and device

Info

Publication number
CN114281809A
Authority
CN
China
Prior art keywords
tuple
data
tuples
missing
subclass
Prior art date
Legal status
Granted
Application number
CN202111577423.6A
Other languages
Chinese (zh)
Other versions
CN114281809B (en)
Inventor
刘峰
张纪林
陈军相
袁俊峰
刘涛
金峻帆
钱瑞祥
Current Assignee
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN202111577423.6A
Publication of CN114281809A
Application granted
Publication of CN114281809B
Legal status: Active
Anticipated expiration

Abstract

The invention discloses a multi-source heterogeneous data cleaning method and device that address the invalid and low-quality data repairs caused by an improper cleaning order across multiple data-quality dimensions. Starting from multiple data-quality dimensions in the smart-campus setting, the invention safeguards the effectiveness of the overall data cleaning by standardizing the order in which data are checked and repaired. During data repair, currently known campus-internal knowledge is used as external constraint conditions to expand the repair rule set and improve the accuracy of data cleaning. In smart-campus construction, the cleaned campus data can be applied effectively throughout data governance, data opening, and data mining and analysis in colleges and universities. The method avoids the consistency problems that data repair introduces under multiple data-quality dimensions and greatly improves data availability.

Description

Multi-source heterogeneous data cleaning method and device
Technical Field
The invention relates to the field of computer technology, in particular to a multi-source heterogeneous data cleaning method and device, and more particularly to a method for checking and repairing data that suffers from data-quality problems such as integrity, consistency, and uniqueness in the field of data cleaning.
Background
With the rapid development of information technology, data is growing explosively in the background of the big data era. In the process of integrating multi-source heterogeneous data, any improper operation can cause a series of data quality problems. In the field of data mining, the data quality determines whether more valuable knowledge can be mined from massive and complex data, and therefore more reliable and accurate decision support is provided for users.
At present, the industry measures data quality mainly along six dimensions: completeness, consistency, uniqueness, accuracy, validity, and timeliness. Most traditional research on data quality targets only a single dimension, or ignores the correlations that exist among multiple dimensions, so the usability of the cleaned data is low. Real-world data tend to be multi-dimensional, and the dimensions are not fully independent of one another. Traditional single-dimension, simplistic cleaning methods and devices are therefore no longer suitable for the multi-dimensional data-quality problems of today's complex scenarios.
Disclosure of Invention
Aiming at the problems, the invention provides a multi-source heterogeneous data cleaning method and device, and aims to solve the problem of multi-dimensional data quality caused by omission, constraint violation, repeated operation and the like when an operator collects and inputs data in real life. By the method and the device, data cleaning of data in three quality dimensions of integrity, consistency and uniqueness can be completed, and the usability of the data is improved.
In order to achieve the purpose, the invention provides a multi-source heterogeneous data cleaning method, which comprises the following specific steps:
Step 1: obtain multi-source heterogeneous data, convert data with the same attribute into a uniform data format, and obtain a data set consisting of a number of tuples, where each tuple consists of one datum for every attribute;
Multi-source means that the data sources are diverse; heterogeneous means that the data differ in type, characteristics, and so on;
Step 2: construct the conditional function dependencies that hold among the different attributes of the data processed in step 1, then add the conditional function dependencies Σ_cfd and the external constraints Σ_fc to the rule set Σ, where each rule in Σ corresponds to one conditional function dependency or one external constraint;
The external constraints are manually specified constraints on the data, such as hard constraints, quantity constraints, and equivalence constraints;
Step 3: perform an integrity check and integrity repair on all tuples of the data processed in step 1;
3-1 Integrity check
Traverse all tuples of the data processed in step 1 in order and judge whether the current tuple has missing data; if so, add it to the missing tuple set T_L, otherwise add it to the complete tuple set T_C;
3-2 Integrity repair
Traverse the missing tuple set T_L in order and check whether the missing items of the current missing tuple match some rules in the rule set Σ (i.e., conditional function dependencies Σ_cfd and/or external constraints Σ_fc); if so, fill the missing data of the current missing tuple using those rules, otherwise fill them with the improved-KNN-based hybrid filling algorithm;
the improved KNN-based hybrid filling algorithm comprises the following specific steps:
1) divide the non-missing data columns of the current missing tuple into 5 types of missing subclass tuples: numerical (num), binary (dual), ordinal (ordi), categorical (cate), and text (text);
2) divide the data columns of the complete tuple set T_C that correspond to each type of subclass tuple of the current missing tuple into 5 types of complete subclass tuple sets;
3) respectively calculating the subclass distance between each type of missing subclass tuple and the complete subclass tuple;
For numerical subclass tuples, calculate the subclass distance between the missing subclass tuple and the complete subclass tuple with the standardized Euclidean distance, formula (1);

$$D(L,C)_{num}=\sqrt{\sum_{i=1}^{n}\left(\frac{x_{Li}-x_{Ci}}{s_{i}}\right)^{2}}\tag{1}$$

where n is the total number of numerical data in the subclass tuple, x_{Li} is the i-th datum of the missing subclass tuple, x_{Ci} is the i-th datum of the complete subclass tuple, and s_i is the standard deviation of all values in the i-th data column of the subclass tuple;
For binary subclass tuples, calculate the subclass distance between the missing subclass tuple and the complete subclass tuple with formula (2);

$$D(L,C)_{dual}=\frac{q+r}{p+q+r+s}\tag{2}$$

where, regarding the two values of the binary data as 0 and 1, p is the number of positions whose corresponding data are 1 in both the missing and the complete subclass tuple, q is the number of positions that are 0 in the missing subclass tuple and 1 in the complete subclass tuple, r is the number of positions that are 1 in the missing subclass tuple and 0 in the complete subclass tuple, and s is the number of positions whose corresponding data are 0 in both;
For ordinal subclass tuples, first convert the ordinal data in the tuple into numerical data with formula (3), then calculate the subclass distance between the missing subclass tuple and the complete subclass tuple with the numerical tuple distance, formula (4);

$$X_{i}=\frac{M_{i}}{N_{i}-1}\tag{3}$$

$$D(L,C)_{ordi}=D(L,C)_{num}\tag{4}$$

where, if all values of the i-th data column of the ordinal subclass tuple are regarded as a sequence numbered from 0 upward, N_i is the total number of serial numbers of the i-th column, M_i is the serial number of the current value within that sequence, and X_i is the converted numerical datum;
For categorical subclass tuples, calculate the subclass distance between the missing subclass tuple and the complete subclass tuple with formula (5);

$$D(L,C)_{cate}=\frac{T-E}{T}\tag{5}$$

where, since the missing subclass tuple and the complete subclass tuple contain the same number of data, T is the total number of data in either subclass tuple and E is the number of positions whose corresponding data are identical in both;
For text subclass tuples, first calculate the distance between string data with the edit distance, formula (6), then calculate the subclass distance between the missing subclass tuple and the complete subclass tuple with formula (7), which also normalizes the result;

$$D_{i}(L_{j},C_{k})_{text}=\begin{cases}\max(j,k)&\min(j,k)=0\\\min\begin{cases}D_{i}(L_{j-1},C_{k})+1\\D_{i}(L_{j},C_{k-1})+1\\D_{i}(L_{j-1},C_{k-1})+[L_{j}\neq C_{k}]\end{cases}&\text{otherwise}\end{cases}\tag{6}$$

$$D(L,C)_{text}=\frac{1}{m}\sum_{i=1}^{m}\frac{D_{i}(L,C)_{text}}{\mathrm{Max}(U_{i},V_{i})}\tag{7}$$

where D_i(L,C)_{text} is the edit distance between the i-th string datum of the missing subclass tuple and that of the complete subclass tuple, L_j and C_k are the first j and first k characters of the i-th string datum in the missing and complete subclass tuple respectively (0 ≤ j ≤ U_i, 0 ≤ k ≤ V_i), and Min is the minimum function; since the missing subclass tuple and the complete subclass tuple contain the same number of data, m is the total number of string data in either subclass tuple, U_i and V_i are the total lengths of the i-th string datum in the missing and complete subclass tuple respectively, and Max is the maximum function;
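Under the assumptions above (equal-length subclass tuples, pre-computed column standard deviations), the five per-type distance measures can be sketched in Python as follows; function names and arguments are the author's illustrative notation, not part of the patent:

```python
# Sketch of the subclass distances of formulas (1), (2), (5), (6) and (7).
# Each pair L, C is a missing / complete subclass tuple of one type;
# the ordinal case reduces to d_num after the formula (3) conversion.
import math

def d_num(L, C, s):
    """Formula (1): standardized Euclidean distance; s[i] is the
    standard deviation of the i-th numerical column."""
    return math.sqrt(sum(((l - c) / si) ** 2 for l, c, si in zip(L, C, s)))

def d_dual(L, C):
    """Formula (2): mismatch ratio (q + r) / (p + q + r + s) over 0/1 data."""
    mismatches = sum(1 for l, c in zip(L, C) if l != c)  # q + r
    return mismatches / len(L)                           # p+q+r+s = len(L)

def d_cate(L, C):
    """Formula (5): (T - E) / T, the share of differing categorical values."""
    E = sum(1 for l, c in zip(L, C) if l == c)
    return (len(L) - E) / len(L)

def edit_distance(a, b):
    """Formula (6): Levenshtein distance via the classic DP recurrence."""
    prev = list(range(len(b) + 1))
    for j, ca in enumerate(a, 1):
        cur = [j]
        for k, cb in enumerate(b, 1):
            cur.append(min(prev[k] + 1,                # deletion
                           cur[-1] + 1,                # insertion
                           prev[k - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def d_text(L, C):
    """Formula (7): edit distance normalized by the longer string, averaged."""
    return sum(edit_distance(l, c) / max(len(l), len(c))
               for l, c in zip(L, C)) / len(L)
```

Each function returns a value in a comparable range, which is what allows the weighted combination of formula (8) to mix the five types.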
4) calculate the tuple distance between the missing tuple t_1 and the complete tuple t_2;
The tuple distance between the missing tuple t_1 and the complete tuple t_2 is obtained by multiplying each of the 5 subclass distances above by its corresponding weight W_i and summing the products, as in formulas (8) and (9);

$$D(t_{1},t_{2})=\sum_{i=1}^{5}W_{i}\cdot D_{i}(t_{1},t_{2})\tag{8}$$

$$W_{i}=\frac{Y_{i}}{Y}\tag{9}$$

where i indexes the 5 types of subclass tuple, W_i is the weight coefficient of the i-th type of subclass tuple within the current tuple, and D_i(t_1,t_2) is the subclass distance between the i-th missing subclass tuple and the corresponding complete subclass tuple; Y is the total number of data in the current tuple and Y_i is the number of data of the i-th type in the current tuple;
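As a minimal sketch, formulas (8) and (9) amount to a weighted average of the subclass distances, with each weight equal to that type's share of the tuple's columns (names here are illustrative):

```python
# Formulas (8) and (9): D(t1, t2) = sum_i W_i * D_i(t1, t2) with W_i = Y_i / Y.
def tuple_distance(sub_distances, counts):
    """sub_distances maps a type to D_i(t1, t2); counts maps it to Y_i,
    the number of columns of that type in the tuple."""
    Y = sum(counts.values())  # total number of data in the tuple
    return sum((counts[t] / Y) * d for t, d in sub_distances.items())

# 2 numerical columns at distance 0.25 plus 2 text columns at distance 0.5:
print(tuple_distance({"num": 0.25, "text": 0.5}, {"num": 2, "text": 2}))  # 0.375
```

Because the weights sum to 1, the tuple distance stays on the same scale as the individual subclass distances.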
5) sort the tuple distances between the missing tuple and the complete tuples in increasing order;
6) select the first k complete tuples, those with the smallest tuple distances, as the target tuple set;
the k value is obtained by training, and the specific steps are as follows:
6-1) dividing all complete tuples into a test tuple set and a training tuple set;
6-2) dividing the training tuple set into n sub-tuple sets with the same size;
6-3) take each sub-tuple set in turn as the complete tuple set T_C and repair the current missing tuple using each of 1 to 100 as the training k value;
6-4) obtaining the k value with the highest repairing accuracy in each sub-tuple set;
6-5) taking the average value of the n k values as a repairing k value of the test tuple set;
To ensure that the training k value can range over 1 to 100, each sub-tuple set must contain at least 100 tuples when the training tuple set is divided;
7) select the most frequent datum in the column of the target tuple set that corresponds to the missing item of the missing tuple as the filling value for the missing data;
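Steps 5) to 7) above can be sketched as follows; `distance` stands in for the tuple distance of formula (8), and the field names are illustrative assumptions:

```python
# Sketch of steps 5)-7): rank the complete tuples by tuple distance, keep the
# k nearest as the target tuple set, and fill the missing item with the most
# frequent value in the corresponding column.
from collections import Counter

def knn_fill(missing_tuple, complete_tuples, missing_attr, k, distance):
    targets = sorted(complete_tuples, key=lambda c: distance(missing_tuple, c))[:k]
    return Counter(c[missing_attr] for c in targets).most_common(1)[0][0]

rows = [{"age": 20, "city": "Hangzhou"},
        {"age": 21, "city": "Hangzhou"},
        {"age": 35, "city": "Beijing"}]
toy_distance = lambda a, b: abs(a["age"] - b["age"])  # stand-in for formula (8)
print(knn_fill({"age": 19, "city": None}, rows, "city", 2, toy_distance))  # Hangzhou
```

The two nearest neighbours of the 19-year-old tuple are the Hangzhou rows, so "Hangzhou" is the majority fill value.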
Step 4: perform a consistency check and consistency repair on all tuples of the data processed in step 3;
4-1 Consistency check
Traverse all tuples in order and check whether the current tuple matches every rule in the rule set of step 2; if so, continue with the next tuple, otherwise add the rules violated by the current tuple (i.e., conditional function dependencies Σ_cfd and/or external constraints Σ_fc) to the abnormal rule set Σ';
4-2 consistency repair
The consistency repair mainly comprises 3 processes: determining the rule repair order, locating abnormal tuples, and selecting the target tuple;
4-2-1 determining a rule repair order
1) Construct a rule sequence diagram G(V,E), taking the conditional function dependencies in the abnormal rule set Σ' as the nodes V and the dependency relationships between nodes as the edges E, where V = Σ'. For any two conditional function dependencies φ_1, φ_2 ∈ Σ', if R(φ_1) ∩ L(φ_2) ≠ ∅, then there is an edge from φ_1 pointing to φ_2, i.e. φ_1 and φ_2 have a dependency relationship, where L and R denote the left-hand and right-hand sides of a conditional function dependency, respectively;
2) In turn, select the nodes (i.e., conditional function dependencies) with in-degree 0 in the rule sequence diagram as priority repair rules and add them to the repair rule set Σ_rep, then delete each such node and the edges connected to it, until no nodes remain in the rule sequence diagram G(V,E); if the rule sequence diagram is not empty yet contains no node with in-degree 0, select from all combinations of the conditional function dependencies in the diagram the combination with the minimum total repair cost as the repair rule set Σ_rep;
In-degree 0 means that no edge in the rule sequence diagram points to the node;
the repair cost refers to the total number of times of modifying tuple data generated when one tuple is used for performing consistency repair on all abnormal tuples violating the current rule;
4-2-2 location anomaly tuples
Traverse all rules of the repair rule set Σ_rep in order, adding every tuple that violates the current rule to the abnormal tuple set T_e;
4-2-3 select target tuples
Combining the external constraint rules Σ_fc of the rule set Σ, select from the abnormal tuple set T_e the tuple with the minimum repair cost as the target tuple, and repair the other abnormal tuples with the target tuple;
Step 5: perform a uniqueness check and repair on all tuples of the data processed in step 4.
For the data processed in step 4, use the improved SNM algorithm based on a hybrid distance and a dynamic window to check whether the tuple distance between the first tuple in the sliding window and each other tuple in the window is smaller than a set distance threshold; if so, the two tuples are considered similar duplicates and the duplicate tuple in the window is deleted, otherwise the first tuple and the other tuples are considered to satisfy the uniqueness condition; then move the first tuple out of the sliding window and move in the tuple following the last tuple in the window, and repeat until all tuples have passed the uniqueness check, completing the uniqueness check and repair of the data;
the improved SNM algorithm based on the hybrid distance and the dynamic window specifically comprises the following steps:
5-1) selecting one or more data for all tuples, calculating corresponding key values of the data and using the key values as sorting keywords;
5-2) sorting all tuples according to the sorting keywords;
5-3) setting a sliding window with the initial size of N and the step length of 1 on the sorted tuples, calculating the tuple distance between the first tuple in the sliding window and other tuples in the window according to a formula (8), deleting repeated tuples in the sliding window if at least one tuple distance is smaller than a distance threshold, and otherwise moving the sliding window by one step length to move the first tuple out of the sliding window and move the next tuple of the last tuple in the sliding window in;
5-4) calculate the ratio of the tuple distance between the head and tail tuples of the sliding window to the number of tuples in the window, and take it as the window average density; if the window average density is smaller than the density threshold, increase the sliding window size; if it equals the density threshold, keep the size unchanged; if it is larger than the density threshold, reduce the size; continue sliding until all tuples have been checked;
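A compact sketch of the pass described in steps 5-1) to 5-4), assuming a `distance` function in place of formula (8) and illustrative thresholds:

```python
# Improved-SNM sketch: sort by a key, slide a window, delete tuples that are
# within dist_thresh of the window head, and grow/shrink the window from the
# window average density (head-tail distance over window size).
def snm_dedup(tuples, key, distance, dist_thresh, dens_thresh, init_window=4):
    records = sorted(tuples, key=key)           # steps 5-1) and 5-2)
    out, w, i = [], init_window, 0
    while i < len(records):
        head = records[i]
        window = records[i + 1:i + w]
        # step 5-3): drop similar duplicates of the window head
        records[i + 1:i + w] = [t for t in window if distance(head, t) >= dist_thresh]
        out.append(head)
        if window:                              # step 5-4): adjust window size
            density = distance(head, window[-1]) / (len(window) + 1)
            if density < dens_thresh:
                w += 1
            elif density > dens_thresh:
                w = max(2, w - 1)
        i += 1                                  # slide by one step
    return out

nums = snm_dedup([1, 1, 2, 3, 3, 8], key=lambda x: x,
                 distance=lambda a, b: abs(a - b),
                 dist_thresh=0.5, dens_thresh=10)
print(nums)  # [1, 2, 3, 8]
```

The toy run uses numbers and absolute difference so the dedup behaviour is easy to follow; on real tuples the key would come from step 5-1) and the distance from formula (8).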
Step 6: recheck whether every tuple of the data processed in step 5 matches every rule in the rule set; if so, all tuples satisfy the consistency condition and the cleaning of the data set is finished; if not, return to step 4-2 and continue execution.
In order to achieve the purpose, the invention also provides a multi-source heterogeneous data cleaning device, which comprises the following specific modules:
the data acquisition and preprocessing module is used for acquiring multi-source heterogeneous data and converting the data with the same attribute into data with a uniform format;
the rule set building module is used for building a rule set containing conditional function dependence and external constraint on the data from the data acquisition and preprocessing module;
the integrity checking and repairing module is used for checking whether the data acquired by the data acquisition and preprocessing module is missing or not and then dividing the data into a missing tuple set and a complete tuple set; sequentially checking whether missing items of all missing tuples in the missing tuple set are matched with certain rules in the rule set, if so, filling missing data of the current missing tuple by using the rules, otherwise, filling the missing data of the current missing tuple by using a mixed filling algorithm based on improved KNN;
the consistency checking and repairing module is used for checking whether the data processed by the integrity checking and repairing module violates the rules in the rule set or not, if so, determining the repairing rules, and taking all tuples violating the repairing rules as abnormal tuples; selecting the tuple with the minimum repairing cost from all abnormal tuples as a target tuple, and repairing other abnormal tuples by using the target tuple;
the uniqueness checking and repairing module is used for checking and deleting repeated tuples on the data by using an improved SNM algorithm based on a mixed distance and a dynamic window for the data processed by the consistency checking and repairing module;
and the consistency secondary checking module is used for checking whether the data processed by the uniqueness checking and repairing module meets the consistency condition, finishing the cleaning of the data if the data meets the consistency condition, and returning to the consistency checking and repairing module to execute again if the data does not meet the consistency condition.
The technical scheme of the invention has the following advantages:
1. compared with the traditional single-dimension data cleaning, the method provided by the invention starts from three data quality dimensions of integrity, consistency and uniqueness, and designs a cleaning method and steps for data of each dimension respectively, so that the overall quality of multi-dimensional data is improved.
2. Compared with traditional data cleaning that relies only on conditional function dependencies, the invention uses not only the conditional function dependencies that exist among the data but also external constraint conditions, expanding the rule set for data cleaning and improving the detection and repair of data quality.
3. Compared with traditional single-type data cleaning, the invention can handle data-quality problems across five mixed data types, numerical, binary, ordinal, categorical, and text, and selects a suitable distance measure for each type, improving the accuracy of data cleaning.
4. Compared with the traditional data cleaning device, the invention avoids the influence of integrity repair on consistency repair and uniqueness repair and the influence of consistency repair on uniqueness repair by designing the standardized data cleaning device, and ensures the effectiveness of data cleaning.
Drawings
FIG. 1 is a schematic diagram illustrating steps of a multi-source heterogeneous data cleaning method according to an embodiment of the present invention;
FIG. 2 is a flow chart of integrity check and repair in an embodiment of the present invention;
FIG. 3 is a flow chart of consistency checking and repair in an embodiment of the present invention;
FIG. 4 is a flow chart of uniqueness checking and repairing in an embodiment of the present invention;
FIG. 5 is a diagram illustrating a dynamic sliding window in an embodiment of the present invention;
FIG. 6 is a block diagram of a multi-source heterogeneous data cleaning apparatus according to an embodiment of the present invention.
Detailed Description
In order to fully and clearly communicate the technical solutions of the embodiments of the present invention to those skilled in the art, the embodiments of the present invention will be described in further detail with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
As shown in fig. 1, a multi-source heterogeneous data cleaning method provided in an embodiment of the present invention includes the following steps:
step 1: the method comprises the steps of obtaining multi-source heterogeneous data, converting the data with the same attribute into a uniform data format, and obtaining a data set formed by a plurality of tuples, wherein one tuple is formed by a group of data with all attributes;
In this step, the data sources are the databases of the campus business systems, which contain multi-source heterogeneous data such as student basic-information data, score data, library access data, and campus-card consumption data. First, create an extraction task with Kettle (an ETL tool) and set the connection configuration of the source and target databases; then create a conversion task to convert fields with the same attribute in all tables into a uniform data format; finally, add the extraction and conversion tasks to a job and execute it to obtain the initial data set.
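As an illustration of the conversion step (the concrete source formats here are assumptions, not taken from the patent), unifying a shared attribute such as a date of birth might look like:

```python
# Hypothetical format unification: the same attribute arrives from different
# business systems in different string formats and is canonicalized.
from datetime import datetime

SOURCE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%Y%m%d"]  # assumed source formats

def unify_date(value):
    for fmt in SOURCE_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {value!r}")

print(unify_date("05/12/2001"))  # 2001-12-05
print(unify_date("20011205"))    # 2001-12-05
```

In the Kettle workflow this logic would live inside the conversion task; the point is only that every source's representation collapses to one canonical format before cleaning begins.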
Step 2: construct the conditional function dependencies that hold among the different attributes of the initial data acquired in step 1, then add the conditional function dependencies Σ_cfd and the external constraints Σ_fc to the rule set Σ, where each rule in Σ corresponds to one conditional function dependency or one external constraint;
in this step, first, a corresponding conditional function dependency is established between attribute fields having an association relationship in all data tables, for example, the personal identification number may determine the age, date of birth, etc., and then added to the rule set. Secondly, some external constraint conditions which can be artificially determined in the business department, such as the number of students in each province, the proportion of men and women, and the like of a certain college, are also added into the rule set. The rule set is specifically defined as follows:
Given a student basic-information data instance I: (school number, name, age, date of birth, identification number, province, city, zip code), the conditional function dependency set is Σ_cfd = ∑φ_i and the external constraint set is Σ_fc = ∑ψ_i, so the rule set is Σ = Σ_cfd ∪ Σ_fc. For a conditional function dependency φ: X → Y, X and Y are different attribute fields of the data table, meaning that for any two tuples (t_1, t_2), if t_1[X] = t_2[X] then t_1[Y] = t_2[Y]. Conversely, if t_1[X] = t_2[X] but t_1[Y] ≠ t_2[Y], then tuples t_1 and t_2 have a consistency error on rule φ.
For example, for a student basic-information table, the conditional function dependencies that can first be established are as follows:
φ_1: identification number → age, date of birth
φ_2: zip code → city, province
φ_3: school number → name
Secondly, from the known enrollment information of the school and of the computer school, the external constraints that can be determined are as follows:
ψ_1: the number of students from Hangzhou does not exceed 100
ψ_2: the male-to-female ratio in the computer school is not less than 3:1
And finally, combining the conditional function dependence and the external constraint condition to obtain a required rule set. It should be noted that the rules described above, such as conditional function dependencies and external constraints, are only used to describe the establishment of the rule set, and are not used to limit the rule set.
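A conditional function dependency of the φ_2 kind can be checked mechanically; the following sketch (tuples as dicts, field names illustrative) flags every pair of tuples that agrees on the left-hand side but disagrees on the right-hand side:

```python
# CFD violation check: t1[X] = t2[X] but t1[Y] != t2[Y] is a consistency error.
from itertools import combinations

def cfd_violations(tuples, lhs, rhs):
    return [(t1, t2) for t1, t2 in combinations(tuples, 2)
            if all(t1[a] == t2[a] for a in lhs)
            and any(t1[a] != t2[a] for a in rhs)]

students = [
    {"zip": "310018", "city": "Hangzhou", "province": "Zhejiang"},
    {"zip": "310018", "city": "Ningbo",   "province": "Zhejiang"},  # violates phi_2
    {"zip": "100080", "city": "Beijing",  "province": "Beijing"},
]
print(len(cfd_violations(students, lhs=["zip"], rhs=["city", "province"])))  # 1
```

The first two tuples share a zip code but name different cities, so they form one violating pair under φ_2.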
Step 3: traverse all tuples of the data processed in step 1 in order and judge whether the current tuple has missing data; if so, add it to the missing tuple set T_L, otherwise add it to the complete tuple set T_C. Then, using the rule set of step 2, the complete tuple set T_C, and the improved-KNN-based hybrid filling algorithm, repair the missing tuples; the integrity repair flow is shown in FIG. 2;
the step specifically includes two processes of integrity check and integrity repair.
Integrity check: traverse all tuples in the data in order and judge whether the current tuple has missing data; if so, add it to the missing tuple set T_L, otherwise add it to the complete tuple set T_C, so that all tuples containing missing data are detected.
Integrity repair: after the integrity check, traverse the missing tuple set T_L in order and check whether the missing items of the current missing tuple match some rules in the rule set Σ (i.e., conditional function dependencies Σ_cfd and/or external constraints Σ_fc); if so, fill the missing data of the current missing tuple using those rules, otherwise fill them with the improved-KNN-based hybrid filling algorithm.
Step 4: traverse all tuples of the data processed in step 3 in order, check whether each tuple matches every rule in the rule set of step 2, and record the conditional function dependencies and/or external constraints violated by tuples with consistency errors. Then repair the erroneous data of the erroneous tuples according to the rule repair order and the target tuple, completing the consistency check and repair; the consistency repair flow is shown in FIG. 3;
the step specifically comprises two processes of consistency check and consistency repair.
Consistency check: traverse all tuples in the data in order and check whether the current tuple matches every rule in the rule set Σ of step 2; if so, continue with the next tuple, otherwise add the rules violated by the current tuple (conditional function dependencies Σ_cfd and/or external constraints Σ_fc) to the abnormal rule set Σ';
and (3) consistency repair: the method specifically comprises 3 processes of determining a rule repairing sequence, positioning an abnormal tuple and selecting a target tuple.
Determining the rule repair order: since different conditional function dependencies in the abnormal rule set may involve the same attribute fields, the order in which the rules are repaired must be determined, otherwise incorrect repairs may result. In a specific implementation, the rule repair order is determined by constructing a rule sequence diagram and topologically sorting it: in turn, the nodes (conditional function dependencies) with in-degree 0 in the diagram are selected as priority repair rules and added to the repair rule set Σ_rep, and each such node and its connected edges are then deleted, until no nodes remain in the diagram. If the rule sequence diagram is not empty yet contains no node with in-degree 0, the rule combination with the minimum total repair cost among all conditional-function-dependency combinations in the diagram is selected as the repair rule set;
locating abnormal tuples and traversing the repairing rule set sigma in sequencerepAll of the rules in (1), adding all tuples violating the current rule to the abnormal tuple set Te
Selecting the target tuple: the choice of target values for abnormal data is a key problem in consistency repair. Given the abnormal tuple set, choosing different repair target values leads to large differences in the repair results and in the corresponding repair costs. In a specific implementation, the external constraints Σ_fc of the rule set Σ must be combined so that the tuple with the smallest repair cost in the abnormal tuple set T_e is selected as the target tuple with which to repair the other abnormal tuples.
Step 5: for the data processed in step 4, use the improved SNM algorithm based on a hybrid distance and a dynamic window to check whether the tuple distance between the first tuple in the sliding window and each other tuple in the window is smaller than the set distance threshold. If so, the two tuples are considered duplicates and the duplicate tuple in the window is deleted; if not, the first tuple and the other tuples are considered to satisfy the uniqueness condition, the first tuple is moved out of the sliding window, and the tuple following the last tuple in the window is moved in. This is repeated until all tuples have passed the uniqueness check; the uniqueness repair flow is shown in FIG. 4;
In this step, one attribute or a group of attributes is first selected over all tuples in the data set, and a key value is computed for each tuple and used as its sorting key.
All tuples are then sorted by this key, so that similar and duplicate tuples become adjacent in the ordering.
Then, a sliding window with an initial size of N is set on the sorted tuples (as shown in fig. 5), the tuple distance between the first tuple in the window and the other N-1 tuples in the window is calculated, and if the tuple distance between a certain tuple and the first tuple is smaller than the set distance threshold, the similar duplicate tuple is deleted.
And finally, moving a sliding window step length, moving out the first tuple in the sliding window and moving in the next tuple of the last tuple, and repeating the steps until all tuples in the data are checked.
While the window slides, the ratio of the tuple distance between the head and tail tuples in the window to the number of tuples in the window is computed and used as the window's average density. If the average density is higher than the set density threshold, the similarity between tuples in the window is considered low, and the window size can be appropriately reduced to cut down the number of comparisons and improve repair efficiency. Conversely, if the average density is lower than the density threshold, the similarity between tuples in the window is considered high, and the window size can be appropriately increased to widen the matching range and improve repair accuracy.
In order to further reduce matching errors among all tuples, new sorting keywords can be reselected to perform sorting, checking and repairing again, similar repeated tuples on data are deleted as far as possible through a multiple sliding window detection mechanism, and the accuracy of uniqueness checking and repairing is improved.
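The sorted-neighborhood pass above (sorting key, sliding window, density-driven resizing) can be sketched roughly as follows; the exact density formula, the resize bounds, and the parameter defaults are illustrative assumptions, not the patent's fixed values:

```python
def snm_dedup(records, sort_key, tuple_dist, dist_threshold,
              init_window=5, min_window=3, max_window=10,
              density_threshold=0.5):
    """Sketch of the improved SNM pass: sort by key, slide a window,
    delete tuples whose distance to the window head is below the
    distance threshold, and resize the window from its average density."""
    recs = sorted(records, key=sort_key)
    out = []
    i, w = 0, init_window
    while i < len(recs):
        head = recs[i]
        window = recs[i + 1:i + w]
        # drop near-duplicates of the head tuple in place
        recs[i + 1:i + w] = [t for t in window
                             if tuple_dist(head, t) >= dist_threshold]
        out.append(head)
        window = recs[i + 1:i + w]
        if window:
            # average density: head-to-tail distance over tuple count
            density = tuple_dist(head, window[-1]) / (len(window) + 1)
            if density > density_threshold:    # low similarity: shrink
                w = max(min_window, w - 1)
            elif density < density_threshold:  # high similarity: grow
                w = min(max_window, w + 1)
        i += 1
    return out
```

A second pass with a different sorting key, as the text suggests, is just another call to `snm_dedup` on its own output.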
Step 6: and (4) rechecking whether all tuples in the data processed in the step (5) are matched with all rules in the rule set, if so, all tuples meet the consistency condition, finishing the cleaning of the data, and if not, returning to the step (4-2) to continue executing.
As shown in fig. 6, in an embodiment of the present invention, there is also provided a multi-source heterogeneous data cleaning apparatus, including: the system comprises a data acquisition and preprocessing module, a rule set construction module, an integrity checking and repairing module, a consistency checking and repairing module, a uniqueness checking and repairing module and a consistency secondary checking module.
The data acquisition and preprocessing module is used for acquiring multi-source heterogeneous data and converting the data with the same attribute into data with a uniform format;
the rule set building module is used for building a rule set containing conditional function dependence and external constraint on the data from the data acquisition and preprocessing module;
the integrity checking and repairing module is used for checking whether the data acquired by the data acquisition and preprocessing module is missing or not and then dividing the data into a missing tuple set and a complete tuple set; sequentially checking whether missing items of all missing tuples in the missing tuple set are matched with certain rules in the rule set, if so, filling missing data of the current missing tuple by using the rules, otherwise, filling the missing data of the current missing tuple by using a mixed filling algorithm based on improved KNN;
the consistency checking and repairing module is used for checking whether the data processed by the integrity checking and repairing module violates the rules in the rule set or not, if so, determining the repairing rules, and taking all tuples violating the repairing rules as abnormal tuples; selecting the tuple with the minimum repairing cost from all abnormal tuples as a target tuple, and repairing other abnormal tuples by using the target tuple;
the uniqueness checking and repairing module is used for checking and deleting repeated tuples on the data by using an improved SNM algorithm based on a mixed distance and a dynamic window for the data processed by the consistency checking and repairing module;
and the consistency secondary checking module is used for checking whether the data processed by the uniqueness checking and repairing module meets the consistency condition, finishing the cleaning of the data if the data meets the consistency condition, and returning to the consistency checking and repairing module to execute again if the data does not meet the consistency condition.
The above are merely specific embodiments of the present invention, and are not intended to limit the present invention. It will be apparent to those skilled in the art that the present application is susceptible to many modifications and variations in view of the specific application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (8)

1. A multi-source heterogeneous data cleaning method is characterized by comprising the following steps:
step 1: the method comprises the steps of obtaining multi-source heterogeneous data, converting the data with the same attribute into a uniform data format, and obtaining a data set consisting of a plurality of tuples;
step 2: constructing the conditional function dependencies existing among different attributes of the data processed in step 1, and then adding the conditional function dependencies Σ_cfd and the external constraints Σ_fc to the rule set Σ, where each rule in the rule set Σ corresponds to a certain conditional function dependency or a certain external constraint;
and step 3: carrying out integrity check and integrity repair on all tuples in the data processed in the step 1;
3-1 integrity check
Sequentially traverse all tuples in the data processed in step 1, judging whether the current tuple has missing data; if so, add the current tuple to the missing tuple set T_L, otherwise add it to the complete tuple set T_C;
3-2 integrity repair
Traverse the missing tuple set T_L in order, checking whether the missing items of the current missing tuple match some rule in the rule set Σ; if so, fill the missing data of the current missing tuple using that rule, otherwise fill it using the improved KNN-based hybrid filling algorithm;
the improved KNN-based hybrid filling algorithm comprises the following specific steps:
1) dividing the non-missing data columns of the current missing tuple into 5 types of missing subclass tuples: numerical, binary, ordinal, categorical, and text;
2) dividing the data columns of the complete tuple set T_C that correspond to each type in the current missing tuple into 5 types of complete subclass tuples;
3) respectively calculating the subclass distance between each type of missing tuple and the complete tuple;
4) calculating the tuple distance between the missing tuple t1 and the complete tuple t2;
5) sorting the tuple distances between the missing tuples and the complete tuples in an increasing way;
6) selecting the first k complete tuples with the minimum tuple distance as a target tuple set;
7) selecting data with the most frequency in the corresponding columns of the target tuple set and the missing tuple missing items as filling values of missing data of the missing tuples;
and 4, step 4: carrying out consistency check and consistency repair on all tuples in the data processed in the step 3;
4-1 consistency check:
sequentially traversing all tuples, checking whether the current tuple is matched with all rules in the rule set in the step 2, if so, continuously checking the next tuple, otherwise, adding the rule violated by the current tuple to the abnormal rule set sigma';
4-2, consistency repair;
and 5: performing uniqueness check and repair on all tuples in the data processed in the step 4;
step 6: and (4) rechecking the data processed in the step (5) to determine whether all tuples are matched with all rules in the rule set, if so, all tuples meet the consistency condition to finish the cleaning of the data, and if not, returning to the step (4-2) to continue the execution.
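The KNN-based hybrid filling of steps 1)–7) in claim 1 (rank the complete tuples by mixed tuple distance, then take the most frequent value among the k nearest) can be sketched as follows; the dict-based tuple representation and the caller-supplied `tuple_dist` (implementing the mixed distance of claim 3) are assumptions for illustration:

```python
from collections import Counter

def knn_fill(missing_tuple, complete_tuples, missing_col, tuple_dist, k):
    """Fill one missing cell: pick the k complete tuples nearest to the
    missing tuple and return the most frequent value they hold in the
    missing column (steps 5-7 of the hybrid filling algorithm)."""
    nearest = sorted(complete_tuples,
                     key=lambda t: tuple_dist(missing_tuple, t))[:k]
    values = [t[missing_col] for t in nearest]
    return Counter(values).most_common(1)[0][0]
```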
2. The method for cleaning multi-source heterogeneous data according to claim 1, wherein the step 3) in the integrity repair of the step 3-2 is specifically:
for the numerical subclass tuple, calculating a subclass distance between the missing subclass tuple and the complete subclass tuple by using a standardized Euclidean distance formula (1);
D(L,C)_num = sqrt( Σ_{i=1}^{n} ( (x_Li − x_Ci) / s_i )^2 )    formula (1)
where n denotes the total number of numerical data in the subclass tuple, x_Li the i-th datum of the missing subclass tuple, x_Ci the i-th datum of the complete subclass tuple, and s_i the standard deviation of all values in the i-th data column of the subclass tuple;
for binary subclass tuples, calculating subclass distances between missing subclass tuples and complete subclass tuples by using a formula (2);
D(L,C)_bin = (q + r) / (p + q + r + s)    formula (2)
if the two values of binary data are regarded as 0 and 1, p denotes the number of positions at which both the missing subclass tuple and the complete subclass tuple are 1, q the number at which the missing subclass tuple is 0 and the complete subclass tuple is 1, r the number at which the missing subclass tuple is 1 and the complete subclass tuple is 0, and s the number at which both are 0;
for ordinal type subclass tuples, firstly, converting ordinal data in tuples into numerical data by using a formula (3), and then calculating subclass distances between missing subclass tuples and complete subclass tuples by using a numerical tuple distance formula (4);
X_i = M_i / (N_i − 1)    formula (3)
D(L,C)_ordi = D(L,C)_num    formula (4)
wherein, if all values of the i-th data column of the subclass tuple are regarded as a sequence numbered from 0 to N, N_i denotes the total number of sequence numbers of the i-th column, M_i denotes the sequence number of the current value within that sequence, and X_i denotes the converted numerical value;
for the categorical subclass tuple, calculating the subclass distance between the missing subclass tuple and the complete subclass tuple using formula (5);
D(L,C)_cat = (T − E) / T    formula (5)
since the missing subclass tuple and the complete subclass tuple contain the same number of data, T denotes the total number of data in either subclass tuple, and E denotes the number of positions at which the corresponding data in the two subclass tuples are identical;
for the text type subclass tuples, calculating the distance between character string data by using an edit distance formula (6), and then calculating the subclass distance between the missing subclass tuples and the complete subclass tuples by using a formula (7) and carrying out normalization processing;
D_i(L_j, C_k)_text = Max(j, k), if Min(j, k) = 0; otherwise
D_i(L_j, C_k)_text = Min( D_i(L_{j−1}, C_k)_text + 1, D_i(L_j, C_{k−1})_text + 1, D_i(L_{j−1}, C_{k−1})_text + [L_j ≠ C_k] )    formula (6)
D(L,C)_text = (1/m) · Σ_{i=1}^{m} ( D_i(L,C)_text / Max(U_i, V_i) )    formula (7)
wherein D_i(L,C)_text denotes the edit distance between the i-th character string in the missing subclass tuple and that in the complete subclass tuple; L_j and C_k denote the prefixes consisting of the first j and k characters of the i-th character string in the missing and complete subclass tuples, respectively, with 0 ≤ j ≤ U_i and 0 ≤ k ≤ V_i, and Min denotes the minimum function; since the missing subclass tuple and the complete subclass tuple contain the same number of data, m denotes the total number of character strings in either subclass tuple, U_i and V_i denote the total lengths of the i-th character string in the missing and complete subclass tuples, respectively, and Max denotes the maximum function.
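The text-type subclass distance of formulas (6) and (7) can be sketched as follows (a standard Levenshtein dynamic program plus length normalization; function and variable names are illustrative):

```python
def edit_distance(s, t):
    """Levenshtein edit distance, the recurrence behind formula (6)."""
    prev = list(range(len(t) + 1))
    for j, cs in enumerate(s, 1):
        cur = [j]
        for k, ct in enumerate(t, 1):
            cur.append(min(prev[k] + 1,                 # delete
                           cur[k - 1] + 1,              # insert
                           prev[k - 1] + (cs != ct)))   # substitute
        prev = cur
    return prev[-1]

def text_subclass_dist(missing_strs, complete_strs):
    """Formula (7): average edit distance, each term normalized by the
    length of the longer of the two strings being compared."""
    m = len(missing_strs)
    return sum(edit_distance(a, b) / max(len(a), len(b), 1)
               for a, b in zip(missing_strs, complete_strs)) / m
```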
3. The method for cleaning multi-source heterogeneous data according to claim 2, wherein the step 4) in the integrity repair in the step 3-2 is specifically:
The tuple distance between the missing tuple t1 and the complete tuple t2 is obtained by multiplying each of the above 5 types of subclass distances by its corresponding weight W_i and summing the products, as given by formula (8) and formula (9);
D(t1, t2) = Σ_{i=1}^{5} W_i · D_i(t1, t2)    formula (8)
W_i = Y_i / Y    formula (9)
where i indexes the 5 types of subclass tuples, W_i denotes the weight coefficient of the i-th type of subclass tuple in the current tuple, D_i(t1, t2) denotes the subclass distance between the i-th missing subclass tuple and complete subclass tuple, Y denotes the total number of data in the current tuple, and Y_i denotes the number of type-i data in the current tuple.
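Formulas (8) and (9) amount to a weighted average of the five subclass distances, with each weight proportional to how many columns of that type the tuple contains; a minimal sketch:

```python
def tuple_distance(subclass_dists, counts):
    """Formulas (8)-(9): subclass_dists[i] is D_i(t1, t2); counts[i] is
    Y_i, the number of type-i data columns; Y = sum(counts). The weight
    of each subclass is its share of the tuple's columns."""
    total = sum(counts)  # Y, the total number of data in the tuple
    return sum(d * (y / total) for d, y in zip(subclass_dists, counts))
```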
4. The method according to claim 3, wherein the k value in step 6) of the integrity repair in step 3-2 is obtained by:
6-1) dividing all complete tuples into a test tuple set and a training tuple set;
6-2) dividing the training tuple set into n sub-tuple sets with the same size;
6-3) taking each sub-tuple set in turn as the complete tuple set T_C, and repairing the current missing tuple with each training k value from 1 to 100;
6-4) obtaining the k value with the highest repairing accuracy in each sub-tuple set;
6-5) the average of these n k values is used as the repair k value for the set of test tuples.
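The k-selection procedure of steps 6-1) to 6-5) can be sketched as follows, assuming an `accuracy(fold, k)` callback that performs the trial repairs on one sub-tuple set and scores them (its signature is an assumption for illustration):

```python
from statistics import mean

def choose_k(folds, accuracy, k_range=range(1, 101)):
    """Steps 6-2..6-5: for each of the n sub-tuple sets, find the
    training k (1..100) with the best repair accuracy, then return the
    average of those n best values as the repair k."""
    best_ks = [max(k_range, key=lambda k: accuracy(fold, k))
               for fold in folds]
    return round(mean(best_ks))
```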
5. The method according to claim 1, wherein the consistency repair in step 4 mainly comprises determining a rule repair order, locating an abnormal tuple, and selecting a target tuple; the method comprises the following steps:
4-2-1 determining a rule repair order
1) Constructing a rule sequence diagram G(V, E) by taking the conditional function dependencies in the abnormal rule set Σ' as the nodes V and the dependency relationships between nodes as the edges E, where V = Σ'; for any two conditional function dependencies φ_i: (L_i → R_i) and φ_j: (L_j → R_j), if R_i ∩ L_j ≠ ∅, then there is an edge from φ_i pointing to φ_j, i.e. φ_j depends on φ_i, where L and R denote the left-hand and right-hand sides, respectively, of a conditional function dependency;
2) sequentially select nodes with an in-degree of 0 (i.e., conditional function dependencies) in the rule sequence diagram as priority repair rules, adding them to the repair rule set Σ_rep, then delete each selected node and the edges connected to it, until no nodes remain in the rule sequence diagram G(V, E); if the rule sequence diagram is not empty and contains no node with an in-degree of 0, select from all combinations of conditional function dependencies in the diagram the combination with the minimum total repair cost as the repair rule set Σ_rep;
4-2-2 location anomaly tuples
Sequentially traverse all rules in the repair rule set Σ_rep, adding every tuple that violates the current rule to the abnormal tuple set T_e;
4-2-3 select target tuples
Combining the external constraint rules Σ_fc in the rule set Σ, select from the abnormal tuple set T_e the tuple with the smallest repair cost as the target tuple, and use the target tuple to repair the other abnormal tuples.
6. The method for cleaning multi-source heterogeneous data according to claim 1, wherein in step 5, an improved SNM algorithm based on a mixed distance and a dynamic window is used for the data processed in step 4, and whether the tuple distance between the first tuple in the sliding window and other tuples in the window is smaller than a set distance threshold is checked; if yes, the two tuple data are considered to be similar and repeated, the repeated tuple in the window is deleted, and if not, the first tuple and other tuples are considered to meet the uniqueness condition; and moving out the first tuple in the sliding window and moving in the next tuple of the last tuple in the window, and repeating the steps until all tuples finish uniqueness check, so as to realize data uniqueness check and repair.
7. The method for cleaning multi-source heterogeneous data according to claim 6, wherein the improved SNM algorithm based on the hybrid distance and the dynamic window comprises the following specific steps:
5-1) selecting one or more data for all tuples, calculating corresponding key values of the data and using the key values as sorting keywords;
5-2) sorting all tuples according to the sorting keywords;
5-3) setting a sliding window with the initial size of N and the step length of 1 on the sorted tuples, calculating the tuple distance between the first tuple in the sliding window and other tuples in the window, deleting repeated tuples in the sliding window if at least one tuple distance is smaller than a distance threshold value, and otherwise moving the sliding window by one step length;
5-4) calculating the ratio of the tuple distance between head and tail tuples in the sliding window to the tuple number in the sliding window, taking the ratio as the window average density, increasing the size of the sliding window if the window average density is smaller than a density threshold, keeping the size of the sliding window unchanged if the window average density is equal to the density threshold, decreasing the size of the sliding window if the window average density is larger than the density threshold, and continuing to slide until all tuple inspection is finished.
8. A multi-source heterogeneous data cleaning device is characterized by comprising the following modules:
the data acquisition and preprocessing module is used for acquiring multi-source heterogeneous data and converting the data with the same attribute into data with a uniform format;
the rule set building module is used for building a rule set containing conditional function dependence and external constraint on the data from the data acquisition and preprocessing module;
the integrity checking and repairing module is used for checking whether the data acquired by the data acquisition and preprocessing module is missing or not and then dividing the data into a missing tuple set and a complete tuple set; sequentially checking whether missing items of all missing tuples in the missing tuple set are matched with certain rules in the rule set, if so, filling missing data of the current missing tuple by using the rules, otherwise, filling the missing data of the current missing tuple by using a mixed filling algorithm based on improved KNN;
the consistency checking and repairing module is used for checking whether the data processed by the integrity checking and repairing module violates the rules in the rule set or not, if so, determining the repairing rules, and taking all tuples violating the repairing rules as abnormal tuples; selecting the tuple with the minimum repairing cost from all abnormal tuples as a target tuple, and repairing other abnormal tuples by using the target tuple;
the uniqueness checking and repairing module is used for checking and deleting repeated tuples on the data by using an improved SNM algorithm based on a mixed distance and a dynamic window for the data processed by the consistency checking and repairing module;
and the consistency secondary checking module is used for checking whether the data processed by the uniqueness checking and repairing module meets the consistency condition, finishing the cleaning of the data if the data meets the consistency condition, and returning to the consistency checking and repairing module to execute again if the data does not meet the consistency condition.
CN202111577423.6A 2021-12-22 2021-12-22 Multi-source heterogeneous data cleaning method and device Active CN114281809B (en)

Publications (2)

Publication Number Publication Date
CN114281809A true CN114281809A (en) 2022-04-05
CN114281809B CN114281809B (en) 2023-03-28



