CN114281809A - Multi-source heterogeneous data cleaning method and device - Google Patents

Multi-source heterogeneous data cleaning method and device

Info

Publication number
CN114281809A
Authority
CN
China
Prior art keywords
tuple
data
tuples
missing
subclass
Prior art date
Legal status
Granted
Application number
CN202111577423.6A
Other languages
Chinese (zh)
Other versions
CN114281809B (en)
Inventor
刘峰
张纪林
陈军相
袁俊峰
刘涛
金峻帆
钱瑞祥
Current Assignee
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN202111577423.6A
Publication of CN114281809A
Application granted
Publication of CN114281809B
Legal status: Active
Anticipated expiration

Abstract

The invention discloses a multi-source heterogeneous data cleaning method and device that address the invalid and low-quality data repairs caused by an improper cleaning order across multiple data-quality dimensions. Starting from multiple data-quality dimensions in the smart-campus setting, the invention safeguards the effectiveness of the overall data cleaning by standardizing the order in which data are checked and repaired. During data repair, currently known campus-internal knowledge is used as external constraint conditions to expand the repair rule set and improve the accuracy of data cleaning. In smart-campus construction, the cleaned campus data can be applied effectively throughout data governance, data opening, and data mining and analysis in colleges and universities. The method avoids the consistency problems that data repair introduces under multiple data-quality dimensions and greatly improves data availability.

Description

Multi-source heterogeneous data cleaning method and device
Technical Field
The invention relates to the field of computer technology, in particular to a multi-source heterogeneous data cleaning method and device, and more particularly to a method for checking and repairing data that suffers from data-quality problems such as integrity, consistency, and uniqueness in the field of data cleaning.
Background
With the rapid development of information technology, data is growing explosively in the background of the big data era. In the process of integrating multi-source heterogeneous data, any improper operation can cause a series of data quality problems. In the field of data mining, the data quality determines whether more valuable knowledge can be mined from massive and complex data, and therefore more reliable and accurate decision support is provided for users.
At present, the industry measures data quality mainly along six dimensions: completeness, consistency, uniqueness, accuracy, validity, and timeliness. Most traditional research on data quality targets only a single dimension, or ignores the correlations that exist among multiple dimensions, so the usability of the cleaned data is low. Real-world data tend to be multi-dimensional, and the dimensions are not fully independent of one another. Traditional single-dimension, simplistic cleaning methods and devices are therefore no longer suitable for the multi-dimensional data-quality problems of today's complex scenarios.
Disclosure of Invention
Aiming at the problems, the invention provides a multi-source heterogeneous data cleaning method and device, and aims to solve the problem of multi-dimensional data quality caused by omission, constraint violation, repeated operation and the like when an operator collects and inputs data in real life. By the method and the device, data cleaning of data in three quality dimensions of integrity, consistency and uniqueness can be completed, and the usability of the data is improved.
In order to achieve the purpose, the invention provides a multi-source heterogeneous data cleaning method, which comprises the following specific steps:
Step 1: obtain multi-source heterogeneous data, convert data with the same attribute into a uniform data format, and obtain a data set consisting of a number of tuples, where each tuple consists of one datum for every attribute;
Multi-source means that the data sources are diverse; heterogeneous means that the data differ in type, characteristics, and so on;
Step 2: construct the conditional function dependencies that hold among the different attributes of the data processed in step 1, then add the conditional function dependencies Σ_cfd and the external constraints Σ_fc to the rule set Σ, where each rule in Σ corresponds to one conditional function dependency or one external constraint;
The external constraints are manually specified constraints on the data, such as hard constraints, quantity constraints, and equivalence constraints;
Step 3: perform an integrity check and integrity repair on all tuples of the data processed in step 1;
3-1 Integrity check
Traverse all tuples of the data processed in step 1 in order and judge whether the current tuple has missing data; if so, add it to the missing tuple set T_L, otherwise add it to the complete tuple set T_C;
3-2 Integrity repair
Traverse the missing tuple set T_L in order and check whether the missing items of the current missing tuple match some rules in the rule set Σ (i.e., conditional function dependencies Σ_cfd and/or external constraints Σ_fc); if so, fill the missing data of the current missing tuple using those rules, otherwise fill them with the improved-KNN-based hybrid filling algorithm;
the improved KNN-based hybrid filling algorithm comprises the following specific steps:
1) divide the non-missing data columns of the current missing tuple into 5 types of missing subclass tuples: numerical (num), binary (dual), ordinal (ordi), categorical (cate), and text (text);
2) divide the data columns of the complete tuple set T_C that correspond to each type of subclass tuple of the current missing tuple into 5 types of complete subclass tuple sets;
3) respectively calculating the subclass distance between each type of missing subclass tuple and the complete subclass tuple;
For numerical subclass tuples, calculate the subclass distance between the missing subclass tuple and the complete subclass tuple with the standardized Euclidean distance, formula (1);

$$D(L,C)_{num}=\sqrt{\sum_{i=1}^{n}\left(\frac{x_{Li}-x_{Ci}}{s_{i}}\right)^{2}}\tag{1}$$

where n is the total number of numerical data in the subclass tuple, x_{Li} is the i-th datum of the missing subclass tuple, x_{Ci} is the i-th datum of the complete subclass tuple, and s_i is the standard deviation of all values in the i-th data column of the subclass tuple;
For binary subclass tuples, calculate the subclass distance between the missing subclass tuple and the complete subclass tuple with formula (2);

$$D(L,C)_{dual}=\frac{q+r}{p+q+r+s}\tag{2}$$

where, regarding the two values of the binary data as 0 and 1, p is the number of positions whose corresponding data are 1 in both the missing and the complete subclass tuple, q is the number of positions that are 0 in the missing subclass tuple and 1 in the complete subclass tuple, r is the number of positions that are 1 in the missing subclass tuple and 0 in the complete subclass tuple, and s is the number of positions whose corresponding data are 0 in both;
For ordinal subclass tuples, first convert the ordinal data in the tuple into numerical data with formula (3), then calculate the subclass distance between the missing subclass tuple and the complete subclass tuple with the numerical tuple distance, formula (4);

$$X_{i}=\frac{M_{i}}{N_{i}-1}\tag{3}$$

$$D(L,C)_{ordi}=D(L,C)_{num}\tag{4}$$

where, if all values of the i-th data column of the ordinal subclass tuple are regarded as a sequence numbered from 0 upward, N_i is the total number of serial numbers of the i-th column, M_i is the serial number of the current value within that sequence, and X_i is the converted numerical datum;
For categorical subclass tuples, calculate the subclass distance between the missing subclass tuple and the complete subclass tuple with formula (5);

$$D(L,C)_{cate}=\frac{T-E}{T}\tag{5}$$

where, since the missing subclass tuple and the complete subclass tuple contain the same number of data, T is the total number of data in either subclass tuple and E is the number of positions whose corresponding data are identical in both;
For text subclass tuples, first calculate the distance between string data with the edit distance, formula (6), then calculate the subclass distance between the missing subclass tuple and the complete subclass tuple with formula (7), which also normalizes the result;

$$D_{i}(L_{j},C_{k})_{text}=\begin{cases}\max(j,k)&\min(j,k)=0\\\min\begin{cases}D_{i}(L_{j-1},C_{k})+1\\D_{i}(L_{j},C_{k-1})+1\\D_{i}(L_{j-1},C_{k-1})+[L_{j}\neq C_{k}]\end{cases}&\text{otherwise}\end{cases}\tag{6}$$

$$D(L,C)_{text}=\frac{1}{m}\sum_{i=1}^{m}\frac{D_{i}(L,C)_{text}}{\mathrm{Max}(U_{i},V_{i})}\tag{7}$$

where D_i(L,C)_{text} is the edit distance between the i-th string datum of the missing subclass tuple and that of the complete subclass tuple, L_j and C_k are the first j and first k characters of the i-th string datum in the missing and complete subclass tuple respectively (0 ≤ j ≤ U_i, 0 ≤ k ≤ V_i), and Min is the minimum function; since the missing subclass tuple and the complete subclass tuple contain the same number of data, m is the total number of string data in either subclass tuple, U_i and V_i are the total lengths of the i-th string datum in the missing and complete subclass tuple respectively, and Max is the maximum function;
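Under the assumptions above (equal-length subclass tuples, pre-computed column standard deviations), the five per-type distance measures can be sketched in Python as follows; function names and arguments are the author's illustrative notation, not part of the patent:

```python
# Sketch of the subclass distances of formulas (1), (2), (5), (6) and (7).
# Each pair L, C is a missing / complete subclass tuple of one type;
# the ordinal case reduces to d_num after the formula (3) conversion.
import math

def d_num(L, C, s):
    """Formula (1): standardized Euclidean distance; s[i] is the
    standard deviation of the i-th numerical column."""
    return math.sqrt(sum(((l - c) / si) ** 2 for l, c, si in zip(L, C, s)))

def d_dual(L, C):
    """Formula (2): mismatch ratio (q + r) / (p + q + r + s) over 0/1 data."""
    mismatches = sum(1 for l, c in zip(L, C) if l != c)  # q + r
    return mismatches / len(L)                           # p+q+r+s = len(L)

def d_cate(L, C):
    """Formula (5): (T - E) / T, the share of differing categorical values."""
    E = sum(1 for l, c in zip(L, C) if l == c)
    return (len(L) - E) / len(L)

def edit_distance(a, b):
    """Formula (6): Levenshtein distance via the classic DP recurrence."""
    prev = list(range(len(b) + 1))
    for j, ca in enumerate(a, 1):
        cur = [j]
        for k, cb in enumerate(b, 1):
            cur.append(min(prev[k] + 1,                # deletion
                           cur[-1] + 1,                # insertion
                           prev[k - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def d_text(L, C):
    """Formula (7): edit distance normalized by the longer string, averaged."""
    return sum(edit_distance(l, c) / max(len(l), len(c))
               for l, c in zip(L, C)) / len(L)
```

Each function returns a value in a comparable range, which is what allows the weighted combination of formula (8) to mix the five types.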
4) calculate the tuple distance between the missing tuple t_1 and the complete tuple t_2;
The tuple distance between the missing tuple t_1 and the complete tuple t_2 is obtained by multiplying each of the 5 subclass distances above by its corresponding weight W_i and summing the products, as in formulas (8) and (9);

$$D(t_{1},t_{2})=\sum_{i=1}^{5}W_{i}\cdot D_{i}(t_{1},t_{2})\tag{8}$$

$$W_{i}=\frac{Y_{i}}{Y}\tag{9}$$

where i indexes the 5 types of subclass tuple, W_i is the weight coefficient of the i-th type of subclass tuple within the current tuple, and D_i(t_1,t_2) is the subclass distance between the i-th missing subclass tuple and the corresponding complete subclass tuple; Y is the total number of data in the current tuple and Y_i is the number of data of the i-th type in the current tuple;
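As a minimal sketch, formulas (8) and (9) amount to a weighted average of the subclass distances, with each weight equal to that type's share of the tuple's columns (names here are illustrative):

```python
# Formulas (8) and (9): D(t1, t2) = sum_i W_i * D_i(t1, t2) with W_i = Y_i / Y.
def tuple_distance(sub_distances, counts):
    """sub_distances maps a type to D_i(t1, t2); counts maps it to Y_i,
    the number of columns of that type in the tuple."""
    Y = sum(counts.values())  # total number of data in the tuple
    return sum((counts[t] / Y) * d for t, d in sub_distances.items())

# 2 numerical columns at distance 0.25 plus 2 text columns at distance 0.5:
print(tuple_distance({"num": 0.25, "text": 0.5}, {"num": 2, "text": 2}))  # 0.375
```

Because the weights sum to 1, the tuple distance stays on the same scale as the individual subclass distances.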
5) sort the tuple distances between the missing tuple and the complete tuples in increasing order;
6) select the first k complete tuples, those with the smallest tuple distances, as the target tuple set;
the k value is obtained by training, and the specific steps are as follows:
6-1) dividing all complete tuples into a test tuple set and a training tuple set;
6-2) dividing the training tuple set into n sub-tuple sets with the same size;
6-3) take each sub-tuple set in turn as the complete tuple set T_C and repair the current missing tuple using each of 1 to 100 as the training k value;
6-4) obtaining the k value with the highest repairing accuracy in each sub-tuple set;
6-5) taking the average value of the n k values as a repairing k value of the test tuple set;
To ensure that the training k value can range over 1 to 100, each sub-tuple set must contain at least 100 tuples when the training tuple set is divided;
7) select the most frequent datum in the column of the target tuple set that corresponds to the missing item of the missing tuple as the filling value for the missing data;
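Steps 5) to 7) above can be sketched as follows; `distance` stands in for the tuple distance of formula (8), and the field names are illustrative assumptions:

```python
# Sketch of steps 5)-7): rank the complete tuples by tuple distance, keep the
# k nearest as the target tuple set, and fill the missing item with the most
# frequent value in the corresponding column.
from collections import Counter

def knn_fill(missing_tuple, complete_tuples, missing_attr, k, distance):
    targets = sorted(complete_tuples, key=lambda c: distance(missing_tuple, c))[:k]
    return Counter(c[missing_attr] for c in targets).most_common(1)[0][0]

rows = [{"age": 20, "city": "Hangzhou"},
        {"age": 21, "city": "Hangzhou"},
        {"age": 35, "city": "Beijing"}]
toy_distance = lambda a, b: abs(a["age"] - b["age"])  # stand-in for formula (8)
print(knn_fill({"age": 19, "city": None}, rows, "city", 2, toy_distance))  # Hangzhou
```

The two nearest neighbours of the 19-year-old tuple are the Hangzhou rows, so "Hangzhou" is the majority fill value.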
Step 4: perform a consistency check and consistency repair on all tuples of the data processed in step 3;
4-1 Consistency check
Traverse all tuples in order and check whether the current tuple matches every rule in the rule set of step 2; if so, continue with the next tuple, otherwise add the rules violated by the current tuple (i.e., conditional function dependencies Σ_cfd and/or external constraints Σ_fc) to the abnormal rule set Σ';
4-2 consistency repair
The consistency repair mainly comprises 3 processes: determining the rule repair order, locating abnormal tuples, and selecting the target tuple;
4-2-1 determining a rule repair order
1) Construct a rule sequence diagram G(V,E), taking the conditional function dependencies in the abnormal rule set Σ' as the nodes V and the dependency relationships between nodes as the edges E, where V = Σ'. For any two conditional function dependencies φ_1, φ_2 ∈ Σ', if R(φ_1) ∩ L(φ_2) ≠ ∅, then there is an edge from φ_1 pointing to φ_2, i.e. φ_1 and φ_2 have a dependency relationship, where L and R denote the left-hand and right-hand sides of a conditional function dependency, respectively;
2) In turn, select the nodes (i.e., conditional function dependencies) with in-degree 0 in the rule sequence diagram as priority repair rules and add them to the repair rule set Σ_rep, then delete each such node and the edges connected to it, until no nodes remain in the rule sequence diagram G(V,E); if the rule sequence diagram is not empty yet contains no node with in-degree 0, select from all combinations of the conditional function dependencies in the diagram the combination with the minimum total repair cost as the repair rule set Σ_rep;
In-degree 0 means that no edge in the rule sequence diagram points to the node;
the repair cost refers to the total number of times of modifying tuple data generated when one tuple is used for performing consistency repair on all abnormal tuples violating the current rule;
4-2-2 location anomaly tuples
Traverse all rules of the repair rule set Σ_rep in order, adding every tuple that violates the current rule to the abnormal tuple set T_e;
4-2-3 select target tuples
Combining the external constraint rules Σ_fc of the rule set Σ, select from the abnormal tuple set T_e the tuple with the minimum repair cost as the target tuple, and repair the other abnormal tuples with the target tuple;
Step 5: perform a uniqueness check and repair on all tuples of the data processed in step 4.
For the data processed in step 4, use the improved SNM algorithm based on a hybrid distance and a dynamic window to check whether the tuple distance between the first tuple in the sliding window and each other tuple in the window is smaller than a set distance threshold; if so, the two tuples are considered similar duplicates and the duplicate tuple in the window is deleted, otherwise the first tuple and the other tuples are considered to satisfy the uniqueness condition; then move the first tuple out of the sliding window and move in the tuple following the last tuple in the window, and repeat until all tuples have passed the uniqueness check, completing the uniqueness check and repair of the data;
the improved SNM algorithm based on the hybrid distance and the dynamic window specifically comprises the following steps:
5-1) selecting one or more data for all tuples, calculating corresponding key values of the data and using the key values as sorting keywords;
5-2) sorting all tuples according to the sorting keywords;
5-3) setting a sliding window with the initial size of N and the step length of 1 on the sorted tuples, calculating the tuple distance between the first tuple in the sliding window and other tuples in the window according to a formula (8), deleting repeated tuples in the sliding window if at least one tuple distance is smaller than a distance threshold, and otherwise moving the sliding window by one step length to move the first tuple out of the sliding window and move the next tuple of the last tuple in the sliding window in;
5-4) calculate the ratio of the tuple distance between the head and tail tuples of the sliding window to the number of tuples in the window, and take it as the window average density; if the window average density is smaller than the density threshold, increase the sliding window size; if it equals the density threshold, keep the size unchanged; if it is larger than the density threshold, reduce the size; continue sliding until all tuples have been checked;
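A compact sketch of the pass described in steps 5-1) to 5-4), assuming a `distance` function in place of formula (8) and illustrative thresholds:

```python
# Improved-SNM sketch: sort by a key, slide a window, delete tuples that are
# within dist_thresh of the window head, and grow/shrink the window from the
# window average density (head-tail distance over window size).
def snm_dedup(tuples, key, distance, dist_thresh, dens_thresh, init_window=4):
    records = sorted(tuples, key=key)           # steps 5-1) and 5-2)
    out, w, i = [], init_window, 0
    while i < len(records):
        head = records[i]
        window = records[i + 1:i + w]
        # step 5-3): drop similar duplicates of the window head
        records[i + 1:i + w] = [t for t in window if distance(head, t) >= dist_thresh]
        out.append(head)
        if window:                              # step 5-4): adjust window size
            density = distance(head, window[-1]) / (len(window) + 1)
            if density < dens_thresh:
                w += 1
            elif density > dens_thresh:
                w = max(2, w - 1)
        i += 1                                  # slide by one step
    return out

nums = snm_dedup([1, 1, 2, 3, 3, 8], key=lambda x: x,
                 distance=lambda a, b: abs(a - b),
                 dist_thresh=0.5, dens_thresh=10)
print(nums)  # [1, 2, 3, 8]
```

The toy run uses numbers and absolute difference so the dedup behaviour is easy to follow; on real tuples the key would come from step 5-1) and the distance from formula (8).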
Step 6: recheck whether every tuple of the data processed in step 5 matches every rule in the rule set; if so, all tuples satisfy the consistency condition and the cleaning of the data set is finished; if not, return to step 4-2 and continue execution.
In order to achieve the purpose, the invention also provides a multi-source heterogeneous data cleaning device, which comprises the following specific modules:
the data acquisition and preprocessing module is used for acquiring multi-source heterogeneous data and converting the data with the same attribute into data with a uniform format;
the rule set building module is used for building a rule set containing conditional function dependence and external constraint on the data from the data acquisition and preprocessing module;
the integrity checking and repairing module is used for checking whether the data acquired by the data acquisition and preprocessing module is missing or not and then dividing the data into a missing tuple set and a complete tuple set; sequentially checking whether missing items of all missing tuples in the missing tuple set are matched with certain rules in the rule set, if so, filling missing data of the current missing tuple by using the rules, otherwise, filling the missing data of the current missing tuple by using a mixed filling algorithm based on improved KNN;
the consistency checking and repairing module is used for checking whether the data processed by the integrity checking and repairing module violates the rules in the rule set or not, if so, determining the repairing rules, and taking all tuples violating the repairing rules as abnormal tuples; selecting the tuple with the minimum repairing cost from all abnormal tuples as a target tuple, and repairing other abnormal tuples by using the target tuple;
the uniqueness checking and repairing module is used for checking and deleting repeated tuples on the data by using an improved SNM algorithm based on a mixed distance and a dynamic window for the data processed by the consistency checking and repairing module;
and the consistency secondary checking module is used for checking whether the data processed by the uniqueness checking and repairing module meets the consistency condition, finishing the cleaning of the data if the data meets the consistency condition, and returning to the consistency checking and repairing module to execute again if the data does not meet the consistency condition.
The technical scheme of the invention has the following advantages:
1. compared with the traditional single-dimension data cleaning, the method provided by the invention starts from three data quality dimensions of integrity, consistency and uniqueness, and designs a cleaning method and steps for data of each dimension respectively, so that the overall quality of multi-dimensional data is improved.
2. Compared with traditional data cleaning that relies only on conditional function dependencies, the invention uses not only the conditional function dependencies that exist among the data but also external constraint conditions, expanding the rule set for data cleaning and improving the detection and repair of data quality.
3. Compared with traditional single-type data cleaning, the invention can handle data-quality problems across five mixed data types, numerical, binary, ordinal, categorical, and text, and selects a suitable distance measure for each type, improving the accuracy of data cleaning.
4. Compared with the traditional data cleaning device, the invention avoids the influence of integrity repair on consistency repair and uniqueness repair and the influence of consistency repair on uniqueness repair by designing the standardized data cleaning device, and ensures the effectiveness of data cleaning.
Drawings
FIG. 1 is a schematic diagram illustrating steps of a multi-source heterogeneous data cleaning method according to an embodiment of the present invention;
FIG. 2 is a flow chart of integrity check and repair in an embodiment of the present invention;
FIG. 3 is a flow chart of consistency checking and repair in an embodiment of the present invention;
FIG. 4 is a flow chart of uniqueness checking and repairing in an embodiment of the present invention;
FIG. 5 is a diagram illustrating a dynamic sliding window in an embodiment of the present invention;
FIG. 6 is a block diagram of a multi-source heterogeneous data cleaning apparatus according to an embodiment of the present invention.
Detailed Description
In order to fully and clearly communicate the technical solutions of the embodiments of the present invention to those skilled in the art, the embodiments of the present invention will be described in further detail with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
As shown in fig. 1, a multi-source heterogeneous data cleaning method provided in an embodiment of the present invention includes the following steps:
step 1: the method comprises the steps of obtaining multi-source heterogeneous data, converting the data with the same attribute into a uniform data format, and obtaining a data set formed by a plurality of tuples, wherein one tuple is formed by a group of data with all attributes;
In this step, the data sources are the databases of the campus business systems, which contain multi-source heterogeneous data such as student basic-information data, score data, library access data, and campus-card consumption data. First, create an extraction task with Kettle (an ETL tool) and set the connection configuration of the source and target databases; then create a conversion task to convert fields with the same attribute in all tables into a uniform data format; finally, add the extraction and conversion tasks to a job and execute it to obtain the initial data set.
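As an illustration of the conversion step (the concrete source formats here are assumptions, not taken from the patent), unifying a shared attribute such as a date of birth might look like:

```python
# Hypothetical format unification: the same attribute arrives from different
# business systems in different string formats and is canonicalized.
from datetime import datetime

SOURCE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%Y%m%d"]  # assumed source formats

def unify_date(value):
    for fmt in SOURCE_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {value!r}")

print(unify_date("05/12/2001"))  # 2001-12-05
print(unify_date("20011205"))    # 2001-12-05
```

In the Kettle workflow this logic would live inside the conversion task; the point is only that every source's representation collapses to one canonical format before cleaning begins.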
Step 2: construct the conditional function dependencies that hold among the different attributes of the initial data acquired in step 1, then add the conditional function dependencies Σ_cfd and the external constraints Σ_fc to the rule set Σ, where each rule in Σ corresponds to one conditional function dependency or one external constraint;
in this step, first, a corresponding conditional function dependency is established between attribute fields having an association relationship in all data tables, for example, the personal identification number may determine the age, date of birth, etc., and then added to the rule set. Secondly, some external constraint conditions which can be artificially determined in the business department, such as the number of students in each province, the proportion of men and women, and the like of a certain college, are also added into the rule set. The rule set is specifically defined as follows:
Given a student basic-information data instance I: (school number, name, age, date of birth, identification number, province, city, zip code), the conditional function dependency set is Σ_cfd = ∑φ_i and the external constraint set is Σ_fc = ∑ψ_i, so the rule set is Σ = Σ_cfd ∪ Σ_fc. For a conditional function dependency φ: X → Y, X and Y are different attribute fields of the data table, meaning that for any two tuples (t_1, t_2), if t_1[X] = t_2[X] then t_1[Y] = t_2[Y]. Conversely, if t_1[X] = t_2[X] but t_1[Y] ≠ t_2[Y], then tuples t_1 and t_2 have a consistency error on rule φ.
For example, for a student basic-information table, the conditional function dependencies that can first be established are as follows:
φ_1: identification number → age, date of birth
φ_2: zip code → city, province
φ_3: school number → name
Secondly, from the known enrollment information of the school and of the computer school, the external constraints that can be determined are as follows:
ψ_1: the number of students from Hangzhou does not exceed 100
ψ_2: the male-to-female ratio in the computer school is not less than 3:1
And finally, combining the conditional function dependence and the external constraint condition to obtain a required rule set. It should be noted that the rules described above, such as conditional function dependencies and external constraints, are only used to describe the establishment of the rule set, and are not used to limit the rule set.
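A conditional function dependency of the φ_2 kind can be checked mechanically; the following sketch (tuples as dicts, field names illustrative) flags every pair of tuples that agrees on the left-hand side but disagrees on the right-hand side:

```python
# CFD violation check: t1[X] = t2[X] but t1[Y] != t2[Y] is a consistency error.
from itertools import combinations

def cfd_violations(tuples, lhs, rhs):
    return [(t1, t2) for t1, t2 in combinations(tuples, 2)
            if all(t1[a] == t2[a] for a in lhs)
            and any(t1[a] != t2[a] for a in rhs)]

students = [
    {"zip": "310018", "city": "Hangzhou", "province": "Zhejiang"},
    {"zip": "310018", "city": "Ningbo",   "province": "Zhejiang"},  # violates phi_2
    {"zip": "100080", "city": "Beijing",  "province": "Beijing"},
]
print(len(cfd_violations(students, lhs=["zip"], rhs=["city", "province"])))  # 1
```

The first two tuples share a zip code but name different cities, so they form one violating pair under φ_2.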
Step 3: traverse all tuples of the data processed in step 1 in order and judge whether the current tuple has missing data; if so, add it to the missing tuple set T_L, otherwise add it to the complete tuple set T_C. Then, using the rule set of step 2, the complete tuple set T_C, and the improved-KNN-based hybrid filling algorithm, repair the missing tuples; the integrity repair flow is shown in FIG. 2;
the step specifically includes two processes of integrity check and integrity repair.
Integrity check: traverse all tuples in the data in order and judge whether the current tuple has missing data; if so, add it to the missing tuple set T_L, otherwise add it to the complete tuple set T_C, so that all tuples containing missing data are detected.
Integrity repair: after the integrity check, traverse the missing tuple set T_L in order and check whether the missing items of the current missing tuple match some rules in the rule set Σ (i.e., conditional function dependencies Σ_cfd and/or external constraints Σ_fc); if so, fill the missing data of the current missing tuple using those rules, otherwise fill them with the improved-KNN-based hybrid filling algorithm.
Step 4: traverse all tuples of the data processed in step 3 in order, check whether each tuple matches every rule in the rule set of step 2, and record the conditional function dependencies and/or external constraints violated by tuples with consistency errors. Then repair the erroneous data of the erroneous tuples according to the rule repair order and the target tuple, completing the consistency check and repair; the consistency repair flow is shown in FIG. 3;
the step specifically comprises two processes of consistency check and consistency repair.
Consistency check: traverse all tuples in the data in order and check whether the current tuple matches every rule in the rule set Σ of step 2; if so, continue with the next tuple, otherwise add the rules violated by the current tuple (conditional function dependencies Σ_cfd and/or external constraints Σ_fc) to the abnormal rule set Σ';
and (3) consistency repair: the method specifically comprises 3 processes of determining a rule repairing sequence, positioning an abnormal tuple and selecting a target tuple.
Determining the rule repair order: since different conditional function dependencies in the abnormal rule set may involve the same attribute fields, the order in which the rules are repaired must be determined, otherwise incorrect repairs may result. In a specific implementation, the rule repair order is determined by constructing a rule sequence diagram and topologically sorting it: in turn, the nodes (conditional function dependencies) with in-degree 0 in the diagram are selected as priority repair rules and added to the repair rule set Σ_rep, and each such node and its connected edges are then deleted, until no nodes remain in the diagram. If the rule sequence diagram is not empty yet contains no node with in-degree 0, the rule combination with the minimum total repair cost among all conditional-function-dependency combinations in the diagram is selected as the repair rule set;
locating abnormal tuples and traversing the repairing rule set sigma in sequencerepAll of the rules in (1), adding all tuples violating the current rule to the abnormal tuple set Te
Selecting the target tuple: the choice of target values for abnormal data is a key problem in consistency repair. Given the abnormal tuple set, choosing different repair target values leads to large differences in the repair results and in the corresponding repair costs. In a specific implementation, the external constraints Σ_fc of the rule set Σ must be combined so that the tuple with the smallest repair cost in the abnormal tuple set T_e is selected as the target tuple with which to repair the other abnormal tuples.
Step 5: for the data processed in step 4, use the improved SNM algorithm based on a hybrid distance and a dynamic window to check whether the tuple distance between the first tuple in the sliding window and each other tuple in the window is smaller than the set distance threshold. If so, the two tuples are considered duplicates and the duplicate tuple in the window is deleted; if not, the first tuple and the other tuples are considered to satisfy the uniqueness condition, the first tuple is moved out of the sliding window, and the tuple following the last tuple in the window is moved in. This is repeated until all tuples have passed the uniqueness check; the uniqueness repair flow is shown in FIG. 4;
In this step, one attribute or a group of attributes is first selected over all tuples in the data set, and a key value is computed for each tuple and used as its sorting key.
All tuples are then sorted by this key, so that similar and duplicate tuples become adjacent in the ordering.
Then, a sliding window with an initial size of N is set on the sorted tuples (as shown in fig. 5), the tuple distance between the first tuple in the window and the other N-1 tuples in the window is calculated, and if the tuple distance between a certain tuple and the first tuple is smaller than the set distance threshold, the similar duplicate tuple is deleted.
And finally, moving a sliding window step length, moving out the first tuple in the sliding window and moving in the next tuple of the last tuple, and repeating the steps until all tuples in the data are checked.
While the window slides, the ratio of the tuple distance between the head and tail tuples in the window to the number of tuples in the window is computed and used as the window's average density. If the average density is higher than the set density threshold, the similarity between tuples in the window is considered low, and the window size can be appropriately reduced to cut down the number of comparisons and improve repair efficiency. Conversely, if the average density is lower than the density threshold, the similarity between tuples in the window is considered high, and the window size can be appropriately increased to widen the matching range and improve repair accuracy.
In order to further reduce matching errors among all tuples, new sorting keywords can be reselected to perform sorting, checking and repairing again, similar repeated tuples on data are deleted as far as possible through a multiple sliding window detection mechanism, and the accuracy of uniqueness checking and repairing is improved.
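The sorted-neighborhood pass above (sorting key, sliding window, density-driven resizing) can be sketched roughly as follows; the exact density formula, the resize bounds, and the parameter defaults are illustrative assumptions, not the patent's fixed values:

```python
def snm_dedup(records, sort_key, tuple_dist, dist_threshold,
              init_window=5, min_window=3, max_window=10,
              density_threshold=0.5):
    """Sketch of the improved SNM pass: sort by key, slide a window,
    delete tuples whose distance to the window head is below the
    distance threshold, and resize the window from its average density."""
    recs = sorted(records, key=sort_key)
    out = []
    i, w = 0, init_window
    while i < len(recs):
        head = recs[i]
        window = recs[i + 1:i + w]
        # drop near-duplicates of the head tuple in place
        recs[i + 1:i + w] = [t for t in window
                             if tuple_dist(head, t) >= dist_threshold]
        out.append(head)
        window = recs[i + 1:i + w]
        if window:
            # average density: head-to-tail distance over tuple count
            density = tuple_dist(head, window[-1]) / (len(window) + 1)
            if density > density_threshold:    # low similarity: shrink
                w = max(min_window, w - 1)
            elif density < density_threshold:  # high similarity: grow
                w = min(max_window, w + 1)
        i += 1
    return out
```

A second pass with a different sorting key, as the text suggests, is just another call to `snm_dedup` on its own output.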
Step 6: and (4) rechecking whether all tuples in the data processed in the step (5) are matched with all rules in the rule set, if so, all tuples meet the consistency condition, finishing the cleaning of the data, and if not, returning to the step (4-2) to continue executing.
As shown in fig. 6, in an embodiment of the present invention, there is also provided a multi-source heterogeneous data cleaning apparatus, including: the system comprises a data acquisition and preprocessing module, a rule set construction module, an integrity checking and repairing module, a consistency checking and repairing module, a uniqueness checking and repairing module and a consistency secondary checking module.
The data acquisition and preprocessing module is used for acquiring multi-source heterogeneous data and converting the data with the same attribute into data with a uniform format;
the rule set building module is used for building a rule set containing conditional function dependence and external constraint on the data from the data acquisition and preprocessing module;
the integrity checking and repairing module is used for checking whether the data acquired by the data acquisition and preprocessing module is missing or not and then dividing the data into a missing tuple set and a complete tuple set; sequentially checking whether missing items of all missing tuples in the missing tuple set are matched with certain rules in the rule set, if so, filling missing data of the current missing tuple by using the rules, otherwise, filling the missing data of the current missing tuple by using a mixed filling algorithm based on improved KNN;
the consistency checking and repairing module is used for checking whether the data processed by the integrity checking and repairing module violates the rules in the rule set or not, if so, determining the repairing rules, and taking all tuples violating the repairing rules as abnormal tuples; selecting the tuple with the minimum repairing cost from all abnormal tuples as a target tuple, and repairing other abnormal tuples by using the target tuple;
the uniqueness checking and repairing module is used for checking and deleting repeated tuples on the data by using an improved SNM algorithm based on a mixed distance and a dynamic window for the data processed by the consistency checking and repairing module;
and the consistency secondary checking module is used for checking whether the data processed by the uniqueness checking and repairing module meets the consistency condition, finishing the cleaning of the data if the data meets the consistency condition, and returning to the consistency checking and repairing module to execute again if the data does not meet the consistency condition.
The above are merely specific embodiments of the present invention, and are not intended to limit the present invention. It will be apparent to those skilled in the art that the present application is susceptible to many modifications and variations in view of the specific application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (8)

1. A multi-source heterogeneous data cleaning method is characterized by comprising the following steps:
step 1: the method comprises the steps of obtaining multi-source heterogeneous data, converting the data with the same attribute into a uniform data format, and obtaining a data set consisting of a plurality of tuples;
step 2: constructing the conditional function dependencies existing among different attributes of the data processed in step 1, and then adding the conditional function dependencies Σ_cfd and the external constraints Σ_fc to the rule set Σ, where each rule in the rule set Σ corresponds to a certain conditional function dependency or a certain external constraint;
and step 3: carrying out integrity check and integrity repair on all tuples in the data processed in the step 1;
3-1 integrity check
Sequentially traverse all tuples in the data processed in step 1, judging whether the current tuple has missing data; if so, add the current tuple to the missing tuple set T_L, otherwise add it to the complete tuple set T_C;
3-2 integrity repair
Traverse the missing tuple set T_L in order, checking whether the missing items of the current missing tuple match some rule in the rule set Σ; if so, fill the missing data of the current missing tuple using that rule, otherwise fill it using the improved KNN-based hybrid filling algorithm;
the improved KNN-based hybrid filling algorithm comprises the following specific steps:
1) dividing the non-missing data columns of the current missing tuple into 5 types of missing subclass tuples: numerical, binary, ordinal, categorical, and text;
2) dividing the data columns of the complete tuple set T_C that correspond to each type in the current missing tuple into 5 types of complete subclass tuples;
3) respectively calculating the subclass distance between each type of missing tuple and the complete tuple;
4) calculating the tuple distance between the missing tuple t1 and the complete tuple t2;
5) sorting the tuple distances between the missing tuples and the complete tuples in an increasing way;
6) selecting the first k complete tuples with the minimum tuple distance as a target tuple set;
7) selecting data with the most frequency in the corresponding columns of the target tuple set and the missing tuple missing items as filling values of missing data of the missing tuples;
and 4, step 4: carrying out consistency check and consistency repair on all tuples in the data processed in the step 3;
4-1 consistency check:
sequentially traversing all tuples, checking whether the current tuple is matched with all rules in the rule set in the step 2, if so, continuously checking the next tuple, otherwise, adding the rule violated by the current tuple to the abnormal rule set sigma';
4-2, consistency repair;
and 5: performing uniqueness check and repair on all tuples in the data processed in the step 4;
step 6: and (4) rechecking the data processed in the step (5) to determine whether all tuples are matched with all rules in the rule set, if so, all tuples meet the consistency condition to finish the cleaning of the data, and if not, returning to the step (4-2) to continue the execution.
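The KNN-based hybrid filling of steps 1)–7) in claim 1 (rank the complete tuples by mixed tuple distance, then take the most frequent value among the k nearest) can be sketched as follows; the dict-based tuple representation and the caller-supplied `tuple_dist` (implementing the mixed distance of claim 3) are assumptions for illustration:

```python
from collections import Counter

def knn_fill(missing_tuple, complete_tuples, missing_col, tuple_dist, k):
    """Fill one missing cell: pick the k complete tuples nearest to the
    missing tuple and return the most frequent value they hold in the
    missing column (steps 5-7 of the hybrid filling algorithm)."""
    nearest = sorted(complete_tuples,
                     key=lambda t: tuple_dist(missing_tuple, t))[:k]
    values = [t[missing_col] for t in nearest]
    return Counter(values).most_common(1)[0][0]
```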
2. The method for cleaning multi-source heterogeneous data according to claim 1, wherein the step 3) in the integrity repair of the step 3-2 is specifically:
for the numerical subclass tuple, calculating a subclass distance between the missing subclass tuple and the complete subclass tuple by using a standardized Euclidean distance formula (1);
D(L,C)_num = sqrt( Σ_{i=1}^{n} ( (x_Li − x_Ci) / s_i )^2 )    formula (1)
where n denotes the total number of numerical data in the subclass tuple, x_Li the i-th datum of the missing subclass tuple, x_Ci the i-th datum of the complete subclass tuple, and s_i the standard deviation of all values in the i-th data column of the subclass tuple;
for binary subclass tuples, calculating subclass distances between missing subclass tuples and complete subclass tuples by using a formula (2);
D(L,C)_bin = (q + r) / (p + q + r + s)    formula (2)
if the two values of binary data are regarded as 0 and 1, p denotes the number of positions at which both the missing subclass tuple and the complete subclass tuple are 1, q the number at which the missing subclass tuple is 0 and the complete subclass tuple is 1, r the number at which the missing subclass tuple is 1 and the complete subclass tuple is 0, and s the number at which both are 0;
for ordinal type subclass tuples, firstly, converting ordinal data in tuples into numerical data by using a formula (3), and then calculating subclass distances between missing subclass tuples and complete subclass tuples by using a numerical tuple distance formula (4);
X_i = M_i / (N_i − 1)    formula (3)
D(L,C)_ordi = D(L,C)_num    formula (4)
wherein, if all values of the i-th data column of the subclass tuple are regarded as a sequence numbered from 0 to N, N_i denotes the total number of sequence numbers of the i-th column, M_i denotes the sequence number of the current value within that sequence, and X_i denotes the converted numerical value;
for the categorical subclass tuple, calculating the subclass distance between the missing subclass tuple and the complete subclass tuple using formula (5);
D(L,C)_cat = (T − E) / T    formula (5)
since the missing subclass tuple and the complete subclass tuple contain the same number of data, T denotes the total number of data in either subclass tuple, and E denotes the number of positions at which the corresponding data in the two subclass tuples are identical;
for the text type subclass tuples, calculating the distance between character string data by using an edit distance formula (6), and then calculating the subclass distance between the missing subclass tuples and the complete subclass tuples by using a formula (7) and carrying out normalization processing;
D_i(L_j, C_k)_text = Max(j, k), if Min(j, k) = 0; otherwise
D_i(L_j, C_k)_text = Min( D_i(L_{j−1}, C_k)_text + 1, D_i(L_j, C_{k−1})_text + 1, D_i(L_{j−1}, C_{k−1})_text + [L_j ≠ C_k] )    formula (6)
D(L,C)_text = (1/m) · Σ_{i=1}^{m} ( D_i(L,C)_text / Max(U_i, V_i) )    formula (7)
wherein D_i(L,C)_text denotes the edit distance between the i-th character string in the missing subclass tuple and that in the complete subclass tuple; L_j and C_k denote the prefixes consisting of the first j and k characters of the i-th character string in the missing and complete subclass tuples, respectively, with 0 ≤ j ≤ U_i and 0 ≤ k ≤ V_i, and Min denotes the minimum function; since the missing subclass tuple and the complete subclass tuple contain the same number of data, m denotes the total number of character strings in either subclass tuple, U_i and V_i denote the total lengths of the i-th character string in the missing and complete subclass tuples, respectively, and Max denotes the maximum function.
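The text-type subclass distance of formulas (6) and (7) can be sketched as follows (a standard Levenshtein dynamic program plus length normalization; function and variable names are illustrative):

```python
def edit_distance(s, t):
    """Levenshtein edit distance, the recurrence behind formula (6)."""
    prev = list(range(len(t) + 1))
    for j, cs in enumerate(s, 1):
        cur = [j]
        for k, ct in enumerate(t, 1):
            cur.append(min(prev[k] + 1,                 # delete
                           cur[k - 1] + 1,              # insert
                           prev[k - 1] + (cs != ct)))   # substitute
        prev = cur
    return prev[-1]

def text_subclass_dist(missing_strs, complete_strs):
    """Formula (7): average edit distance, each term normalized by the
    length of the longer of the two strings being compared."""
    m = len(missing_strs)
    return sum(edit_distance(a, b) / max(len(a), len(b), 1)
               for a, b in zip(missing_strs, complete_strs)) / m
```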
3. The method for cleaning multi-source heterogeneous data according to claim 2, wherein the step 4) in the integrity repair in the step 3-2 is specifically:
The tuple distance between the missing tuple t1 and the complete tuple t2 is obtained by multiplying each of the above 5 types of subclass distances by its corresponding weight W_i and summing the products, as given by formula (8) and formula (9);
D(t1, t2) = Σ_{i=1}^{5} W_i · D_i(t1, t2)    formula (8)
W_i = Y_i / Y    formula (9)
where i indexes the 5 types of subclass tuples, W_i denotes the weight coefficient of the i-th type of subclass tuple in the current tuple, D_i(t1, t2) denotes the subclass distance between the i-th missing subclass tuple and complete subclass tuple, Y denotes the total number of data in the current tuple, and Y_i denotes the number of type-i data in the current tuple.
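Formulas (8) and (9) amount to a weighted average of the five subclass distances, with each weight proportional to how many columns of that type the tuple contains; a minimal sketch:

```python
def tuple_distance(subclass_dists, counts):
    """Formulas (8)-(9): subclass_dists[i] is D_i(t1, t2); counts[i] is
    Y_i, the number of type-i data columns; Y = sum(counts). The weight
    of each subclass is its share of the tuple's columns."""
    total = sum(counts)  # Y, the total number of data in the tuple
    return sum(d * (y / total) for d, y in zip(subclass_dists, counts))
```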
4. The method according to claim 3, wherein the k value in step 6) of the integrity repair in step 3-2 is obtained by:
6-1) dividing all complete tuples into a test tuple set and a training tuple set;
6-2) dividing the training tuple set into n sub-tuple sets with the same size;
6-3) taking each sub-tuple set in turn as the complete tuple set T_C, and repairing the current missing tuple with each training k value from 1 to 100;
6-4) obtaining the k value with the highest repairing accuracy in each sub-tuple set;
6-5) the average of these n k values is used as the repair k value for the set of test tuples.
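The k-selection procedure of steps 6-1) to 6-5) can be sketched as follows, assuming an `accuracy(fold, k)` callback that performs the trial repairs on one sub-tuple set and scores them (its signature is an assumption for illustration):

```python
from statistics import mean

def choose_k(folds, accuracy, k_range=range(1, 101)):
    """Steps 6-2..6-5: for each of the n sub-tuple sets, find the
    training k (1..100) with the best repair accuracy, then return the
    average of those n best values as the repair k."""
    best_ks = [max(k_range, key=lambda k: accuracy(fold, k))
               for fold in folds]
    return round(mean(best_ks))
```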
5. The method according to claim 1, wherein the consistency repair in step 4 mainly comprises determining a rule repair order, locating an abnormal tuple, and selecting a target tuple; the method comprises the following steps:
4-2-1 determining a rule repair order
1) Constructing a rule sequence diagram G(V, E) by taking the conditional function dependencies in the abnormal rule set Σ' as the nodes V and the dependency relationships between nodes as the edges E, where V = Σ'; for any two conditional function dependencies φ_i: (L_i → R_i) and φ_j: (L_j → R_j), if R_i ∩ L_j ≠ ∅, then there is an edge from φ_i pointing to φ_j, i.e. φ_j depends on φ_i, where L and R denote the left-hand and right-hand sides, respectively, of a conditional function dependency;
2) sequentially select nodes with an in-degree of 0 (i.e., conditional function dependencies) in the rule sequence diagram as priority repair rules, adding them to the repair rule set Σ_rep, then delete each selected node and the edges connected to it, until no nodes remain in the rule sequence diagram G(V, E); if the rule sequence diagram is not empty and contains no node with an in-degree of 0, select from all combinations of conditional function dependencies in the diagram the combination with the minimum total repair cost as the repair rule set Σ_rep;
4-2-2 location anomaly tuples
Sequentially traverse all rules in the repair rule set Σ_rep, adding every tuple that violates the current rule to the abnormal tuple set T_e;
4-2-3 select target tuples
Combining the external constraint rules Σ_fc in the rule set Σ, select from the abnormal tuple set T_e the tuple with the smallest repair cost as the target tuple, and use the target tuple to repair the other abnormal tuples.
6. The method for cleaning multi-source heterogeneous data according to claim 1, wherein in step 5, an improved SNM algorithm based on a mixed distance and a dynamic window is used for the data processed in step 4, and whether the tuple distance between the first tuple in the sliding window and other tuples in the window is smaller than a set distance threshold is checked; if yes, the two tuple data are considered to be similar and repeated, the repeated tuple in the window is deleted, and if not, the first tuple and other tuples are considered to meet the uniqueness condition; and moving out the first tuple in the sliding window and moving in the next tuple of the last tuple in the window, and repeating the steps until all tuples finish uniqueness check, so as to realize data uniqueness check and repair.
7. The method for cleaning multi-source heterogeneous data according to claim 6, wherein the improved SNM algorithm based on the hybrid distance and the dynamic window comprises the following specific steps:
5-1) selecting one or more data for all tuples, calculating corresponding key values of the data and using the key values as sorting keywords;
5-2) sorting all tuples according to the sorting keywords;
5-3) setting a sliding window with the initial size of N and the step length of 1 on the sorted tuples, calculating the tuple distance between the first tuple in the sliding window and other tuples in the window, deleting repeated tuples in the sliding window if at least one tuple distance is smaller than a distance threshold value, and otherwise moving the sliding window by one step length;
5-4) calculating the ratio of the tuple distance between head and tail tuples in the sliding window to the tuple number in the sliding window, taking the ratio as the window average density, increasing the size of the sliding window if the window average density is smaller than a density threshold, keeping the size of the sliding window unchanged if the window average density is equal to the density threshold, decreasing the size of the sliding window if the window average density is larger than the density threshold, and continuing to slide until all tuple inspection is finished.
8. A multi-source heterogeneous data cleaning device is characterized by comprising the following modules:
the data acquisition and preprocessing module is used for acquiring multi-source heterogeneous data and converting the data with the same attribute into data with a uniform format;
the rule set building module is used for building a rule set containing conditional function dependence and external constraint on the data from the data acquisition and preprocessing module;
the integrity checking and repairing module is used for checking whether the data acquired by the data acquisition and preprocessing module is missing or not and then dividing the data into a missing tuple set and a complete tuple set; sequentially checking whether missing items of all missing tuples in the missing tuple set are matched with certain rules in the rule set, if so, filling missing data of the current missing tuple by using the rules, otherwise, filling the missing data of the current missing tuple by using a mixed filling algorithm based on improved KNN;
the consistency checking and repairing module is used for checking whether the data processed by the integrity checking and repairing module violates the rules in the rule set or not, if so, determining the repairing rules, and taking all tuples violating the repairing rules as abnormal tuples; selecting the tuple with the minimum repairing cost from all abnormal tuples as a target tuple, and repairing other abnormal tuples by using the target tuple;
the uniqueness checking and repairing module is used for checking and deleting repeated tuples on the data by using an improved SNM algorithm based on a mixed distance and a dynamic window for the data processed by the consistency checking and repairing module;
and the consistency secondary checking module is used for checking whether the data processed by the uniqueness checking and repairing module meets the consistency condition, finishing the cleaning of the data if the data meets the consistency condition, and returning to the consistency checking and repairing module to execute again if the data does not meet the consistency condition.
CN202111577423.6A 2021-12-22 2021-12-22 Multi-source heterogeneous data cleaning method and device Active CN114281809B (en)

Publications (2)

Publication Number Publication Date
CN114281809A true CN114281809A (en) 2022-04-05
CN114281809B CN114281809B (en) 2023-03-28



