CN105468658B - Data cleaning method and device

Info

Publication number: CN105468658B
Application number: CN201410503126.0A
Authority: CN
Inventors: 廖振松; 熊胜; 吴勤华; 杨晶蕾; 冯文仲; 沈力; 黄艳; 田纪军; 莫益军; 曾志华
Original assignee: China Mobile Group Hubei Co Ltd
Current assignee: China Mobile Group Hubei Co Ltd
Priority date: 2014-09-26
Filing date: 2014-09-26
Publication date: 2020-04-03
Anticipated expiration: 2034-09-26
Also published as: CN105468658A

Abstract

The invention discloses a data cleaning method, which comprises the steps of obtaining data to be cleaned, and obtaining fields to be cleaned in the data to be cleaned according to analysis of noise data distribution in the data to be cleaned; searching expandable dimension fields in the data to be cleaned, and performing high-order tensor dimension expansion on the expandable dimension fields to obtain M tensor field sets; and carrying out data cleaning on the field to be cleaned by utilizing the tensor field related to the field to be cleaned in the tensor field set. The invention also discloses a data cleaning device.

Description

Data cleaning method and device

Technical Field

The invention relates to a data processing technology in the field of computers, in particular to a data cleaning method and device.

Background

With the progress of science and technology and the rapid development of computer technology, people can obtain more and more digital information, and meanwhile, more time needs to be invested to organize and arrange the information. Before statistical analysis is performed on the data, dirty data, i.e., noise data, in the data needs to be filtered out to ensure the accuracy of statistics. Data cleansing is a process of detecting and eliminating errors and inconsistencies of data in a database and improving data quality, and the principle is to convert the data into data meeting data quality requirements by using related technologies.

However, the related art of existing data cleaning has at least the following problems: 1) it is mainly designed to process real-time historical databases, and is poorly suited to non-real-time historical data; 2) it cleans data with low relevance inefficiently; 3) its cleaning process is only suitable for sample data, and cannot clean massive data.

Disclosure of Invention

In view of this, embodiments of the present invention are intended to provide a data cleansing method and apparatus, which can accurately find the data quality problem and effectively complete the cleansing of data.

In order to achieve the purpose, the technical scheme of the invention is realized as follows:

the embodiment of the invention provides a data cleaning method, which comprises the following steps:

acquiring data to be cleaned, and acquiring fields to be cleaned in the data to be cleaned according to analysis of noise data distribution in the data to be cleaned;

searching expandable dimension fields in the data to be cleaned, and performing high-order tensor dimension expansion on the expandable dimension fields to obtain M tensor field sets; wherein M is a positive integer;

and carrying out data cleaning on the field to be cleaned by utilizing the tensor field related to the field to be cleaned in the tensor field set.

In the foregoing scheme, after the data to be cleaned is acquired, the method further includes: and inputting the data to be cleaned into an established database, and optimizing the database to obtain an original database.

In the foregoing solution, the obtaining the field to be cleaned in the data to be cleaned according to the analysis of the noise data distribution in the data to be cleaned includes:

acquiring the probability P of the occurrence of a noise data value in any field of the data to be cleaned in a specified time period, where P = m/n; wherein m is the number of times the noise data value occurs in the specified time period, and n is the total number of data records in the specified time period;

when the value of the probability P is determined to be greater than a preset threshold P₀, marking the field to which the noise data belongs as a field to be cleaned.

In the foregoing solution, the performing high-order tensor dimension expansion on the expandable dimension field to obtain M tensor field sets includes:

and sequentially carrying out high-order tensor dimension expansion on the expandable dimension fields by using a tensor decomposition algorithm to obtain a plurality of tensor fields, and classifying the tensor fields into M tensor field sets according to field semantic similarity.

In the foregoing solution, the performing data cleaning on the field to be cleaned by using the tensor field related to the field to be cleaned in the tensor field set includes:

and carrying out semantic analysis on the field to be cleaned, acquiring a tensor field set corresponding to the field to be cleaned in the tensor field set according to field types, further acquiring a tensor field relevant to the field to be cleaned in the tensor field set, and carrying out data cleaning on the field to be cleaned by utilizing the tensor field.

The embodiment of the invention also provides a data cleaning device, which comprises: an acquisition module, a processing module and a data cleaning module; wherein,

the acquisition module is used for acquiring data to be cleaned and acquiring fields to be cleaned in the data to be cleaned according to analysis of noise data distribution in the data to be cleaned;

the processing module is used for searching the expandable dimension field in the data to be cleaned, and performing high-order tensor dimension expansion on the expandable dimension field to obtain M tensor field sets; wherein M is a positive integer;

and the data cleaning module is used for performing data cleaning on the field to be cleaned by utilizing the tensor field related to the field to be cleaned in the tensor field set.

In the above scheme, the apparatus further comprises: and the entry module is used for entering the data to be cleaned into an established database and optimizing the database to obtain an original database.

In the above scheme, the acquisition module is specifically configured to obtain a probability P of occurrence of a noise data value in any field of the data to be cleaned within a specified time period, where P = m/n; wherein m is the number of times the noise data value occurs in the specified time period, and n is the total number of data records in the specified time period;

In the foregoing solution, the processing module is specifically configured to sequentially perform high-order tensor dimension expansion on the expandable dimension fields by using a tensor decomposition algorithm, obtain a plurality of tensor fields, and classify the plurality of tensor fields into M tensor field sets according to field semantic similarity.

In the foregoing solution, the data cleaning module is specifically configured to perform semantic analysis on the field to be cleaned, obtain, according to a field type, a tensor field set corresponding to the field to be cleaned in the tensor field set, further obtain a tensor field related to the field to be cleaned in the tensor field set, and perform data cleaning on the field to be cleaned by using the tensor field.

The data cleaning method and the data cleaning device provided by the embodiment of the invention are used for acquiring data to be cleaned and acquiring fields to be cleaned in the data to be cleaned according to analysis of noise data distribution in the data to be cleaned; searching expandable dimension fields in the data to be cleaned, and performing high-order tensor dimension expansion on the expandable dimension fields to obtain M tensor field sets; and carrying out data cleaning on the field to be cleaned by utilizing the tensor field related to the field to be cleaned in the tensor field set. Therefore, through analysis of noise data distribution in the data to be cleaned, fields which cannot be found by common data cleaning rules and detection methods and have quality problems are accurately obtained, and cleaning of massive data, non-real-time historical data or data with low relevance is effectively completed based on high-order tensor dimension expansion.

Drawings

FIG. 1 is a schematic flow chart of a data cleaning method according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of a data cleaning method according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a data cleaning apparatus according to an embodiment of the present invention.

Detailed Description

In the embodiment of the invention, data to be cleaned is obtained, and fields to be cleaned in the data to be cleaned are obtained according to analysis of noise data distribution in the data to be cleaned; searching expandable dimension fields in the data to be cleaned, and performing high-order tensor dimension expansion on the expandable dimension fields to obtain M tensor field sets; carrying out data cleaning on the field to be cleaned by utilizing the tensor field related to the field to be cleaned in the tensor field set; wherein M is a positive integer.

Fig. 1 is a schematic flow chart of a data cleaning method according to an embodiment of the present invention, and as shown in fig. 1, the flow chart of the data cleaning method according to the embodiment includes:

step 101: acquiring data to be cleaned, and acquiring fields to be cleaned in the data to be cleaned according to analysis of noise data distribution in the data to be cleaned;

this step can be realized by an acquisition module in the server;

before this step, the method further comprises: analyzing a data source, and establishing a database according to data characteristics;

here, the data characteristics may include: number of fields, type of fields, attributes of fields, semantics of fields, etc.

Further, after the data to be cleaned is obtained, before the field to be cleaned in the data to be cleaned is obtained according to analysis of noise data distribution in the data to be cleaned, the method further includes:

inputting the data to be cleaned into an established database, and optimizing the database to obtain an original database;

here, optimizing the database includes: repairing problems that arise when the data to be cleaned is entered into the database, such as insufficient data table length;

the original database thus obtained is the database into which the data to be cleaned has been completely entered.

Further, the acquiring data to be cleaned includes: acquiring data to be cleaned from the data source by using a database management tool;

the obtaining the field to be cleaned in the data to be cleaned according to the analysis of the noise data distribution in the data to be cleaned comprises:

acquiring the probability P of the occurrence of a noise data value in any field of the data to be cleaned in a specified time period, where P = m/n; wherein m is the number of times the noise data value occurs in the specified time period, and n is the total number of data records in the specified time period;

when the value of the probability P is determined to be greater than a preset threshold P₀, marking the field to which the noise data belongs as a field to be cleaned;

wherein the noise data values may include: missing values, error values, inconsistent values, and the like;

the specified time period is a time period set according to actual needs, and can be one or several days, one or several months, one or several quarters and the like; for example, September 10, 2014 to September 15, 2014;

the preset threshold P₀ can be set according to actual needs, and can be determined according to the target quality of the data cleaning.
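The screening described in step 101 can be illustrated with a minimal Python sketch. It assumes records are dictionaries carrying a date under a hypothetical key `day`, and it treats only missing values (`None` or an empty string) as noise; the function names, the sample records and the threshold value 0.5 are illustrative assumptions, not part of the disclosed method.

```python
from datetime import date


def is_noise(value):
    """Assumption: only missing values (None or empty string) count as noise here."""
    return value is None or value == ""


def fields_to_clean(records, fields, start, end, p0, date_key="day"):
    """Return {field: P} for every field whose noise rate P = m/n exceeds p0.

    m is the number of noisy values of the field inside [start, end],
    n is the total number of records inside [start, end].
    """
    window = [r for r in records if start <= r[date_key] <= end]
    n = len(window)
    if n == 0:
        return {}
    flagged = {}
    for field in fields:
        m = sum(1 for r in window if is_noise(r.get(field)))
        p = m / n
        if p > p0:
            flagged[field] = p
    return flagged


# Hypothetical sample data; the last record lies outside the specified period.
records = [
    {"day": date(2014, 9, 10), "imei": "123456789012345", "mobile_type": ""},
    {"day": date(2014, 9, 11), "imei": "123456789012345", "mobile_type": "X1"},
    {"day": date(2014, 9, 12), "imei": None, "mobile_type": None},
    {"day": date(2014, 9, 20), "imei": None, "mobile_type": None},
]

result = fields_to_clean(
    records,
    ["imei", "mobile_type"],
    start=date(2014, 9, 10),
    end=date(2014, 9, 15),
    p0=0.5,
)
```

In this sample, three records fall inside the period; `mobile_type` is noisy in two of them (P = 2/3) and is flagged, while `imei` is noisy in one (P = 1/3) and is not.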

Step 102: searching expandable dimension fields in the data to be cleaned, and performing high-order tensor dimension expansion on the expandable dimension fields to obtain M tensor field sets;

this step can be implemented by a processing module in the server;

here, the expandable dimension field is a field from which more information can be obtained after the dimension of the field is expanded; for example, an IMEI number field;

the searching for the expandable dimension field in the data to be cleaned comprises:

matching fields in the data to be cleaned with expandable dimension fields in a preset expandable dimension field library to obtain expandable dimension fields in the data to be cleaned;
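The matching against a preset library can be sketched as follows, assuming the library is a set of normalized field names and that matching is a case-insensitive name comparison; the library contents and the function name are hypothetical, and the actual matching rule is not specified here.

```python
# Hypothetical preset library of expandable dimension fields (normalized names).
EXPANDABLE_FIELD_LIBRARY = {"imei", "cell_id", "url"}


def find_expandable_fields(columns, library=EXPANDABLE_FIELD_LIBRARY):
    """Return the columns whose normalized name appears in the library, in input order."""
    return [c for c in columns if c.strip().lower() in library]


found = find_expandable_fields(["user_id", "IMEI", "traffic_bytes", "url "])
```

Here `found` keeps the original column spellings (`"IMEI"` and `"url "`), so the caller can still address the columns in the source table.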

the performing high-order tensor dimension expansion on the expandable dimension field to obtain M tensor field sets includes:

sequentially performing high-order tensor dimension expansion on the expandable dimension fields by using the Tucker tensor decomposition algorithm to obtain a plurality of tensor fields, and classifying the tensor fields into M tensor field sets according to field semantic similarity; M is a positive integer;

wherein the tensor fields are the factor matrices on multiple dimensions into which the expandable dimension field is decomposed after high-order tensor dimension expansion; for example, carrying out high-order tensor dimension expansion on the IMEI number field yields tensor fields such as the manufacturer mobile_vendor and the user terminal type mobile_type;
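For readers unfamiliar with Tucker decomposition, the following NumPy sketch computes one through the higher-order SVD (HOSVD), a standard way of obtaining a core tensor and one factor matrix per dimension. How the data tensor is built from an expandable dimension field is not specified here, so the input below is a random 3-way tensor used purely as a stand-in; all function names are illustrative assumptions.

```python
import numpy as np


def unfold(tensor, mode):
    """Mode-n unfolding: bring axis `mode` to the front and flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def mode_product(tensor, matrix, mode):
    """Multiply `tensor` by `matrix` (shape J x I_mode) along axis `mode`."""
    moved = np.moveaxis(tensor, mode, 0)
    result = np.tensordot(matrix, moved, axes=(1, 0))
    return np.moveaxis(result, 0, mode)


def hosvd(tensor, ranks):
    """Truncated HOSVD: return (core, factors), one factor matrix per mode."""
    factors = []
    for mode, rank in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :rank])  # leading left singular vectors of the unfolding
    core = tensor
    for mode, u in enumerate(factors):
        core = mode_product(core, u.T, mode)  # project onto each factor basis
    return core, factors


def reconstruct(core, factors):
    """Rebuild the (approximate) tensor from the core and the factor matrices."""
    out = core
    for mode, u in enumerate(factors):
        out = mode_product(out, u, mode)
    return out


# Stand-in data: a random 4 x 3 x 2 tensor, decomposed at full rank.
rng = np.random.default_rng(0)
x = rng.random((4, 3, 2))
core, factors = hosvd(x, ranks=(4, 3, 2))
x_hat = reconstruct(core, factors)
```

At full rank the reconstruction matches the input up to floating-point error; choosing smaller ranks gives a compressed approximation, with each factor matrix describing one dimension of the data.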

the classifying the plurality of tensor fields into sets of M tensor fields by field semantic similarity comprises:

calculating the similarity among the tensor fields according to field semantics, combining the tensor fields with the similarity larger than a preset threshold of the similarity into a tensor field set, and further classifying the tensor fields into M tensor field sets;

wherein the preset threshold value of the similarity may be 0.5, and the range of the similarity is (0, 1);

here, the field semantics are the meaning of the field itself;

the set of tensor fields may be: a traffic-related tensor field set or a terminal-related tensor field set; for example: the set of terminal-related tensor fields includes: tensor fields such as terminal type, manufacturer, place of production, etc.
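The grouping by semantic similarity can be sketched as below. The similarity measure is not specified here, so this sketch substitutes Jaccard similarity over hypothetical keyword tags attached to each tensor field, and merges fields transitively (single linkage) whenever a pair exceeds the preset threshold of 0.5; the tags, the field name `origin_place` and the traffic field names are assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two tag sets (stand-in for semantic similarity)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_fields(field_tags, threshold=0.5):
    """Merge fields whose pairwise similarity exceeds `threshold` into sets."""
    names = list(field_tags)
    parent = {name: name for name in names}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        if jaccard(field_tags[a], field_tags[b]) > threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical tensor fields with hand-written semantic tags.
field_tags = {
    "mobile_type": {"terminal", "device", "model"},
    "mobile_vendor": {"terminal", "device", "model", "maker"},
    "origin_place": {"terminal", "device", "maker", "place"},
    "uplink_bytes": {"traffic", "volume"},
    "downlink_bytes": {"traffic", "volume"},
}

tensor_field_sets = group_fields(field_tags, threshold=0.5)
M = len(tensor_field_sets)
```

With these tags the result is M = 2: a traffic-related set and a terminal-related set. Note that `origin_place` joins the terminal set through `mobile_vendor` even though its direct similarity to `mobile_type` is below the threshold, which is a consequence of the single-linkage choice made in this sketch.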

Step 103: carrying out data cleaning on the field to be cleaned by utilizing the tensor field related to the field to be cleaned in the tensor field set;

the step can be realized by a data cleaning module in the server;

the method specifically comprises the following steps: semantic analysis is carried out on the field to be cleaned, a tensor field set corresponding to the field to be cleaned in the tensor field set is obtained according to field types, tensor fields related to the field to be cleaned in the tensor field set are further obtained, and data cleaning is carried out on the field to be cleaned through the tensor fields;

wherein the field category may be a field attribute; such as: a set of tensor fields associated with traffic or a set of tensor fields associated with a terminal;

the tensor field related to the field to be cleaned in the tensor field set can be: a tensor field in the tensor field set whose semantics are basically consistent with those of the field to be cleaned; for example, a tensor field that represents the same attribute as the field to be cleaned, such as the user terminal type mobile_type;

or, a tensor field having a functional dependency relationship with the field to be cleaned, for example, the field X to be cleaned depends on the tensor field Y in the tensor field set;

the data cleaning of the field to be cleaned by using the tensor field comprises the following steps:

filling missing values and repairing error values of the field to be cleaned by using a tensor field whose semantics are basically consistent with those of the field to be cleaned, and repairing inconsistent values by using a tensor field having a functional dependency relationship with the field to be cleaned.
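The two repair strategies can be sketched in Python as follows. The sketch assumes one record per dictionary, a validity check supplied by the caller, and a lookup table of known-good value pairs that encodes the functional dependency; the field names `terminal_type` and `vendor`, the lookup table and the helper names are hypothetical.

```python
MISSING = (None, "")


def fill_from_equivalent(record, target, equivalent, is_valid):
    """Fill a missing or invalid `target` value from a field with the same meaning."""
    cleaned = dict(record)
    value = cleaned.get(target)
    if value in MISSING or not is_valid(value):
        cleaned[target] = cleaned.get(equivalent)
    return cleaned


def repair_by_dependency(record, target, determinant, mapping):
    """Enforce the dependency determinant -> target using known-good pairs."""
    cleaned = dict(record)
    expected = mapping.get(cleaned.get(determinant))
    if expected is not None and cleaned.get(target) != expected:
        cleaned[target] = expected
    return cleaned


# Hypothetical reference data derived from the tensor fields.
KNOWN_TYPES = {"X1", "Z9"}
TYPE_TO_VENDOR = {"X1": "VendorA", "Z9": "VendorB"}

row = {"terminal_type": "", "mobile_type": "X1", "vendor": "VendorB"}
row = fill_from_equivalent(row, "terminal_type", "mobile_type", KNOWN_TYPES.__contains__)
row = repair_by_dependency(row, "vendor", "mobile_type", TYPE_TO_VENDOR)
```

After both calls, the empty `terminal_type` has been filled with `"X1"` from the semantically equivalent field, and the inconsistent `vendor` has been corrected to `"VendorA"` because it depends on `mobile_type`. Both helpers return copies, so the original record remains available for logging.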

Further, after this step, the method further comprises: updating the cleaned data to a database and recording a cleaning log;

here, the cleansing log includes: cleaning time, original data, cleaning operation, cleaned data, a recorder and the like;

wherein the cleaning time is the specific time for executing the data cleaning; the original data is data before cleaning; the cleaning operation is a specific cleaning operation of the data to be cleaned, such as: delete, modify, etc.;

and recording the cleaning log so as to facilitate the subsequent quality analysis of the data, the restoration of the original data and the like.
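A cleaning log entry with the listed contents might look like the following sketch. The class and function names are illustrative, and the `field_name` attribute is an added assumption so that a logged change can be mapped back to a column when restoring the original data.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CleaningLogEntry:
    cleaning_time: datetime   # when the cleaning was executed
    field_name: str           # assumption: which field was changed
    original_value: object    # data before cleaning
    operation: str            # e.g. "modify" or "delete"
    cleaned_value: object     # data after cleaning
    recorder: str             # who or what recorded the change


def apply_and_log(record, field_name, new_value, recorder, log):
    """Modify one field of `record` in place and append a log entry."""
    entry = CleaningLogEntry(
        cleaning_time=datetime.now(timezone.utc),
        field_name=field_name,
        original_value=record.get(field_name),
        operation="modify",
        cleaned_value=new_value,
        recorder=recorder,
    )
    record[field_name] = new_value
    log.append(entry)
    return entry


def restore(record, entry):
    """Undo one logged modification by writing the original value back."""
    record[entry.field_name] = entry.original_value


log = []
row = {"mobile_type": ""}
entry = apply_and_log(row, "mobile_type", "X1", recorder="cleaner-01", log=log)
```

Keeping the original value in each entry is what makes the later restoration possible: calling `restore(row, entry)` writes the pre-cleaning value back.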

Fig. 2 is a schematic flow chart of a data cleaning method according to an embodiment of the present invention, and as shown in fig. 2, the flow chart of the data cleaning method according to the embodiment includes:

step 201: analyzing a data source, and establishing a database according to data characteristics;

Step 202: acquiring data to be cleaned, inputting the data to be cleaned into an established database, and optimizing the database to obtain an original database;

here, the acquiring data to be cleaned includes: acquiring data to be cleaned from the data source by using a database management tool;

optimizing the database comprises: repairing problems that arise when the data to be cleaned is entered into the database, such as insufficient data table length;

Step 203: according to analysis of noise data distribution in the data to be cleaned, obtaining fields to be cleaned in the data to be cleaned;

this step specifically comprises: acquiring the probability P of the occurrence of a noise data value in any field of the data to be cleaned in a specified time period, where P = m/n; wherein m is the number of times the noise data value occurs in the specified time period, and n is the total number of data records in the specified time period;

Step 204: searching expandable dimension fields in the data to be cleaned, and performing high-order tensor dimension expansion on the expandable dimension fields to obtain M tensor field sets;

here, the field semantics are the meaning of the field itself;

Step 205: carrying out data cleaning on the field to be cleaned by utilizing the tensor field related to the field to be cleaned in the tensor field set;

Step 206: updating the cleaned data to a database and recording a cleaning log;

FIG. 3 is a schematic diagram of a data cleaning apparatus according to an embodiment of the present invention; as shown in FIG. 3, the data cleaning apparatus according to the embodiment of the present invention includes: an acquisition module 31, a processing module 32 and a data cleaning module 33; wherein,

the acquisition module 31 is configured to acquire data to be cleaned, and acquire a field to be cleaned in the data to be cleaned according to analysis of noise data distribution in the data to be cleaned;

the processing module 32 is configured to search for an expandable dimension field in the data to be cleaned, and perform high-order tensor dimension expansion on the expandable dimension field to obtain M tensor field sets; wherein M is a positive integer;

the data cleaning module 33 is configured to perform data cleaning on the field to be cleaned by using the tensor field related to the field to be cleaned in the tensor field set.

Further, the apparatus further comprises: the establishing module 35 is used for analyzing the data source and establishing a database according to the data characteristics;

Further, the apparatus further comprises: the entry module 34 is configured to enter the data to be cleaned into an established database, and optimize the database to obtain an original database;

the entry module 34 optimizing the database includes: the entry module 34 repairs problems that arise when the data to be cleaned is entered into the database, such as insufficient data table length;

Further, the acquisition module 31 acquiring the data to be cleaned includes: the acquisition module 31 acquires the data to be cleaned from the data source by using a database management tool.

Further, the acquisition module 31 acquiring a field to be cleaned in the data to be cleaned according to analysis of noise data distribution in the data to be cleaned includes:

the acquisition module 31 acquires the probability P of the occurrence of a noise data value in any field of the data to be cleaned in the specified time period, where P = m/n; wherein m is the number of times the noise data value occurs in the specified time period, and n is the total number of data records in the specified time period;

Further, the processing module 32 performing high-order tensor dimension expansion on the expandable dimension field to obtain M tensor field sets includes:

the processing module 32 sequentially performs high-order tensor dimension expansion on the expandable dimension fields by using a tensor decomposition algorithm to obtain a plurality of tensor fields, and classifies the tensor fields into M tensor field sets according to field semantic similarity; M is a positive integer;

here, the expandable dimension field is a field from which more information can be obtained after the dimension of the field is expanded; for example, an IMEI number field;

here, the field semantics are the meaning of the field itself;

Further, the data cleaning module 33 performs data cleaning on the field to be cleaned by using the tensor field related to the field to be cleaned in the tensor field set, including:

the data cleaning module 33 performs semantic analysis on the field to be cleaned, acquires a tensor field set corresponding to the field to be cleaned in the tensor field set according to the field type, further acquires a tensor field related to the field to be cleaned in the tensor field set, and performs data cleaning on the field to be cleaned by using the tensor field;

the data cleaning module 33 performing data cleaning on the field to be cleaned by using the tensor field includes:

the data cleaning module 33 fills missing values and repairs error values of the field to be cleaned by using the tensor field whose semantics are basically consistent with those of the field to be cleaned, and repairs inconsistent values by using the tensor field having a functional dependency relationship with the field to be cleaned.

Further, the device further comprises an updating module 36, configured to update the cleaned data to the database and record a cleaning log;

In the embodiment of the present invention, the data cleaning apparatus may be located in a server, and the acquisition module 31, the processing module 32, the data cleaning module 33, the entry module 34, the establishing module 35, and the updating module 36 may all be implemented by a Central Processing Unit (CPU), a Digital Signal Processor (DSP), or a Field Programmable Gate Array (FPGA) in the server.

The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.

Claims

1. A method of data cleansing, the method comprising:

and carrying out data cleaning on the field to be cleaned by utilizing the tensor field related to the field to be cleaned in the tensor field set, wherein the tensor field related to the field to be cleaned in the tensor field set is: a tensor field whose semantics are basically consistent with those of the field to be cleaned, or a tensor field having a functional dependency relationship with the field to be cleaned; the tensor field set includes: a traffic-related tensor field set or a terminal-related tensor field set; the terminal-related tensor field set includes at least one of the following tensor fields: terminal type, manufacturer, and place of production; the data cleaning of the field to be cleaned by using the tensor field related to the field to be cleaned in the tensor field set comprises: filling missing values and repairing error values of the field to be cleaned by using a tensor field whose semantics are basically consistent with those of the field to be cleaned, or repairing the field to be cleaned by using a tensor field having a functional dependency relationship with the field to be cleaned.

2. The method of claim 1, wherein after the obtaining the data to be cleaned, the method further comprises: and inputting the data to be cleaned into an established database, and optimizing the database to obtain an original database.

3. The method according to claim 1 or 2, wherein the obtaining the field to be cleaned in the data to be cleaned according to the analysis of the distribution of the noise data in the data to be cleaned comprises:

4. The method of claim 1 or 2, wherein performing a higher-order tensor expansion on the expandable dimension field to obtain M sets of tensor fields comprises:

5. The method according to claim 1 or 2, wherein the data cleaning of the field to be cleaned by using the tensor field related to the field to be cleaned in the tensor field set comprises:

6. A data cleaning apparatus, said apparatus comprising: an acquisition module, a processing module and a data cleaning module; wherein,

the data cleaning module is configured to perform data cleaning on the field to be cleaned by using the tensor field related to the field to be cleaned in the tensor field set, where the tensor field related to the field to be cleaned in the tensor field set is: a tensor field whose semantics are basically consistent with those of the field to be cleaned, or a tensor field having a functional dependency relationship with the field to be cleaned; the tensor field set includes: a traffic-related tensor field set or a terminal-related tensor field set; the terminal-related tensor field set includes at least one of the following tensor fields: terminal type, manufacturer, and place of production;

the data cleaning module is further configured to fill missing values and repair error values of the field to be cleaned by using a tensor field whose semantics are basically consistent with those of the field to be cleaned, or repair the field to be cleaned by using a tensor field having a functional dependency relationship with the field to be cleaned.

7. The apparatus of claim 6, further comprising: and the entry module is used for entering the data to be cleaned into an established database and optimizing the database to obtain an original database.

8. The apparatus according to claim 6 or 7, wherein the acquisition module is specifically configured to obtain a probability P of occurrence of a noise data value in any field of the data to be cleaned within a specified time period, where P = m/n; wherein m is the number of times the noise data value occurs in the specified time period, and n is the total number of data records in the specified time period;

9. The apparatus according to claim 6 or 7, wherein the processing module is specifically configured to perform high-order tensor dimension expansion on the expandable dimension fields in sequence by using a tensor decomposition algorithm to obtain a plurality of tensor fields, and classify the plurality of tensor fields into M tensor field sets according to field semantic similarity.

10. The apparatus according to claim 6 or 7, wherein the data cleaning module is specifically configured to perform semantic analysis on the field to be cleaned, obtain a tensor field set corresponding to the field to be cleaned in the tensor field set according to a field type, further obtain a tensor field related to the field to be cleaned in the tensor field set, and perform data cleaning on the field to be cleaned by using the tensor field.