CN105468658B - Data cleaning method and device - Google Patents


Publication number
CN105468658B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410503126.0A
Other languages
Chinese (zh)
Other versions
CN105468658A (en
Inventor
廖振松
熊胜
吴勤华
杨晶蕾
冯文仲
沈力
黄艳
田纪军
莫益军
曾志华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Group Hubei Co Ltd
Original Assignee
China Mobile Group Hubei Co Ltd
Application filed by China Mobile Group Hubei Co Ltd
Priority to CN201410503126.0A
Publication of CN105468658A
Application granted
Publication of CN105468658B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data cleaning method, which comprises the steps of obtaining data to be cleaned, and obtaining fields to be cleaned in the data to be cleaned according to analysis of noise data distribution in the data to be cleaned; searching expandable dimension fields in the data to be cleaned, and performing high-order tensor dimension expansion on the expandable dimension fields to obtain M tensor field sets; and carrying out data cleaning on the field to be cleaned by utilizing the tensor field related to the field to be cleaned in the tensor field set. The invention also discloses a data cleaning device.

Description

Data cleaning method and device
Technical Field
The invention relates to a data processing technology in the field of computers, in particular to a data cleaning method and device.
Background
With the progress of science and technology and the rapid development of computer technology, people can obtain more and more digital information, and meanwhile, more time needs to be invested to organize and arrange the information. Before statistical analysis is performed on the data, dirty data, i.e., noise data, in the data needs to be filtered out to ensure the accuracy of statistics. Data cleansing is a process of detecting and eliminating errors and inconsistencies of data in a database and improving data quality, and the principle is to convert the data into data meeting data quality requirements by using related technologies.
However, existing data cleaning techniques have at least the following problems: 1) they mainly target real-time historical databases and apply poorly to non-real-time historical data; 2) they are inefficient at cleaning data with low relevance; 3) the cleaning process is suitable only for sample data and cannot handle massive data.
Disclosure of Invention
In view of this, embodiments of the present invention are intended to provide a data cleansing method and apparatus, which can accurately find the data quality problem and effectively complete the cleansing of data.
In order to achieve the purpose, the technical scheme of the invention is realized as follows:
the embodiment of the invention provides a data cleaning method, which comprises the following steps:
acquiring data to be cleaned, and acquiring fields to be cleaned in the data to be cleaned according to analysis of noise data distribution in the data to be cleaned;
searching expandable dimension fields in the data to be cleaned, and performing high-order tensor dimension expansion on the expandable dimension fields to obtain M tensor field sets; wherein M is a positive integer;
and carrying out data cleaning on the field to be cleaned by utilizing the tensor field related to the field to be cleaned in the tensor field set.
In the foregoing scheme, after the data to be cleaned is acquired, the method further includes: and inputting the data to be cleaned into an established database, and optimizing the database to obtain an original database.
In the foregoing solution, the obtaining the field to be cleaned in the data to be cleaned according to the analysis of the noise data distribution in the data to be cleaned includes:
acquiring the probability P that a noise data value occurs in any field of the data to be cleaned within a specified time period, where P = m/n, m is the number of occurrences of the noise data value in the specified time period, and n is the total number of data records in the specified time period;
when the probability P is determined to be greater than a preset threshold P0, marking the field to which the noise data belongs as a field to be cleaned.
In the foregoing solution, the performing high-order tensor dimension expansion on the expandable dimension field to obtain M tensor field sets includes:
and sequentially carrying out high-order tensor dimension expansion on the expandable dimension fields by using a tensor decomposition algorithm to obtain a plurality of tensor fields, and classifying the tensor fields into M tensor field sets according to field semantic similarity.
In the foregoing solution, the performing data cleaning on the field to be cleaned by using the tensor field related to the field to be cleaned in the tensor field set includes:
and carrying out semantic analysis on the field to be cleaned, acquiring a tensor field set corresponding to the field to be cleaned in the tensor field set according to field types, further acquiring a tensor field relevant to the field to be cleaned in the tensor field set, and carrying out data cleaning on the field to be cleaned by utilizing the tensor field.
The embodiment of the invention also provides a data cleaning device, which comprises: an acquisition module, a processing module and a data cleaning module; wherein:
the acquisition module is used for acquiring data to be cleaned and acquiring fields to be cleaned in the data to be cleaned according to analysis of noise data distribution in the data to be cleaned;
the processing module is used for searching the expandable dimension field in the data to be cleaned, and performing high-order tensor dimension expansion on the expandable dimension field to obtain M tensor field sets; wherein M is a positive integer;
and the data cleaning module is used for performing data cleaning on the field to be cleaned by utilizing the tensor field related to the field to be cleaned in the tensor field set.
In the above scheme, the apparatus further comprises: and the entry module is used for entering the data to be cleaned into an established database and optimizing the database to obtain an original database.
In the above scheme, the obtaining module is specifically configured to obtain the probability P that a noise data value occurs in any field of the data to be cleaned within a specified time period, where P = m/n, m is the number of occurrences of the noise data value in the specified time period, and n is the total number of data records in the specified time period;
and, when the probability P is determined to be greater than a preset threshold P0, to mark the field to which the noise data belongs as a field to be cleaned.
In the foregoing solution, the processing module is specifically configured to sequentially perform high-order tensor dimension expansion on the expandable dimension fields by using a tensor decomposition algorithm, obtain a plurality of tensor fields, and classify the plurality of tensor fields into M tensor field sets according to field semantic similarity.
In the foregoing solution, the data cleaning module is specifically configured to perform semantic analysis on the field to be cleaned, obtain, according to a field type, a tensor field set corresponding to the field to be cleaned in the tensor field set, further obtain a tensor field related to the field to be cleaned in the tensor field set, and perform data cleaning on the field to be cleaned by using the tensor field.
The data cleaning method and the data cleaning device provided by the embodiment of the invention are used for acquiring data to be cleaned and acquiring fields to be cleaned in the data to be cleaned according to analysis of noise data distribution in the data to be cleaned; searching expandable dimension fields in the data to be cleaned, and performing high-order tensor dimension expansion on the expandable dimension fields to obtain M tensor field sets; and carrying out data cleaning on the field to be cleaned by utilizing the tensor field related to the field to be cleaned in the tensor field set. Therefore, through analysis of noise data distribution in the data to be cleaned, fields which cannot be found by common data cleaning rules and detection methods and have quality problems are accurately obtained, and cleaning of massive data, non-real-time historical data or data with low relevance is effectively completed based on high-order tensor dimension expansion.
Drawings
FIG. 1 is a schematic flow chart of a data cleaning method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a data cleaning method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a data cleaning apparatus according to an embodiment of the present invention.
Detailed Description
In the embodiment of the invention, data to be cleaned is obtained, and fields to be cleaned in the data to be cleaned are obtained according to analysis of noise data distribution in the data to be cleaned; searching expandable dimension fields in the data to be cleaned, and performing high-order tensor dimension expansion on the expandable dimension fields to obtain M tensor field sets; carrying out data cleaning on the field to be cleaned by utilizing the tensor field related to the field to be cleaned in the tensor field set; wherein M is a positive integer.
Fig. 1 is a schematic flow chart of a data cleaning method according to an embodiment of the present invention, and as shown in fig. 1, the flow chart of the data cleaning method according to the embodiment includes:
step 101: acquiring data to be cleaned, and acquiring fields to be cleaned in the data to be cleaned according to analysis of noise data distribution in the data to be cleaned;
this step can be realized by an acquisition module in the server;
before this step, the method further comprises: analyzing a data source, and establishing a database according to data characteristics;
here, the data characteristics may include: number of fields, type of fields, attributes of fields, semantics of fields, etc.
Further, after the data to be cleaned is obtained, before the field to be cleaned in the data to be cleaned is obtained according to analysis of noise data distribution in the data to be cleaned, the method further includes:
inputting the data to be cleaned into an established database, and optimizing the database to obtain an original database;
here, the optimizing the database includes: repairing problems that arise when the data to be cleaned is entered into the database, such as: the length of the data table is not enough;
the obtained original database is as follows: the data to be cleaned is completely entered in the database.
Further, the acquiring data to be cleaned includes: acquiring data to be cleaned from the data source by using a database management tool;
the obtaining the field to be cleaned in the data to be cleaned according to the analysis of the noise data distribution in the data to be cleaned comprises:
acquiring the probability P that a noise data value occurs in any field of the data to be cleaned within a specified time period, where P = m/n, m is the number of occurrences of the noise data value in the specified time period, and n is the total number of data records in the specified time period;
when the probability P is determined to be greater than a preset threshold P0, marking the field to which the noise data belongs as a field to be cleaned;
wherein the noise data values may include: missing values, error values, inconsistent values, and the like;
the specified time period is set according to actual needs and may be one or several days, one or several months, one or several quarters, and the like; for example, September 10, 2014 to September 15, 2014;
the preset threshold P0 can be set according to actual needs and can be determined according to the target quality of the data cleaning.
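The noise-probability test of step 101 can be sketched as follows. This is a minimal illustration, not the patented implementation: the record layout, the field names, the `is_noise` rule (only missing or empty values count as noise) and the threshold value 0.5 are all assumptions made for the example.

```python
from datetime import date

# Hypothetical sample records; the field names are illustrative only.
records = [
    {"day": date(2014, 9, 10), "imei": "860123456789012", "mobile_type": "A1"},
    {"day": date(2014, 9, 11), "imei": None, "mobile_type": "A1"},
    {"day": date(2014, 9, 12), "imei": "", "mobile_type": "A1"},
    {"day": date(2014, 9, 20), "imei": None, "mobile_type": None},  # outside the period
]


def is_noise(value):
    """Treat missing or empty values as noise (a simplifying assumption)."""
    return value is None or value == ""


def fields_to_clean(rows, fields, start, end, p0):
    """Return {field: P} for every field whose noise probability P = m/n exceeds p0."""
    in_period = [r for r in rows if start <= r["day"] <= end]
    n = len(in_period)  # total number of records in the specified time period
    flagged = {}
    if n == 0:
        return flagged
    for field in fields:
        m = sum(1 for r in in_period if is_noise(r.get(field)))  # noise occurrences
        p = m / n
        if p > p0:
            flagged[field] = p
    return flagged


result = fields_to_clean(
    records, ["imei", "mobile_type"], date(2014, 9, 10), date(2014, 9, 15), p0=0.5
)
print(sorted(result))  # ['imei']
```

Only the three records inside the period are counted, so the IMEI field has m = 2 and n = 3 and is flagged, while `mobile_type` has no noise in the period and is not.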
Step 102: searching expandable dimension fields in the data to be cleaned, and performing high-order tensor dimension expansion on the expandable dimension fields to obtain M tensor field sets;
this step can be implemented by a processing module in the server;
here, the expandable dimension field is a field from which more information can be obtained after its dimension is expanded; for example, an IMEI number field;
the searching for the expandable dimension field in the data to be cleaned comprises:
matching fields in the data to be cleaned with expandable dimension fields in a preset expandable dimension field library to obtain expandable dimension fields in the data to be cleaned;
the performing high-order tensor dimension expansion on the expandable dimension field to obtain M tensor field sets includes:
sequentially performing high-order tensor dimension expansion on the expandable dimension fields by using the Tucker tensor decomposition algorithm to obtain a plurality of tensor fields, and classifying the tensor fields into M tensor field sets according to field semantic similarity; M is a positive integer;
wherein the tensor fields are the factor matrices on multiple dimensions into which the expandable dimension field is decomposed after high-order tensor dimension expansion; for example, performing high-order tensor dimension expansion on the IMEI number field yields tensor fields such as the manufacturer mobile_vendor and the user terminal type mobile_type;
the classifying the plurality of tensor fields into sets of M tensor fields by field semantic similarity comprises:
calculating the similarity among the tensor fields according to field semantics, combining the tensor fields with the similarity larger than a preset threshold of the similarity into a tensor field set, and further classifying the tensor fields into M tensor field sets;
wherein the preset threshold value of the similarity may be 0.5, and the range of the similarity is (0, 1);
here, the field semantics are the meaning of the field itself;
the set of tensor fields may be: a traffic-related tensor field set or a terminal-related tensor field set; for example: the set of terminal-related tensor fields includes: tensor fields such as terminal type, manufacturer, place of production, etc.
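The grouping of tensor fields into M sets can be sketched as follows. The text does not specify how field semantic similarity is computed, so this example assumes a Jaccard similarity over hand-written keyword sets; the keyword sets and the field names other than mobile_vendor and mobile_type are hypothetical. Fields whose similarity exceeds the threshold 0.5 are merged into the same set with a small union-find.

```python
# Hypothetical semantic descriptors for each tensor field (an assumption for this sketch).
field_keywords = {
    "mobile_type": {"terminal", "device", "hardware", "model"},
    "mobile_vendor": {"terminal", "device", "hardware", "manufacturer"},
    "origin": {"terminal", "device", "hardware", "place"},
    "traffic_up": {"traffic", "bytes", "volume", "uplink"},
    "traffic_down": {"traffic", "bytes", "volume", "downlink"},
}


def jaccard(a, b):
    """Jaccard similarity of two keyword sets, a value between 0 and 1."""
    return len(a & b) / len(a | b)


def group_fields(keywords, threshold=0.5):
    """Merge fields whose pairwise similarity exceeds the threshold into one set."""
    names = list(keywords)
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the set representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if jaccard(keywords[a], keywords[b]) > threshold:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


field_sets = group_fields(field_keywords)
print(len(field_sets))  # 2
```

With these descriptors the result is M = 2: a terminal-related set and a traffic-related set, matching the two kinds of sets named in the text.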
Step 103: carrying out data cleaning on the field to be cleaned by utilizing the tensor field related to the field to be cleaned in the tensor field set;
the step can be realized by a data cleaning module in the server;
the method specifically comprises the following steps: semantic analysis is carried out on the field to be cleaned, a tensor field set corresponding to the field to be cleaned in the tensor field set is obtained according to field types, tensor fields related to the field to be cleaned in the tensor field set are further obtained, and data cleaning is carried out on the field to be cleaned through the tensor fields;
wherein the field category may be a field attribute; such as: a set of tensor fields associated with traffic or a set of tensor fields associated with a terminal;
the tensor field related to the field to be cleaned in the tensor field set may be: a tensor field in the tensor field set whose semantics are basically consistent with those of the field to be cleaned, for example a tensor field that represents the same attribute as the field to be cleaned, such as the user terminal type mobile_type;
or, a tensor field having a functional dependency relationship with the field to be cleaned, for example, the field X to be cleaned depends on the tensor field Y in the tensor field set;
the data cleaning of the field to be cleaned by using the tensor field comprises the following steps:
filling the missing values and repairing the error values of the field to be cleaned by using a tensor field whose semantics are basically consistent with those of the field to be cleaned, and repairing the inconsistent values by using a tensor field that has a functional dependency relationship with the field to be cleaned.
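The two repair rules of step 103 can be sketched as follows, under stated assumptions: `terminal_type` and `vendor` are hypothetical fields to be cleaned, `mobile_type` is the tensor field with matching semantics, and `VENDOR_BY_TYPE` is a hypothetical table expressing the functional dependency of `vendor` on `mobile_type`. The validity check (`VALID_TYPES`) is likewise invented for the example.

```python
VALID_TYPES = {"A1", "B2"}  # hypothetical set of legal terminal types
VENDOR_BY_TYPE = {"A1": "VendorA", "B2": "VendorB"}  # hypothetical dependency: mobile_type -> vendor


def clean_row(row):
    """Return a cleaned copy of one record; the input record is left unchanged."""
    cleaned = dict(row)
    # Rule 1: a missing or erroneous value is filled from the tensor field
    # whose semantics match the field to be cleaned.
    if cleaned.get("terminal_type") not in VALID_TYPES:
        cleaned["terminal_type"] = cleaned["mobile_type"]
    # Rule 2: a value that contradicts the functional dependency is repaired
    # from the tensor field it depends on.
    expected_vendor = VENDOR_BY_TYPE.get(cleaned["mobile_type"])
    if expected_vendor is not None and cleaned.get("vendor") != expected_vendor:
        cleaned["vendor"] = expected_vendor
    return cleaned


rows = [
    {"terminal_type": None, "mobile_type": "A1", "vendor": "VendorA"},  # missing value
    {"terminal_type": "??", "mobile_type": "B2", "vendor": "VendorA"},  # error and inconsistency
    {"terminal_type": "A1", "mobile_type": "A1", "vendor": "VendorA"},  # already clean
]
cleaned_rows = [clean_row(r) for r in rows]
```

Returning a copy rather than mutating the record keeps the original value available, which the cleaning log described next relies on.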
Further, after this step, the method further comprises: updating the cleaned data to a database and recording a cleaning log;
here, the cleansing log includes: cleaning time, original data, cleaning operation, cleaned data, a recorder and the like;
wherein the cleaning time is the specific time for executing the data cleaning; the original data is data before cleaning; the cleaning operation is a specific cleaning operation of the data to be cleaned, such as: delete, modify, etc.;
and recording the cleaning log so as to facilitate the subsequent quality analysis of the data, the restoration of the original data and the like.
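A cleaning log with the listed contents can be sketched as follows. The in-memory `table` dictionary stands in for the database, and the class and function names are hypothetical; only the five log items (cleaning time, original data, cleaning operation, cleaned data, recorder) come from the text.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CleaningLogEntry:
    cleaning_time: datetime  # when the cleaning was executed
    original_data: dict      # the record before cleaning
    operation: str           # e.g. "modify"
    cleaned_data: dict       # the record after cleaning
    recorder: str            # who or what performed the cleaning


def apply_and_log(table, key, new_row, operation, recorder, log):
    """Write the cleaned record to the table and append a log entry."""
    entry = CleaningLogEntry(
        datetime.now(), dict(table[key]), operation, dict(new_row), recorder
    )
    table[key] = dict(new_row)
    log.append((key, entry))
    return entry


def restore(table, log):
    """Undo logged modifications in reverse order to recover the original data."""
    for key, entry in reversed(log):
        table[key] = dict(entry.original_data)


table = {1: {"terminal_type": None, "mobile_type": "A1"}}
log = []
apply_and_log(table, 1, {"terminal_type": "A1", "mobile_type": "A1"}, "modify", "cleaner-job", log)
after_cleaning = dict(table[1])
restore(table, log)
```

The sketch covers only the "modify" operation; logging a "delete" would additionally need the restore step to re-create the removed record.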
Fig. 2 is a schematic flow chart of a data cleaning method according to an embodiment of the present invention, and as shown in fig. 2, the flow chart of the data cleaning method according to the embodiment includes:
step 201: analyzing a data source, and establishing a database according to data characteristics;
here, the data characteristics may include: number of fields, type of fields, attributes of fields, semantics of fields, etc.
Step 202: acquiring data to be cleaned, inputting the data to be cleaned into an established database, and optimizing the database to obtain an original database;
here, the acquiring data to be cleaned includes: acquiring data to be cleaned from the data source by using a database management tool;
the optimizing the database comprises: repairing problems that arise when the data to be cleaned is entered into the database, such as: the length of the data table is not enough;
the obtained original database is as follows: the data to be cleaned is completely entered in the database.
Step 203: according to analysis of noise data distribution in the data to be cleaned, obtaining fields to be cleaned in the data to be cleaned;
specifically: acquiring the probability P that a noise data value occurs in any field of the data to be cleaned within a specified time period, where P = m/n, m is the number of occurrences of the noise data value in the specified time period, and n is the total number of data records in the specified time period;
when the probability P is determined to be greater than a preset threshold P0, marking the field to which the noise data belongs as a field to be cleaned;
wherein the noise data values may include: missing values, error values, inconsistent values, and the like;
the specified time period is set according to actual needs and may be one or several days, one or several months, one or several quarters, and the like; for example, September 10, 2014 to September 15, 2014;
the preset threshold P0 can be set according to actual needs and can be determined according to the target quality of the data cleaning.
Step 204: searching expandable dimension fields in the data to be cleaned, and performing high-order tensor dimension expansion on the expandable dimension fields to obtain M tensor field sets;
here, the expandable dimension field is a field from which more information can be obtained after its dimension is expanded; for example, an IMEI number field;
the searching for the expandable dimension field in the data to be cleaned comprises:
matching fields in the data to be cleaned with expandable dimension fields in a preset expandable dimension field library to obtain expandable dimension fields in the data to be cleaned;
the performing high-order tensor dimension expansion on the expandable dimension field to obtain M tensor field sets includes:
sequentially performing high-order tensor dimension expansion on the expandable dimension fields by using the Tucker tensor decomposition algorithm to obtain a plurality of tensor fields, and classifying the tensor fields into M tensor field sets according to field semantic similarity; M is a positive integer;
wherein the tensor fields are the factor matrices on multiple dimensions into which the expandable dimension field is decomposed after high-order tensor dimension expansion; for example, performing high-order tensor dimension expansion on the IMEI number field yields tensor fields such as the manufacturer mobile_vendor and the user terminal type mobile_type;
the classifying the plurality of tensor fields into sets of M tensor fields by field semantic similarity comprises:
calculating the similarity among the tensor fields according to field semantics, combining the tensor fields with the similarity larger than a preset threshold of the similarity into a tensor field set, and further classifying the tensor fields into M tensor field sets;
wherein the preset threshold value of the similarity may be 0.5, and the range of the similarity is (0, 1);
here, the field semantics are the meaning of the field itself;
the set of tensor fields may be: a traffic-related tensor field set or a terminal-related tensor field set; for example: the set of terminal-related tensor fields includes: tensor fields such as terminal type, manufacturer, place of production, etc.
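For readers unfamiliar with the Tucker decomposition named in step 204, the following NumPy sketch shows what it produces for a generic 3-way array: one factor matrix per dimension plus a small core tensor. It uses the truncated higher-order SVD, one common way to compute a Tucker decomposition. The text does not describe how a tensor is built from a field such as the IMEI number, so the random input array and the chosen ranks are purely illustrative assumptions.

```python
import numpy as np


def unfold(tensor, mode):
    """Mode-n unfolding: bring `mode` to the front and flatten the other axes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def mode_dot(tensor, matrix, mode):
    """Multiply `tensor` by `matrix` along axis `mode`."""
    moved = np.moveaxis(tensor, mode, 0)
    result = np.tensordot(matrix, moved, axes=(1, 0))
    return np.moveaxis(result, 0, mode)


def hosvd(tensor, ranks):
    """Truncated higher-order SVD: returns (core, factors) of a Tucker decomposition."""
    factors = []
    for mode, rank in enumerate(ranks):
        # Leading left singular vectors of each unfolding form that mode's factor matrix.
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :rank])
    core = tensor
    for mode, u in enumerate(factors):
        core = mode_dot(core, u.T, mode)  # project onto each factor's column space
    return core, factors


def reconstruct(core, factors):
    """Rebuild the (approximate) tensor from the core and the factor matrices."""
    out = core
    for mode, u in enumerate(factors):
        out = mode_dot(out, u, mode)
    return out


rng = np.random.default_rng(0)
x = rng.random((4, 5, 3))  # illustrative data only
core, factors = hosvd(x, ranks=(2, 3, 2))
print(core.shape, [f.shape for f in factors])  # (2, 3, 2) [(4, 2), (5, 3), (3, 2)]
```

With full ranks the reconstruction is exact up to floating-point error; smaller ranks give a compressed approximation.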
Step 205: carrying out data cleaning on the field to be cleaned by utilizing the tensor field related to the field to be cleaned in the tensor field set;
the method specifically comprises the following steps: semantic analysis is carried out on the field to be cleaned, a tensor field set corresponding to the field to be cleaned in the tensor field set is obtained according to field types, tensor fields related to the field to be cleaned in the tensor field set are further obtained, and data cleaning is carried out on the field to be cleaned through the tensor fields;
wherein the field category may be a field attribute; such as: a set of tensor fields associated with traffic or a set of tensor fields associated with a terminal;
the tensor field related to the field to be cleaned in the tensor field set may be: a tensor field in the tensor field set whose semantics are basically consistent with those of the field to be cleaned, for example a tensor field that represents the same attribute as the field to be cleaned, such as the user terminal type mobile_type;
or, a tensor field having a functional dependency relationship with the field to be cleaned, for example, the field X to be cleaned depends on the tensor field Y in the tensor field set;
the data cleaning of the field to be cleaned by using the tensor field comprises the following steps:
filling the missing values and repairing the error values of the field to be cleaned by using a tensor field whose semantics are basically consistent with those of the field to be cleaned, and repairing the inconsistent values by using a tensor field that has a functional dependency relationship with the field to be cleaned.
Step 206: updating the cleaned data to a database and recording a cleaning log;
here, the cleansing log includes: cleaning time, original data, cleaning operation, cleaned data, a recorder and the like;
wherein the cleaning time is the specific time for executing the data cleaning; the original data is data before cleaning; the cleaning operation is a specific cleaning operation of the data to be cleaned, such as: delete, modify, etc.;
and recording the cleaning log so as to facilitate the subsequent quality analysis of the data, the restoration of the original data and the like.
FIG. 3 is a schematic diagram of a data cleaning apparatus according to an embodiment of the present invention; as shown in FIG. 3, the data cleaning apparatus according to the embodiment of the present invention includes: an acquisition module 31, a processing module 32 and a data cleaning module 33; wherein:
the acquiring module 31 is configured to acquire data to be cleaned, and acquire a field to be cleaned in the data to be cleaned according to analysis of noise data distribution in the data to be cleaned;
the processing module 32 is configured to search for an expandable dimension field in the data to be cleaned, and perform high-order tensor dimension expansion on the expandable dimension field to obtain M tensor field sets; wherein M is a positive integer;
the data cleaning module 33 is configured to perform data cleaning on the field to be cleaned by using the tensor field related to the field to be cleaned in the tensor field set.
Further, the apparatus further comprises: the establishing module 35 is used for analyzing the data source and establishing a database according to the data characteristics;
here, the data characteristics may include: number of fields, type of fields, attributes of fields, semantics of fields, etc.
Further, the apparatus further comprises: the entry module 34 is configured to enter the data to be cleaned into an established database, and optimize the database to obtain an original database;
the optimizing of the database by the entry module 34 includes: the entry module 34 repairing problems that arise when the data to be cleaned is entered into the database, such as: the length of the data table is not enough;
the obtained original database is as follows: the data to be cleaned is completely entered in the database.
Further, the acquiring module 31 acquires the data to be cleaned, including: the obtaining module 31 obtains the data to be cleaned through the data source by using a database management tool.
Further, the obtaining, by the obtaining module 31, a field to be cleaned in the data to be cleaned according to analysis of distribution of noise data in the data to be cleaned includes:
the obtaining module 31 obtains the probability P that a noise data value occurs in any field of the data to be cleaned within a specified time period, where P = m/n, m is the number of occurrences of the noise data value in the specified time period, and n is the total number of data records in the specified time period;
when the probability P is determined to be greater than a preset threshold P0, the obtaining module 31 marks the field to which the noise data belongs as a field to be cleaned;
wherein the noise data values may include: missing values, error values, inconsistent values, and the like;
the specified time period is set according to actual needs and may be one or several days, one or several months, one or several quarters, and the like; for example, September 10, 2014 to September 15, 2014;
the preset threshold P0 can be set according to actual needs and can be determined according to the target quality of the data cleaning.
Further, the processing module 32 performs higher-order tensor dimension expansion on the dimension-expandable field, and obtaining M tensor field sets includes:
the processing module 32 sequentially performs high-order tensor dimension expansion on the expandable dimension fields by using a tensor decomposition algorithm to obtain a plurality of tensor fields, and classifies the tensor fields into M tensor field sets according to field semantic similarity; m is a positive integer;
wherein the tensor fields are the factor matrices on multiple dimensions into which the expandable dimension field is decomposed after high-order tensor dimension expansion; for example, performing high-order tensor dimension expansion on the IMEI number field yields tensor fields such as the manufacturer mobile_vendor and the user terminal type mobile_type;
the expandable dimension field is a field from which more information can be obtained after its dimension is expanded; for example, an IMEI number field;
the classifying the plurality of tensor fields into sets of M tensor fields by field semantic similarity comprises:
calculating the similarity among the tensor fields according to field semantics, combining the tensor fields with the similarity larger than a preset threshold of the similarity into a tensor field set, and further classifying the tensor fields into M tensor field sets;
wherein the preset threshold value of the similarity may be 0.5, and the range of the similarity is (0, 1);
here, the field semantics are the meaning of the field itself;
the set of tensor fields may be: a traffic-related tensor field set or a terminal-related tensor field set; for example: the set of terminal-related tensor fields includes: tensor fields such as terminal type, manufacturer, place of production, etc.
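The IMEI example can be made concrete with a deliberately simplified stand-in. The sketch below does not perform a tensor decomposition: it swaps in a plain lookup on the Type Allocation Code (the first 8 digits of a 15-digit IMEI, which identify the device model) to show how one field expands into the derived fields mobile_vendor and mobile_type. The registry contents and the function name are hypothetical.

```python
# Hypothetical TAC registry; real allocations are not reproduced here.
TAC_REGISTRY = {
    "86012345": {"mobile_vendor": "VendorA", "mobile_type": "A1"},
    "35698765": {"mobile_vendor": "VendorB", "mobile_type": "B2"},
}


def expand_imei(imei):
    """Expand an IMEI string into derived fields, or return {} if it cannot be expanded."""
    if not (isinstance(imei, str) and len(imei) == 15 and imei.isdigit()):
        return {}  # malformed or missing IMEI: nothing to derive
    return dict(TAC_REGISTRY.get(imei[:8], {}))


expanded = expand_imei("860123456789012")
print(expanded)  # {'mobile_vendor': 'VendorA', 'mobile_type': 'A1'}
```

Fields derived this way would then be grouped with other terminal-related fields before being used for cleaning.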
Further, the data cleaning module 33 performs data cleaning on the field to be cleaned by using the tensor field related to the field to be cleaned in the tensor field set, including:
the data cleaning module 33 performs semantic analysis on the field to be cleaned, acquires a tensor field set corresponding to the field to be cleaned in the tensor field set according to the field type, further acquires a tensor field related to the field to be cleaned in the tensor field set, and performs data cleaning on the field to be cleaned by using the tensor field;
wherein the field category may be a field attribute; such as: a set of tensor fields associated with traffic or a set of tensor fields associated with a terminal;
the tensor field related to the field to be cleaned in the tensor field set may be: a tensor field in the tensor field set whose semantics are basically consistent with those of the field to be cleaned, for example a tensor field that represents the same attribute as the field to be cleaned, such as the user terminal type mobile_type;
or, a tensor field having a functional dependency relationship with the field to be cleaned, for example, the field X to be cleaned depends on the tensor field Y in the tensor field set;
the data cleaning module 33 performing data cleaning on the field to be cleaned by using the tensor field includes:
the data cleaning module 33 fills the missing values and repairs the error values of the field to be cleaned by using the tensor field whose semantics are basically consistent with those of the field to be cleaned, and repairs the inconsistent values by using the tensor field having a functional dependency relationship with the field to be cleaned.
Further, the device also comprises an updating module 36, configured to update the cleaned data into the database and to record a cleaning log;
here, the cleaning log includes the cleaning time, the original data, the cleaning operation, the cleaned data, the recorder, and the like;
wherein the cleaning time is the specific time at which the data cleaning is executed; the original data is the data before cleaning; the cleaning operation is the specific operation applied to the data to be cleaned, such as deletion or modification;
the cleaning log is recorded to facilitate subsequent quality analysis of the data, restoration of the original data, and the like.
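By way of example only, a cleaning log entry holding the items listed above could be represented as in the following Python sketch. The class name, attribute names, the helper record_cleaning, and the sample values are hypothetical; the embodiment specifies only which items the log contains, not how they are stored.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, List


@dataclass(frozen=True)
class CleaningLogEntry:
    cleaning_time: datetime  # specific time at which the cleaning was executed
    original_data: Any       # data before cleaning
    operation: str           # e.g. "delete" or "modify"
    cleaned_data: Any        # data after cleaning
    recorder: str            # who or what recorded the entry


def record_cleaning(
    log: List[CleaningLogEntry],
    original: Any,
    operation: str,
    cleaned: Any,
    recorder: str,
) -> CleaningLogEntry:
    """Append one entry to the log, stamped with the current UTC time."""
    entry = CleaningLogEntry(datetime.now(timezone.utc), original, operation, cleaned, recorder)
    log.append(entry)
    return entry


log: List[CleaningLogEntry] = []
entry = record_cleaning(
    log,
    original={"manufacturer": "VendorB"},
    operation="modify",
    cleaned={"manufacturer": "VendorA"},
    recorder="updating_module_36",  # hypothetical recorder identifier
)
restored = entry.original_data  # the original data can be recovered from the log
```

Because each entry keeps both the original and the cleaned data, restoring the original data amounts to reading original_data back from the corresponding entry.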
In the embodiment of the present invention, the data cleaning apparatus may be located in a server, and the obtaining module 31, the processing module 32, the data cleaning module 33, the recording module 34, the establishing module 35, and the updating module 36 may all be implemented by a central processing unit (CPU), a digital signal processor (DSP), or a field programmable gate array (FPGA) in the server.
The above description covers only the preferred embodiments of the present invention and is not intended to limit the scope of protection of the present invention.

Claims (10)

1. A method of data cleansing, the method comprising:
acquiring data to be cleaned, and acquiring fields to be cleaned in the data to be cleaned according to analysis of noise data distribution in the data to be cleaned;
searching expandable dimension fields in the data to be cleaned, and performing high-order tensor dimension expansion on the expandable dimension fields to obtain M tensor field sets; wherein M is a positive integer;
and carrying out data cleaning on the field to be cleaned by using the tensor field related to the field to be cleaned in the tensor field set;
wherein the tensor field related to the field to be cleaned in the tensor field set is: a tensor field whose semantics are substantially consistent with those of the field to be cleaned, or a tensor field having a functional dependency relationship with the field to be cleaned;
the tensor field set includes: a traffic-related tensor field set or a terminal-related tensor field set;
the terminal-related tensor field set includes at least one of the following tensor fields: terminal type, manufacturer, and place of production;
the data cleaning of the field to be cleaned by using the tensor field related to the field to be cleaned in the tensor field set comprises: filling missing values and repairing erroneous values of the field to be cleaned by using a tensor field whose semantics are substantially consistent with those of the field to be cleaned, or repairing the field to be cleaned by using a tensor field having a functional dependency relationship with the field to be cleaned.
2. The method of claim 1, wherein after the obtaining the data to be cleaned, the method further comprises: entering the data to be cleaned into an established database, and optimizing the database to obtain an original database.
3. The method according to claim 1 or 2, wherein the obtaining the field to be cleaned in the data to be cleaned according to the analysis of the distribution of the noise data in the data to be cleaned comprises:
acquiring the probability P that a noise data value occurs in any field of the data to be cleaned within a specified time period, where P = m/n, m is the number of occurrences of the noise data value within the specified time period, and n is the total number of data records within the specified time period;
when the value of the probability P is determined to be greater than a preset threshold P0, marking the field to which the noise data belongs as a field to be cleaned.
4. The method of claim 1 or 2, wherein performing high-order tensor dimension expansion on the expandable dimension fields to obtain M tensor field sets comprises:
sequentially carrying out high-order tensor dimension expansion on the expandable dimension fields by using a tensor decomposition algorithm to obtain a plurality of tensor fields, and classifying the tensor fields into M tensor field sets according to field semantic similarity.
5. The method according to claim 1 or 2, wherein the data cleaning of the field to be cleaned by using the tensor field related to the field to be cleaned in the tensor field set comprises:
carrying out semantic analysis on the field to be cleaned, acquiring, according to the field category, the tensor field set corresponding to the field to be cleaned among the tensor field sets, further acquiring the tensor field related to the field to be cleaned in that tensor field set, and carrying out data cleaning on the field to be cleaned by using the tensor field.
6. A data cleaning apparatus, the apparatus comprising: an acquisition module, a processing module, and a data cleaning module; wherein,
the acquisition module is used for acquiring data to be cleaned and acquiring fields to be cleaned in the data to be cleaned according to analysis of noise data distribution in the data to be cleaned;
the processing module is used for searching the expandable dimension field in the data to be cleaned, and performing high-order tensor dimension expansion on the expandable dimension field to obtain M tensor field sets; wherein M is a positive integer;
the data cleaning module is configured to perform data cleaning on the field to be cleaned by using the tensor field related to the field to be cleaned in the tensor field set, where the tensor field related to the field to be cleaned in the tensor field set is: a tensor field whose semantics are substantially consistent with those of the field to be cleaned, or a tensor field having a functional dependency relationship with the field to be cleaned; the tensor field set includes: a traffic-related tensor field set or a terminal-related tensor field set; the terminal-related tensor field set includes at least one of the following tensor fields: terminal type, manufacturer, and place of production;
the data cleaning module is further configured to fill missing values and repair erroneous values of the field to be cleaned by using a tensor field whose semantics are substantially consistent with those of the field to be cleaned, or to repair the field to be cleaned by using a tensor field having a functional dependency relationship with the field to be cleaned.
7. The apparatus of claim 6, further comprising: an entry module, configured to enter the data to be cleaned into an established database and to optimize the database to obtain an original database.
8. The apparatus according to claim 6 or 7, wherein the acquisition module is specifically configured to acquire the probability P that a noise data value occurs in any field of the data to be cleaned within a specified time period, where P = m/n, m is the number of occurrences of the noise data value within the specified time period, and n is the total number of data records within the specified time period;
and, when the value of the probability P is determined to be greater than a preset threshold P0, to mark the field to which the noise data belongs as a field to be cleaned.
9. The apparatus according to claim 6 or 7, wherein the processing module is specifically configured to perform high-order tensor dimension expansion on the expandable dimension fields in sequence by using a tensor decomposition algorithm to obtain a plurality of tensor fields, and classify the plurality of tensor fields into M tensor field sets according to field semantic similarity.
10. The apparatus according to claim 6 or 7, wherein the data cleaning module is specifically configured to perform semantic analysis on the field to be cleaned, obtain, according to the field category, the tensor field set corresponding to the field to be cleaned among the tensor field sets, further obtain the tensor field related to the field to be cleaned in that tensor field set, and perform data cleaning on the field to be cleaned by using the tensor field.
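
As a non-limiting illustration of the noise-probability rule recited in claims 3 and 8 (P = m/n compared against a preset threshold P0), a minimal Python sketch is given below. The function name, the is_noise predicate, and the sample records are assumptions introduced for this example; the claims do not prescribe how a noise data value is recognized, and the records passed in are assumed to be those falling within the specified time period.

```python
from typing import Any, Callable, Dict, List, Mapping, Sequence


def fields_to_clean(
    records: Sequence[Mapping[str, Any]],
    is_noise: Callable[[str, Any], bool],
    p0: float,
) -> List[str]:
    """Return the fields whose noise probability P = m/n exceeds the threshold p0."""
    n = len(records)  # total number of data records in the period
    if n == 0:
        return []
    noise_counts: Dict[str, int] = {}
    for record in records:
        for name, value in record.items():
            if is_noise(name, value):
                # m: number of noise occurrences observed for this field
                noise_counts[name] = noise_counts.get(name, 0) + 1
    return sorted(name for name, m in noise_counts.items() if m / n > p0)


# Hypothetical data: a missing value (None) is treated as noise.
records = [
    {"mobile_type": None, "traffic": 10},
    {"mobile_type": None, "traffic": None},
    {"mobile_type": "ModelA", "traffic": 12},
    {"mobile_type": "ModelB", "traffic": 7},
]
marked = fields_to_clean(records, lambda name, value: value is None, p0=0.3)
```

In this made-up data set, mobile_type has P = 2/4 and traffic has P = 1/4, so with P0 = 0.3 only mobile_type is marked as a field to be cleaned.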
CN201410503126.0A 2014-09-26 2014-09-26 Data cleaning method and device Active CN105468658B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410503126.0A CN105468658B (en) 2014-09-26 2014-09-26 Data cleaning method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410503126.0A CN105468658B (en) 2014-09-26 2014-09-26 Data cleaning method and device

Publications (2)

Publication Number Publication Date
CN105468658A CN105468658A (en) 2016-04-06
CN105468658B true CN105468658B (en) 2020-04-03

Family

ID=55606362

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410503126.0A Active CN105468658B (en) 2014-09-26 2014-09-26 Data cleaning method and device

Country Status (1)

Country Link
CN (1) CN105468658B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101507B (en) * 2017-06-20 2023-09-26 Tencent Technology (Shenzhen) Co., Ltd. Data processing method, device, computer equipment and storage medium
CN110596594A (en) * 2019-09-23 2019-12-20 Guangdong Yuxiu Technology Co., Ltd. Method for predicting SOE of rail-traffic lithium battery through big data
CN114880314B (en) * 2022-05-23 2023-03-24 Beijing Zhengyuanda Technology Co., Ltd. Big data cleaning decision-making method applying artificial intelligence strategy and AI processing system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101593214A (en) * 2008-05-28 2009-12-02 NEC Laboratories America, Inc. System and method for processing high-dimensional data
CN101986296A (en) * 2010-10-28 2011-03-16 Zhejiang University Noise data cleaning method based on semantic ontology
CN103473375A (en) * 2013-09-29 2013-12-25 Founder International Software Co., Ltd. Data cleaning method and data cleaning system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6553327B2 (en) * 1998-09-16 2003-04-22 Yeda Research & Development Co., Ltd. Apparatus for monitoring a system with time in space and method therefor
US7904583B2 (en) * 2003-07-11 2011-03-08 Ge Fanuc Automation North America, Inc. Methods and systems for managing and controlling an automation control module system

Also Published As

Publication number Publication date
CN105468658A (en) 2016-04-06

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant