CN105468658A - Data cleaning method and apparatus - Google Patents
Data cleaning method and apparatus
- Publication number
- CN105468658A; CN201410503126.0A
- Authority
- CN
- China
- Prior art keywords
- field
- cleaned
- data
- tensor
- fields
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention discloses a data cleaning method. The method comprises: acquiring to-be-cleaned data, and according to analysis for noise data distribution in the to-be-cleaned data, obtaining to-be-cleaned fields in the to-be-cleaned data; searching a field which can be subjected to dimension expansion in the to-be-cleaned data, and performing high-order tensor dimension-expanding on the field which can be subjected to dimension expansion to obtain M tensor fieldsets; and carrying out data cleaning on the to-be-cleaned fields by using tensor fields in the tensor fieldsets, which are associated with the to-be-cleaned fields. The present invention also discloses a data cleaning apparatus.
Description
Technical Field
The invention relates to a data processing technology in the field of computers, in particular to a data cleaning method and device.
Background
With the progress of science and technology and the rapid development of computer technology, people can obtain more and more digital information, and meanwhile, more time needs to be invested to organize and arrange the information. Before statistical analysis is performed on the data, dirty data, i.e., noise data, in the data needs to be filtered out to ensure the accuracy of statistics. Data cleansing is a process of detecting and eliminating errors and inconsistencies of data in a database and improving data quality, and the principle is to convert the data into data meeting data quality requirements by using related technologies.
However, in the related art of the existing data cleansing, at least the following problems exist: 1) the related technology mainly aims at the real-time historical database for processing, and the applicability to non-real-time historical data is not high; 2) the related technology has low efficiency of cleaning data with low relevance; 3) in the related technology, the cleaning process is only suitable for sample data, and the cleaning of mass data cannot be realized.
Disclosure of Invention
In view of this, embodiments of the present invention are intended to provide a data cleansing method and apparatus, which can accurately find the data quality problem and effectively complete the cleansing of data.
In order to achieve the purpose, the technical scheme of the invention is realized as follows:
the embodiment of the invention provides a data cleaning method, which comprises the following steps:
acquiring data to be cleaned, and acquiring fields to be cleaned in the data to be cleaned according to analysis of noise data distribution in the data to be cleaned;
searching expandable dimension fields in the data to be cleaned, and performing high-order tensor dimension expansion on the expandable dimension fields to obtain M tensor field sets; wherein M is a positive integer;
and carrying out data cleaning on the field to be cleaned by utilizing the tensor field related to the field to be cleaned in the tensor field set.
In the foregoing scheme, after the data to be cleaned is acquired, the method further includes: and inputting the data to be cleaned into an established database, and optimizing the database to obtain an original database.
In the foregoing solution, the obtaining the field to be cleaned in the data to be cleaned according to the analysis of the noise data distribution in the data to be cleaned includes:
acquiring the probability P that a noise data value occurs in any field of the data to be cleaned within a specified time period, where P = m/n, m is the number of occurrences of the noise data value within the specified time period, and n is the total number of data records within the specified time period;
when the probability P is determined to be greater than a preset threshold P0, marking the field to which the noise data belongs as a field to be cleaned.
In the foregoing solution, the performing high-order tensor dimension expansion on the expandable dimension field to obtain M tensor field sets includes:
and sequentially carrying out high-order tensor dimension expansion on the expandable dimension fields by using a tensor decomposition algorithm to obtain a plurality of tensor fields, and classifying the tensor fields into M tensor field sets according to field semantic similarity.
In the foregoing solution, the performing data cleaning on the field to be cleaned by using the tensor field related to the field to be cleaned in the tensor field set includes:
and carrying out semantic analysis on the field to be cleaned, acquiring a tensor field set corresponding to the field to be cleaned in the tensor field set according to field types, further acquiring a tensor field relevant to the field to be cleaned in the tensor field set, and carrying out data cleaning on the field to be cleaned by utilizing the tensor field.
The embodiment of the invention also provides a data cleaning device, which comprises: the system comprises an acquisition module, a processing module and a data cleaning module; wherein,
the acquisition module is used for acquiring data to be cleaned and acquiring fields to be cleaned in the data to be cleaned according to analysis of noise data distribution in the data to be cleaned;
the processing module is used for searching the expandable dimension field in the data to be cleaned, and performing high-order tensor dimension expansion on the expandable dimension field to obtain M tensor field sets; wherein M is a positive integer;
and the data cleaning module is used for performing data cleaning on the field to be cleaned by utilizing the tensor field related to the field to be cleaned in the tensor field set.
In the above scheme, the apparatus further comprises: and the entry module is used for entering the data to be cleaned into an established database and optimizing the database to obtain an original database.
In the above scheme, the obtaining module is specifically configured to obtain the probability P that a noise data value occurs in any field of the data to be cleaned within a specified time period, where P = m/n, m is the number of occurrences of the noise data value within the specified time period, and n is the total number of data records within the specified time period;
and, when the probability P is determined to be greater than a preset threshold P0, to mark the field to which the noise data belongs as a field to be cleaned.
In the foregoing solution, the processing module is specifically configured to sequentially perform high-order tensor dimension expansion on the expandable dimension fields by using a tensor decomposition algorithm, obtain a plurality of tensor fields, and classify the plurality of tensor fields into M tensor field sets according to field semantic similarity.
In the foregoing solution, the data cleaning module is specifically configured to perform semantic analysis on the field to be cleaned, obtain, according to a field type, a tensor field set corresponding to the field to be cleaned in the tensor field set, further obtain a tensor field related to the field to be cleaned in the tensor field set, and perform data cleaning on the field to be cleaned by using the tensor field.
The data cleaning method and the data cleaning device provided by the embodiment of the invention acquire data to be cleaned and obtain fields to be cleaned in the data to be cleaned according to analysis of noise data distribution in the data to be cleaned; search for expandable dimension fields in the data to be cleaned, and perform high-order tensor dimension expansion on the expandable dimension fields to obtain M tensor field sets; and carry out data cleaning on the field to be cleaned by utilizing the tensor field related to the field to be cleaned in the tensor field set. Therefore, by analyzing the noise data distribution in the data to be cleaned, fields that have quality problems but cannot be found by common data cleaning rules and detection methods are accurately identified, and the cleaning of massive data, non-real-time historical data, or data with low relevance is effectively completed based on high-order tensor dimension expansion.
Drawings
FIG. 1 is a schematic flow chart of a data cleaning method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a data cleaning method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a data cleaning apparatus according to an embodiment of the present invention.
Detailed Description
In the embodiment of the invention, data to be cleaned is obtained, and fields to be cleaned in the data to be cleaned are obtained according to analysis of noise data distribution in the data to be cleaned; searching expandable dimension fields in the data to be cleaned, and performing high-order tensor dimension expansion on the expandable dimension fields to obtain M tensor field sets; carrying out data cleaning on the field to be cleaned by utilizing the tensor field related to the field to be cleaned in the tensor field set; wherein M is a positive integer.
Fig. 1 is a schematic flow chart of a data cleaning method according to an embodiment of the present invention, and as shown in fig. 1, the flow chart of the data cleaning method according to the embodiment includes:
step 101: acquiring data to be cleaned, and acquiring fields to be cleaned in the data to be cleaned according to analysis of noise data distribution in the data to be cleaned;
this step can be realized by an acquisition module in the server;
before this step, the method further comprises: analyzing a data source, and establishing a database according to data characteristics;
here, the data characteristics may include: number of fields, type of fields, attributes of fields, semantics of fields, etc.
Further, after the data to be cleaned is obtained, before the field to be cleaned in the data to be cleaned is obtained according to analysis of noise data distribution in the data to be cleaned, the method further includes:
inputting the data to be cleaned into an established database, and optimizing the database to obtain an original database;
here, the optimizing the database includes: repairing problems that arise when the data to be cleaned is entered into the database, such as: the length of the data table is not enough;
the obtained original database is as follows: the data to be cleaned is completely entered in the database.
Further, the acquiring data to be cleaned includes: acquiring data to be cleaned from the data source by using a database management tool;
the obtaining the field to be cleaned in the data to be cleaned according to the analysis of the noise data distribution in the data to be cleaned comprises:
acquiring the probability P that a noise data value occurs in any field of the data to be cleaned within a specified time period, where P = m/n, m is the number of occurrences of the noise data value within the specified time period, and n is the total number of data records within the specified time period;
when the probability P is determined to be greater than a preset threshold P0, marking the field to which the noise data belongs as a field to be cleaned;
wherein the noise data values may include: missing values, error values, inconsistent values, and the like;
the specified time period is a time period set according to actual needs, and can be one or several days, one or several months, one or several quarters, and the like; for example, September 10, 2014 to September 15, 2014;
the preset threshold P0 can be set according to actual needs, and can be determined according to the target quality of the data cleaning.
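The selection of fields to be cleaned in step 101 can be sketched as follows. This is a minimal illustration rather than the implementation of the embodiment: the record layout, the `day` key, the set of values treated as noise, and the threshold value 0.3 are all assumptions made for the example.

```python
from datetime import date

# Assumption: missing values are represented by None, "" or the sentinel "N/A".
NOISE_SENTINELS = {None, "", "N/A"}


def noise_ratio(records, field, start, end,
                is_noise=lambda value: value in NOISE_SENTINELS):
    """Return P = m / n for one field within the period [start, end]."""
    window = [r for r in records if start <= r["day"] <= end]
    n = len(window)  # total number of data records in the period
    if n == 0:
        return 0.0
    m = sum(1 for r in window if is_noise(r.get(field)))  # noise occurrences
    return m / n


def fields_to_clean(records, fields, start, end, p0):
    """Mark every field whose noise ratio P is greater than the threshold P0."""
    return [f for f in fields if noise_ratio(records, f, start, end) > p0]


# Hypothetical sample records.
records = [
    {"day": date(2014, 9, 10), "imei": "356938035643809", "mobile_type": "A1"},
    {"day": date(2014, 9, 11), "imei": "356938035643809", "mobile_type": None},
    {"day": date(2014, 9, 12), "imei": "490154203237518", "mobile_type": ""},
    {"day": date(2014, 9, 13), "imei": None, "mobile_type": "B2"},
    {"day": date(2014, 9, 20), "imei": None, "mobile_type": None},  # outside the period
]

flagged = fields_to_clean(records, ["imei", "mobile_type"],
                          date(2014, 9, 10), date(2014, 9, 15), p0=0.3)
```

In this sample, four records fall within the period; the `imei` field has one noise value (P = 0.25) and the `mobile_type` field has two (P = 0.5), so only `mobile_type` exceeds the assumed threshold.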
Step 102: searching expandable dimension fields in the data to be cleaned, and performing high-order tensor dimension expansion on the expandable dimension fields to obtain M tensor field sets;
this step can be implemented by a processing module in the server;
here, the expandable dimension field is a field from which more information can be obtained after the dimension of the field is expanded; for example, an IMEI number field;
the searching for the expandable dimension field in the data to be cleaned comprises:
matching fields in the data to be cleaned with expandable dimension fields in a preset expandable dimension field library to obtain expandable dimension fields in the data to be cleaned;
the performing high-order tensor dimension expansion on the expandable dimension field to obtain M tensor field sets includes:
sequentially performing high-order tensor dimension expansion on the expandable dimension fields by using the Tucker tensor decomposition algorithm to obtain a plurality of tensor fields, and classifying the tensor fields into M tensor field sets according to field semantic similarity; M is a positive integer;
wherein a tensor field is one of the factor matrices, over multiple dimensions, into which the expandable dimension field is decomposed after high-order tensor dimension expansion; for example, performing high-order tensor dimension expansion on the IMEI number field yields tensor fields such as the manufacturer mobile_vendor and the user terminal type mobile_type;
the classifying the plurality of tensor fields into sets of M tensor fields by field semantic similarity comprises:
calculating the similarity among the tensor fields according to field semantics, combining the tensor fields with the similarity larger than a preset threshold of the similarity into a tensor field set, and further classifying the tensor fields into M tensor field sets;
wherein the preset threshold value of the similarity may be 0.5, and the range of the similarity is (0, 1);
here, the field semantics are the meaning of the field itself;
the set of tensor fields may be: a traffic-related tensor field set or a terminal-related tensor field set; for example: the set of terminal-related tensor fields includes: tensor fields such as terminal type, manufacturer, place of production, etc.
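The matching against a preset library and the grouping by semantic similarity described in step 102 can be sketched as follows. The embodiment does not state how similarity is computed, so this example assumes a Jaccard similarity over hypothetical keyword sets and merges any two fields whose similarity is greater than the threshold 0.5; the library contents and keyword sets are likewise assumptions.

```python
# Assumption: a hypothetical preset library of expandable dimension fields.
EXPANDABLE_LIBRARY = {"imei", "ip_address"}


def find_expandable(fields):
    """Match field names against the preset library (case-insensitive)."""
    return [f for f in fields if f.lower() in EXPANDABLE_LIBRARY]


def jaccard(a, b):
    """Jaccard similarity of two keyword sets, in the range [0, 1]."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_fields(semantics, threshold=0.5):
    """Merge fields whose pairwise similarity is greater than the threshold.

    semantics maps each tensor field name to a set of descriptive keywords.
    Returns a list of M tensor field sets.
    """
    names = list(semantics)
    parent = {name: name for name in names}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if jaccard(semantics[a], semantics[b]) > threshold:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


# Hypothetical keyword descriptions of tensor fields.
semantics = {
    "mobile_vendor": {"terminal", "device", "hardware", "maker"},
    "mobile_type": {"terminal", "device", "hardware", "model"},
    "traffic_up": {"traffic", "bytes", "usage", "uplink"},
    "traffic_down": {"traffic", "bytes", "usage", "downlink"},
}
field_sets = group_fields(semantics)
```

With these keyword sets, the two terminal-related fields share three of five keywords (similarity 0.6), as do the two traffic-related fields, so the sketch yields M = 2 tensor field sets.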
Step 103: carrying out data cleaning on the field to be cleaned by utilizing the tensor field related to the field to be cleaned in the tensor field set;
the step can be realized by a data cleaning module in the server;
the method specifically comprises the following steps: semantic analysis is carried out on the field to be cleaned, a tensor field set corresponding to the field to be cleaned in the tensor field set is obtained according to field types, tensor fields related to the field to be cleaned in the tensor field set are further obtained, and data cleaning is carried out on the field to be cleaned through the tensor fields;
wherein the field category may be a field attribute; such as: a set of tensor fields associated with traffic or a set of tensor fields associated with a terminal;
the tensor fields in the tensor field set related to the field to be cleaned can be: tensor fields in the tensor field set whose semantics are basically consistent with those of the field to be cleaned; for example, tensor fields that represent the same attribute as the field to be cleaned, such as the user terminal type mobile_type;
or, a tensor field having a functional dependency relationship with the field to be cleaned, for example, the field X to be cleaned depends on the tensor field Y in the tensor field set;
the data cleaning of the field to be cleaned by using the tensor field comprises the following steps:
filling the missing values and correcting the error values of the field to be cleaned by using a tensor field whose semantics are basically consistent with those of the field to be cleaned, and repairing the inconsistent values by using a tensor field that has a functional dependency relationship with the field to be cleaned.
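Filling missing values and correcting error values from a semantically consistent tensor field can be sketched as follows. The embodiment does not specify how the IMEI number field is expanded; this example assumes a lookup keyed on the first eight digits of the IMEI (the Type Allocation Code), and the table contents, the set of valid types, and the function names are hypothetical.

```python
# Assumption: a hypothetical lookup from Type Allocation Code to terminal attributes.
TAC_TABLE = {
    "35693803": {"mobile_vendor": "VendorA", "mobile_type": "A1"},
    "49015420": {"mobile_vendor": "VendorB", "mobile_type": "B2"},
}
# Assumption: values outside this set are treated as error values.
VALID_TYPES = {"A1", "B2"}


def expand_imei(imei):
    """Return the tensor fields derived from an IMEI, or {} if unknown."""
    if not imei:
        return {}
    return TAC_TABLE.get(imei[:8], {})


def clean_mobile_type(row):
    """Fill a missing or invalid mobile_type from the derived tensor field."""
    derived = expand_imei(row.get("imei")).get("mobile_type")
    current = row.get("mobile_type")
    if derived is not None and current not in VALID_TYPES:
        return {**row, "mobile_type": derived}
    return row  # nothing to repair, or no related tensor field value available


rows = [
    {"imei": "356938035643809", "mobile_type": None},   # missing value
    {"imei": "490154203237518", "mobile_type": "??"},   # error value
    {"imei": "000000000000000", "mobile_type": None},   # IMEI not in the table
    {"imei": "356938035643809", "mobile_type": "A1"},   # already valid
]
cleaned = [clean_mobile_type(r) for r in rows]
```

Rows for which no related tensor field value can be derived are left unchanged, so that the sketch never invents a value.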
Further, after this step, the method further comprises: updating the cleaned data to a database and recording a cleaning log;
here, the cleansing log includes: cleaning time, original data, cleaning operation, cleaned data, a recorder and the like;
wherein the cleaning time is the specific time for executing the data cleaning; the original data is data before cleaning; the cleaning operation is a specific cleaning operation of the data to be cleaned, such as: delete, modify, etc.;
and recording the cleaning log so as to facilitate the subsequent quality analysis of the data, the restoration of the original data and the like.
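A cleaning log of this kind, and its use for restoring original data, can be sketched as follows. The entry layout follows the items listed above; the `field_name` attribute, the function names, and the rule that a new value of None means a delete operation are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CleaningLogEntry:
    cleaning_time: datetime   # when the cleaning was executed
    field_name: str           # assumption: which field was cleaned
    original_value: object    # data before cleaning
    operation: str            # "modify" or "delete"
    cleaned_value: object     # data after cleaning
    recorder: str             # who or what recorded the entry


def apply_and_log(row, field_name, new_value, recorder, log, now=None):
    """Return a cleaned copy of the row and append one entry to the log."""
    operation = "delete" if new_value is None else "modify"
    log.append(CleaningLogEntry(now or datetime.now(), field_name,
                                row.get(field_name), operation,
                                new_value, recorder))
    return {**row, field_name: new_value}


def restore(row, entry):
    """Undo one logged cleaning operation."""
    return {**row, entry.field_name: entry.original_value}


log = []
row = {"imei": "356938035643809", "mobile_type": "??"}
cleaned_row = apply_and_log(row, "mobile_type", "A1", "cleaner-01", log)
restored_row = restore(cleaned_row, log[0])
```

Because each entry keeps both the original and the cleaned value, the original record can be recovered from the cleaned record and the log alone.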
Fig. 2 is a schematic flow chart of a data cleaning method according to an embodiment of the present invention, and as shown in fig. 2, the flow chart of the data cleaning method according to the embodiment includes:
step 201: analyzing a data source, and establishing a database according to data characteristics;
here, the data characteristics may include: number of fields, type of fields, attributes of fields, semantics of fields, etc.
Step 202: acquiring data to be cleaned, inputting the data to be cleaned into an established database, and optimizing the database to obtain an original database;
here, the acquiring data to be cleaned includes: acquiring data to be cleaned from the data source by using a database management tool;
the optimizing the database comprises: repairing problems that arise when the data to be cleaned is entered into the database, such as: the length of the data table is not enough;
the obtained original database is as follows: the data to be cleaned is completely entered in the database.
Step 203: according to analysis of noise data distribution in the data to be cleaned, obtaining fields to be cleaned in the data to be cleaned;
the method specifically comprises the following steps: acquiring the probability P that a noise data value occurs in any field of the data to be cleaned within a specified time period, where P = m/n, m is the number of occurrences of the noise data value within the specified time period, and n is the total number of data records within the specified time period;
when the probability P is determined to be greater than a preset threshold P0, marking the field to which the noise data belongs as a field to be cleaned;
wherein the noise data values may include: missing values, error values, inconsistent values, and the like;
the specified time period is a time period set according to actual needs, and can be one or several days, one or several months, one or several quarters, and the like; for example, September 10, 2014 to September 15, 2014;
the preset threshold P0 can be set according to actual needs, and can be determined according to the target quality of the data cleaning.
Step 204: searching expandable dimension fields in the data to be cleaned, and performing high-order tensor dimension expansion on the expandable dimension fields to obtain M tensor field sets;
here, the expandable dimension field is a field from which more information can be obtained after the dimension of the field is expanded; for example, an IMEI number field;
the searching for the expandable dimension field in the data to be cleaned comprises:
matching fields in the data to be cleaned with expandable dimension fields in a preset expandable dimension field library to obtain expandable dimension fields in the data to be cleaned;
the performing high-order tensor dimension expansion on the expandable dimension field to obtain M tensor field sets includes:
sequentially performing high-order tensor dimension expansion on the expandable dimension fields by using the Tucker tensor decomposition algorithm to obtain a plurality of tensor fields, and classifying the tensor fields into M tensor field sets according to field semantic similarity; M is a positive integer;
wherein a tensor field is one of the factor matrices, over multiple dimensions, into which the expandable dimension field is decomposed after high-order tensor dimension expansion; for example, performing high-order tensor dimension expansion on the IMEI number field yields tensor fields such as the manufacturer mobile_vendor and the user terminal type mobile_type;
the classifying the plurality of tensor fields into sets of M tensor fields by field semantic similarity comprises:
calculating the similarity among the tensor fields according to field semantics, combining the tensor fields with the similarity larger than a preset threshold of the similarity into a tensor field set, and further classifying the tensor fields into M tensor field sets;
wherein the preset threshold value of the similarity may be 0.5, and the range of the similarity is (0, 1);
here, the field semantics are the meaning of the field itself;
the set of tensor fields may be: a traffic-related tensor field set or a terminal-related tensor field set; for example: the set of terminal-related tensor fields includes: tensor fields such as terminal type, manufacturer, place of production, etc.
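For orientation, a Tucker decomposition of a third-order tensor into a core tensor and one factor matrix per dimension can be sketched with a truncated higher-order SVD, which is one common way of computing it. The embodiment does not describe how field data is arranged as a tensor, so the tensor below (indexed by user, terminal type, and day) is a hypothetical random example, and the function names are assumptions.

```python
import numpy as np


def unfold(tensor, mode):
    """Mode-n unfolding: bring axis `mode` to the front and flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def mode_dot(tensor, matrix, mode):
    """Multiply the tensor by a matrix along axis `mode`."""
    moved = np.moveaxis(tensor, mode, 0)
    product = np.tensordot(matrix, moved, axes=(1, 0))
    return np.moveaxis(product, 0, mode)


def hosvd(tensor, ranks):
    """Truncated higher-order SVD: returns (core tensor, factor matrices)."""
    factors = []
    for mode, rank in enumerate(ranks):
        # Left singular vectors of each unfolding give that mode's factor matrix.
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :rank])
    core = tensor
    for mode, u in enumerate(factors):
        core = mode_dot(core, u.T, mode)
    return core, factors


def reconstruct(core, factors):
    """Rebuild the (approximate) tensor from the core and factor matrices."""
    out = core
    for mode, u in enumerate(factors):
        out = mode_dot(out, u, mode)
    return out


# Hypothetical tensor: 3 users x 4 terminal types x 2 days.
rng = np.random.default_rng(0)
tensor = rng.random((3, 4, 2))

core, factors = hosvd(tensor, ranks=(3, 4, 2))            # full ranks
small_core, small_factors = hosvd(tensor, ranks=(2, 2, 2))  # truncated ranks
```

With full ranks the factor matrices are orthogonal and the reconstruction reproduces the tensor up to floating-point error; with truncated ranks the result is only an approximation.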
Step 205: carrying out data cleaning on the field to be cleaned by utilizing the tensor field related to the field to be cleaned in the tensor field set;
the method specifically comprises the following steps: semantic analysis is carried out on the field to be cleaned, a tensor field set corresponding to the field to be cleaned in the tensor field set is obtained according to field types, tensor fields related to the field to be cleaned in the tensor field set are further obtained, and data cleaning is carried out on the field to be cleaned through the tensor fields;
wherein the field category may be a field attribute; such as: a set of tensor fields associated with traffic or a set of tensor fields associated with a terminal;
the tensor fields in the tensor field set related to the field to be cleaned can be: tensor fields in the tensor field set whose semantics are basically consistent with those of the field to be cleaned; for example, tensor fields that represent the same attribute as the field to be cleaned, such as the user terminal type mobile_type;
or, a tensor field having a functional dependency relationship with the field to be cleaned, for example, the field X to be cleaned depends on the tensor field Y in the tensor field set;
the data cleaning of the field to be cleaned by using the tensor field comprises the following steps:
filling the missing values and correcting the error values of the field to be cleaned by using a tensor field whose semantics are basically consistent with those of the field to be cleaned, and repairing the inconsistent values by using a tensor field that has a functional dependency relationship with the field to be cleaned.
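Repairing inconsistent values through a functional dependency, where the field X to be cleaned depends on a tensor field Y, can be sketched as follows. The embodiment does not state how the correct value of X is chosen for a given Y; this example assumes a majority vote among the records that share the same Y value, and the field and function names are hypothetical.

```python
from collections import Counter, defaultdict


def repair_by_dependency(rows, determinant, dependent):
    """Enforce determinant -> dependent by majority vote per determinant value."""
    votes = defaultdict(Counter)
    for row in rows:
        if row.get(dependent) is not None:
            votes[row.get(determinant)][row[dependent]] += 1

    # Assumption: the most frequent dependent value is taken as the correct one.
    canonical = {y: counter.most_common(1)[0][0] for y, counter in votes.items()}

    return [
        {**row, dependent: canonical.get(row.get(determinant), row.get(dependent))}
        for row in rows
    ]


# Hypothetical records: mobile_vendor (X) depends on mobile_type (Y).
rows = [
    {"mobile_type": "A1", "mobile_vendor": "VendorA"},
    {"mobile_type": "A1", "mobile_vendor": "VendorA"},
    {"mobile_type": "A1", "mobile_vendor": "VendorX"},  # inconsistent value
    {"mobile_type": "B2", "mobile_vendor": "VendorB"},
    {"mobile_type": "B2", "mobile_vendor": None},       # missing value
]
repaired = repair_by_dependency(rows, "mobile_type", "mobile_vendor")
```

A majority vote is only reasonable when most records are correct; where a trusted reference table for the dependency exists, it would be preferable to the vote.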
Step 206: updating the cleaned data to a database and recording a cleaning log;
here, the cleansing log includes: cleaning time, original data, cleaning operation, cleaned data, a recorder and the like;
wherein the cleaning time is the specific time for executing the data cleaning; the original data is data before cleaning; the cleaning operation is a specific cleaning operation of the data to be cleaned, such as: delete, modify, etc.;
and recording the cleaning log so as to facilitate the subsequent quality analysis of the data, the restoration of the original data and the like.
FIG. 3 is a schematic diagram of a data cleaning apparatus according to an embodiment of the present invention; as shown in fig. 3, the data cleaning apparatus according to the embodiment of the present invention includes: an acquisition module 31, a processing module 32 and a data cleaning module 33; wherein,
the acquiring module 31 is configured to acquire data to be cleaned, and acquire a field to be cleaned in the data to be cleaned according to analysis of noise data distribution in the data to be cleaned;
the processing module 32 is configured to search for an expandable dimension field in the data to be cleaned, and perform high-order tensor dimension expansion on the expandable dimension field to obtain M tensor field sets; wherein M is a positive integer;
the data cleaning module 33 is configured to perform data cleaning on the field to be cleaned by using the tensor field related to the field to be cleaned in the tensor field set.
Further, the apparatus further comprises: the establishing module 35 is used for analyzing the data source and establishing a database according to the data characteristics;
here, the data characteristics may include: number of fields, type of fields, attributes of fields, semantics of fields, etc.
Further, the apparatus further comprises: the entry module 34 is configured to enter the data to be cleaned into an established database, and optimize the database to obtain an original database;
the entry module 34 optimizing the database includes: the entry module 34 repairs problems that arise when the data to be cleaned is entered into the database, such as: the length of the data table is not enough;
the obtained original database is as follows: the data to be cleaned is completely entered in the database.
Further, the acquiring module 31 acquires the data to be cleaned, including: the obtaining module 31 obtains the data to be cleaned through the data source by using a database management tool.
Further, the obtaining, by the obtaining module 31, a field to be cleaned in the data to be cleaned according to analysis of distribution of noise data in the data to be cleaned includes:
the obtaining module 31 obtains the probability P that a noise data value occurs in any field of the data to be cleaned within a specified time period, where P = m/n, m is the number of occurrences of the noise data value within the specified time period, and n is the total number of data records within the specified time period;
when the probability P is determined to be greater than a preset threshold P0, the obtaining module 31 marks the field to which the noise data belongs as a field to be cleaned;
wherein the noise data values may include: missing values, error values, inconsistent values, and the like;
the specified time period is a time period set according to actual needs, and can be one or several days, one or several months, one or several quarters, and the like; for example, September 10, 2014 to September 15, 2014;
the preset threshold P0 can be set according to actual needs, and can be determined according to the target quality of the data cleaning.
Further, the processing module 32 performs higher-order tensor dimension expansion on the dimension-expandable field, and obtaining M tensor field sets includes:
the processing module 32 sequentially performs high-order tensor dimension expansion on the expandable dimension fields by using a tensor decomposition algorithm to obtain a plurality of tensor fields, and classifies the tensor fields into M tensor field sets according to field semantic similarity; m is a positive integer;
wherein a tensor field is one of the factor matrices, over multiple dimensions, into which the expandable dimension field is decomposed after high-order tensor dimension expansion; for example, performing high-order tensor dimension expansion on the IMEI number field yields tensor fields such as the manufacturer mobile_vendor and the user terminal type mobile_type;
the expandable dimension field is a field from which more information can be obtained after the dimension of the field is expanded; for example, an IMEI number field;
the classifying the plurality of tensor fields into sets of M tensor fields by field semantic similarity comprises:
calculating the similarity among the tensor fields according to field semantics, combining the tensor fields with the similarity larger than a preset threshold of the similarity into a tensor field set, and further classifying the tensor fields into M tensor field sets;
wherein the preset threshold value of the similarity may be 0.5, and the range of the similarity is (0, 1);
here, the field semantics are the meaning of the field itself;
the set of tensor fields may be: a traffic-related tensor field set or a terminal-related tensor field set; for example: the set of terminal-related tensor fields includes: tensor fields such as terminal type, manufacturer, place of production, etc.
Further, the data cleaning module 33 performs data cleaning on the field to be cleaned by using the tensor field related to the field to be cleaned in the tensor field set, including:
the data cleaning module 33 performs semantic analysis on the field to be cleaned, acquires a tensor field set corresponding to the field to be cleaned in the tensor field set according to the field type, further acquires a tensor field related to the field to be cleaned in the tensor field set, and performs data cleaning on the field to be cleaned by using the tensor field;
wherein the field category may be a field attribute; such as: a set of tensor fields associated with traffic or a set of tensor fields associated with a terminal;
the tensor fields in the tensor field set related to the field to be cleaned can be: tensor fields in the tensor field set whose semantics are basically consistent with those of the field to be cleaned; for example, tensor fields that represent the same attribute as the field to be cleaned, such as the user terminal type mobile_type;
or, a tensor field having a functional dependency relationship with the field to be cleaned, for example, the field X to be cleaned depends on the tensor field Y in the tensor field set;
the data cleaning module 33 performing data cleaning on the field to be cleaned by using the tensor field includes:
the data cleaning module 33 fills the missing values and corrects the error values of the field to be cleaned by using a tensor field whose semantics are basically consistent with those of the field to be cleaned, and repairs the inconsistent values by using a tensor field that has a functional dependency relationship with the field to be cleaned.
Further, the device further comprises an updating module 36, configured to update the cleaned data to the database and record a cleaning log;
here, the cleansing log includes: cleaning time, original data, cleaning operation, cleaned data, a recorder and the like;
wherein the cleaning time is the specific time for executing the data cleaning; the original data is data before cleaning; the cleaning operation is a specific cleaning operation of the data to be cleaned, such as: delete, modify, etc.;
and recording the cleaning log so as to facilitate the subsequent quality analysis of the data, the restoration of the original data and the like.
In the embodiment of the present invention, the data cleaning apparatus may be located in a server, and the obtaining module 31, the processing module 32, the data cleaning module 33, the logging module 34, the establishing module 35, and the updating module 36 may all be implemented by a Central Processing Unit (CPU) in the server, or a Digital Signal Processor (DSP), or a Field Programmable Gate Array (FPGA).
The above describes only preferred embodiments of the present invention and is not intended to limit the scope of protection of the present invention.
Claims (10)
1. A method of data cleansing, the method comprising:
acquiring data to be cleaned, and acquiring fields to be cleaned in the data to be cleaned according to analysis of noise data distribution in the data to be cleaned;
searching expandable dimension fields in the data to be cleaned, and performing high-order tensor dimension expansion on the expandable dimension fields to obtain M tensor field sets; wherein M is a positive integer;
and carrying out data cleaning on the field to be cleaned by utilizing the tensor field related to the field to be cleaned in the tensor field set.
2. The method of claim 1, wherein, after the acquiring of the data to be cleaned, the method further comprises: entering the data to be cleaned into an established database, and optimizing the database to obtain an original database.
3. The method according to claim 1 or 2, wherein the obtaining the field to be cleaned in the data to be cleaned according to the analysis of the distribution of the noise data in the data to be cleaned comprises:
acquiring the probability P of the occurrence of a noise data value in any field of the data to be cleaned in a specified time period, where P = m/n; wherein m is the number of times the noise data value occurs in the specified time period, and n is the total number of data records in the specified time period;
determining that the value of the probability P is greater than a preset threshold P0, and marking the field to which the noise data belongs as a field to be cleaned.
4. The method of claim 1 or 2, wherein performing a higher-order tensor expansion on the expandable dimension field to obtain M sets of tensor fields comprises:
sequentially performing high-order tensor dimension expansion on the expandable dimension fields by using a tensor decomposition algorithm to obtain a plurality of tensor fields, and classifying the plurality of tensor fields into M tensor field sets according to field semantic similarity.
5. The method according to claim 1 or 2, wherein the data cleaning of the field to be cleaned by using the tensor field related to the field to be cleaned in the tensor field set comprises:
performing semantic analysis on the field to be cleaned, selecting from the M tensor field sets, according to the field category, the tensor field set corresponding to the field to be cleaned, then obtaining from that set the tensor fields related to the field to be cleaned, and performing data cleaning on the field to be cleaned by using those tensor fields.
6. A data cleansing apparatus, said apparatus comprising: an acquisition module, a processing module and a data cleaning module; wherein,
the acquisition module is used for acquiring data to be cleaned and acquiring fields to be cleaned in the data to be cleaned according to analysis of noise data distribution in the data to be cleaned;
the processing module is used for searching the expandable dimension field in the data to be cleaned, and performing high-order tensor dimension expansion on the expandable dimension field to obtain M tensor field sets; wherein M is a positive integer;
and the data cleaning module is used for performing data cleaning on the field to be cleaned by utilizing the tensor field related to the field to be cleaned in the tensor field set.
7. The apparatus of claim 6, further comprising: an entry module, configured to enter the data to be cleaned into an established database and to optimize the database to obtain an original database.
8. The apparatus according to claim 6 or 7, wherein the acquisition module is specifically configured to acquire the probability P of the occurrence of a noise data value in any field of the data to be cleaned within a specified time period, where P = m/n; wherein m is the number of times the noise data value occurs in the specified time period, and n is the total number of data records in the specified time period;
and to determine that the value of the probability P is greater than a preset threshold P0, and mark the field to which the noise data belongs as a field to be cleaned.
9. The apparatus according to claim 6 or 7, wherein the processing module is specifically configured to perform high-order tensor dimension expansion on the expandable dimension fields in sequence by using a tensor decomposition algorithm to obtain a plurality of tensor fields, and classify the plurality of tensor fields into M tensor field sets according to field semantic similarity.
10. The apparatus according to claim 6 or 7, wherein the data cleaning module is specifically configured to perform semantic analysis on the field to be cleaned, obtain a tensor field set corresponding to the field to be cleaned in the tensor field set according to a field type, further obtain a tensor field related to the field to be cleaned in the tensor field set, and perform data cleaning on the field to be cleaned by using the tensor field.
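As a non-limiting illustration of the noise-probability test recited in claims 3 and 8, the following sketch computes P = m/n for each field and marks the fields whose P exceeds the preset threshold P0. The function names, the dictionary representation of records, and the caller-supplied predicate `is_noise` are assumptions for the example; the records passed in are assumed to be already restricted to the specified time period.

```python
def noise_probability(records, field, is_noise):
    """Return P = m / n for one field.

    n is the total number of records in the period and m is the number
    of records whose value in `field` is judged to be noise."""
    n = len(records)
    if n == 0:
        return 0.0  # assumption: an empty period contains no noise
    m = sum(1 for rec in records if is_noise(rec.get(field)))
    return m / n


def fields_to_clean(records, fields, is_noise, p0):
    """Mark every field whose noise probability is strictly greater than p0."""
    return [f for f in fields if noise_probability(records, f, is_noise) > p0]
```

For instance, if the predicate treats missing values as noise and a field is missing in 2 of 4 records, then P = 0.5, and the field is marked as a field to be cleaned for any threshold P0 below 0.5.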
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410503126.0A CN105468658B (en) | 2014-09-26 | 2014-09-26 | Data cleaning method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105468658A (en) | 2016-04-06 |
CN105468658B (en) | 2020-04-03 |
Family
ID=55606362
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410503126.0A Active CN105468658B (en) | 2014-09-26 | 2014-09-26 | Data cleaning method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105468658B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070150239A1 (en) * | 1997-01-21 | 2007-06-28 | Hadassa Degani | Apparatus for monitoring a system with time in space and method therefor |
CN101593214A (en) * | 2008-05-28 | 2009-12-02 | 美国日本电气实验室公司 | Be used to handle the system and method for high dimensional data |
US7904583B2 (en) * | 2003-07-11 | 2011-03-08 | Ge Fanuc Automation North America, Inc. | Methods and systems for managing and controlling an automation control module system |
CN101986296A (en) * | 2010-10-28 | 2011-03-16 | 浙江大学 | Noise data cleaning method based on semantic ontology |
CN103473375A (en) * | 2013-09-29 | 2013-12-25 | 方正国际软件有限公司 | Data cleaning method and data cleaning system |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109101507A (en) * | 2017-06-20 | 2018-12-28 | 腾讯科技(深圳)有限公司 | Data processing method, device, computer equipment and storage medium |
CN109101507B (en) * | 2017-06-20 | 2023-09-26 | 腾讯科技(深圳)有限公司 | Data processing method, device, computer equipment and storage medium |
CN110596594A (en) * | 2019-09-23 | 2019-12-20 | 广东毓秀科技有限公司 | Method for predicting SOE of rail-traffic lithium battery through big data |
CN114880314A (en) * | 2022-05-23 | 2022-08-09 | 烟台聚禄信息科技有限公司 | Big data cleaning decision-making method applying artificial intelligence strategy and AI processing system |
Also Published As
Publication number | Publication date |
---|---|
CN105468658B (en) | 2020-04-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Juddoo | Overview of data quality challenges in the context of Big Data | |
TWI640876B (en) | System and method for performing set operations with defined sketch accuracy distribution | |
CN113760891B (en) | Data table generation method, device, equipment and storage medium | |
CN105468677A (en) | Log clustering method based on graph structure | |
CN116881430B (en) | Industrial chain identification method and device, electronic equipment and readable storage medium | |
CN109558166B (en) | Code searching method oriented to defect positioning | |
KR102345410B1 (en) | Big data intelligent collecting method and device | |
CN105468658B (en) | Data cleaning method and device | |
CN112905380A (en) | System anomaly detection method based on automatic monitoring log | |
US20150113008A1 (en) | Providing automatable units for infrastructure support | |
CN106935038B (en) | Parking detection system and detection method | |
CN111709775A (en) | House property price evaluation method and device, electronic equipment and storage medium | |
CN115422201A (en) | Layer-level data analysis method and device and electronic equipment | |
CN115878599A (en) | Sewage industry data cleaning method | |
CN106776704B (en) | Statistical information collection method and device | |
CN116841779A (en) | Abnormality log detection method, abnormality log detection device, electronic device and readable storage medium | |
CN103475532A (en) | Hardware detection method and system thereof | |
CN115982655A (en) | Missing data flow abnormity prediction method based on decision tree | |
CN114218383A (en) | Method, device and application for judging repeated events | |
CN117609278A (en) | Multi-mode power data management method and system based on deep measurement learning | |
CN113779261A (en) | Knowledge graph quality evaluation method and device, computer equipment and storage medium | |
CN105677723A (en) | Method for establishing and searching data labels for industrial signal source | |
CN113127460A (en) | Evaluation method of data cleaning frame, device, equipment and storage medium thereof | |
CN114722960A (en) | Method and system for detecting incomplete track of event log in business process | |
CN111291376B (en) | Web vulnerability verification method based on crowdsourcing and machine learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||