CN113609110A

CN113609110A - Data cleaning method and device and computer storage medium

Info

Publication number: CN113609110A
Application number: CN202110757845.5A
Authority: CN
Inventors: 范启辉; 李金金; 张婷; 谢子秋
Original assignee: Yuncong Technology Group Co Ltd
Current assignee: Yuncong Technology Group Co Ltd
Priority date: 2021-07-05
Filing date: 2021-07-05
Publication date: 2021-11-05

Abstract

The application provides a data cleaning method, a data cleaning device and a computer storage medium. The method comprises: obtaining an original data cleaning rule and a target data cleaning rule; cleaning original data in an original data structure according to the original data cleaning rule to obtain a target data structure comprising target data; if target data matching the target data cleaning rule exists in the target data structure, cleaning that target data according to the target data structure and the target data cleaning rule; and repeating the target data cleaning step until no target data matching the target data cleaning rule exists in the target data structure. The method and the device thereby give structured data from different sources a uniform management specification and save data storage space.

Description

Data cleaning method and device and computer storage medium

Technical Field

The embodiment of the application relates to the technical field of data management, in particular to a data cleaning method and device and a computer storage medium.

Background

With the rapid development of the internet of things, artificial intelligence and data storage technologies, data in every industry shows the following characteristics. On the one hand, data volume grows exponentially. On the other hand, the spread of digital technology across the industrial internet of things, together with modern computing power, makes data complex, multi-source and structurally diverse.

In practical application scenarios, because of uncertain factors such as hardware performance and the robustness of software programs, data may be lost or corrupted at any time during network transmission, so that erroneous data is acquired. In addition, because hardware devices vary widely, the acquired data is usually irregular, which creates many obstacles to subsequent use of the data.

At present, in the internet field, at every stage of developing a large-scale engineering project, the data volume exceeds the storage and management capacity of current databases. Erroneous data not only occupies a large amount of storage space but also impairs the accuracy of business decisions.

According to surveys, structured data is widely used across industries because it is easy to store, query and modify. Statistically, 80% of researchers spend more than 40% of their project time on preprocessing structured data, which indicates a large market demand for data processing methods. In addition, given the characteristics of big data, the storage and indexing of structured data is another problem to be solved.

Most currently used data preprocessing technologies process structured data according to information such as structural attributes and time log sequences. How to screen useful, high-quality data out of massive structured data, reduce data dimensionality and reduce data volume is of great significance for subsequent data processing.

Data cleaning is widely applied across industries as an effective technique for improving data quality. However, the forms of erroneous data have become increasingly diverse, and traditional data cleaning methods cannot meet the requirements of current big data.

Disclosure of Invention

In view of the foregoing problems, the present application provides a data cleansing method, an apparatus, and a computer storage medium, which perform cleansing on structured data from different sources according to predefined data cleansing rules, so that original data and predefined data structures are kept consistent, and uniform management of data is facilitated.

A first aspect of the present application provides a data cleaning method, including: acquiring an original data cleaning rule and a target data cleaning rule; an original data cleaning step, in which cleaning is performed on original data in an original data structure according to the original data cleaning rule to obtain a target data structure comprising target data; and a target data cleaning step, in which, according to the target data structure and the target data cleaning rule, if target data matching the target data cleaning rule exists in the target data structure, cleaning is performed on the matching target data according to the target data cleaning rule, the target data cleaning step being repeated until no target data matching the target data cleaning rule exists in the target data structure.

A second aspect of the present application provides a computer storage medium, wherein instructions for executing the steps of the data cleansing method according to the first aspect are stored in the computer storage medium.

A third aspect of the present application provides a data cleaning apparatus, comprising: a rule obtaining module, configured to obtain an original data cleaning rule and a target data cleaning rule; and a data cleaning module, configured to execute an original data cleaning step, in which cleaning is performed on original data in an original data structure according to the original data cleaning rule to obtain a target data structure comprising target data, and to selectively execute a target data cleaning step, in which, if target data matching the target data cleaning rule exists in the target data structure, cleaning is performed on the matching target data according to the target data structure and the target data cleaning rule, the target data cleaning step being repeated until no target data matching the target data cleaning rule exists in the target data structure.

In summary, the data cleaning method, the data cleaning device and the computer storage medium provided by the application perform cleaning aiming at original data structures of different sources by predefining data cleaning rules, so as to obtain a uniform target data structure.

Moreover, different data cleaning rules can be defined according to different scenes, and the method and the device have better flexibility and expandability.

In addition, the data cleaning technology can remove redundant data in the original data structure, ensure that the generated target data structure is simpler, and save the data storage space.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments described in the embodiments of the present application, and other drawings can be obtained by those skilled in the art according to the drawings.

Fig. 1 is a schematic flow chart of a data cleaning method according to a first embodiment of the present application.

Fig. 2 is a schematic flow chart of a data cleaning method according to a second embodiment of the present application.

FIG. 3 is a schematic diagram illustrating a data cleansing apparatus according to a fourth embodiment of the present application.

Element number

300: data cleaning apparatus; 302: rule obtaining module; 304: data cleaning module.

Detailed Description

In order to make those skilled in the art better understand the technical solutions in the embodiments of the present application, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, but not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application shall fall within the scope of the protection of the embodiments in the present application.

First embodiment

Fig. 1 shows a schematic flow chart of a data cleansing method according to a first embodiment of the present application. As shown in the figure, the data cleaning method of the embodiment mainly includes the following steps:

Step S102, obtaining an original data cleaning rule and a target data cleaning rule.

Optionally, different original data cleaning rules and target data cleaning rules may be defined for different scenarios.

In this embodiment, different cleaning rules may correspond to different types of data structures.

In this embodiment, the original data cleaning rule may include an original data verification sub-rule and an original data mapping sub-rule.

In this embodiment, the target data cleaning rule includes a target data verification sub-rule.

In addition, the target data cleaning rule may optionally include a target data mapping sub-rule.

Step S104, cleaning the original data in the original data structure according to the original data cleaning rule to obtain a target data structure comprising target data.

Optionally, according to the original data verification sub-rule and the original data mapping sub-rule in the original data cleaning rule, verification and mapping may be performed in sequence on the original data in the original data structure to obtain a target data structure including target data.

For example, the original data that conforms to the original data cleansing rules (e.g., field names or field values of original fields in the original data structure) may be transformed, supplemented, and mapped, and the original data that does not conform to the original data cleansing rules may be marked to facilitate subsequent error troubleshooting and repair processes for the entire data system.

Step S106, determining whether target data matching the target data cleaning rule exists in the target data structure; if so, executing step S108; if not, exiting.

Exiting at this step indicates that the data cleaning is complete.

Step S108, cleaning the matching target data according to the target data cleaning rule, and returning to step S106.

In this embodiment, after the cleaning of the target data in the target data structure matching the target data cleaning rule is completed according to the target data cleaning rule, the step S106 is returned to detect whether there is any target data matching the target data cleaning rule in the current target data structure again until there is no target data matching the target data cleaning rule in the target data structure.
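As a non-limiting illustration, the flow of steps S104 to S108 might be sketched in Python as follows. The names `TargetRule`, `clean` and `to_target`, the predicate-plus-fix shape of a rule, and the `max_rounds` safety cap are assumptions made for this sketch and are not defined by the application.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


@dataclass
class TargetRule:
    """Hypothetical target data cleaning rule: a predicate plus a fix."""
    matches: Callable[[Record], bool]   # S106: is there still data to clean?
    apply: Callable[[Record], Record]   # S108: clean the matched data


def clean(original: Record,
          original_rule: Callable[[Record], Record],
          target_rules: List[TargetRule],
          max_rounds: int = 100) -> Record:
    # S104: clean the original data structure to get the target data structure.
    target = original_rule(original)
    # S106/S108: repeat until no target data matches any target rule.
    # max_rounds is an added safety cap, not part of the described method.
    for _ in range(max_rounds):
        pending = [rule for rule in target_rules if rule.matches(target)]
        if not pending:
            return target
        for rule in pending:
            target = rule.apply(target)
    raise RuntimeError("target data cleaning did not converge")


# Illustrative original data cleaning rule: rename fields.
def to_target(rec: Record) -> Record:
    return {"name": rec.get("Name", ""), "age": rec.get("Age")}


# Illustrative target data cleaning rules: trim whitespace, convert age.
rules = [
    TargetRule(lambda r: r["name"] != r["name"].strip(),
               lambda r: {**r, "name": r["name"].strip()}),
    TargetRule(lambda r: isinstance(r["age"], str),
               lambda r: {**r, "age": int(r["age"])}),
]

result = clean({"Name": "  Li ", "Age": "42"}, to_target, rules)
```

In this sketch the loop ends on the second pass, when neither rule matches the target data structure any longer.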

In summary, the data cleaning method provided in this embodiment can clean original data structures from different sources according to the preset original data cleaning rule and target data cleaning rule to obtain a uniform target data structure. This enables uniform management of data and helps improve the efficiency of data processing and analysis. In addition, the data cleaning process can remove redundant data from the original data structure, which saves data storage space and facilitates data maintenance.

Second embodiment

Fig. 2 is a schematic flow chart of a data cleansing method according to a second embodiment of the present application. As shown in the figure, the data cleaning method of the embodiment mainly includes the following steps:

Step S202, verifying the original data in the original data structure according to the original data verification sub-rule in the original data cleaning rule.

In this embodiment, the original data verification sub-rule is a rule for comparing, converting or screening each original field in the original data structure.

Optionally, the original data verification sub-rule defines the field name and field type of each original field in the original data structure; that is, it specifies which original fields the original data structure should contain and which data type (e.g., a numeric type or a string type) each original field should have.

Optionally, verifying the original data structure according to the original data verification sub-rule may include at least one of: performing an existence check on the field name of each original field; performing a field type check on the field type of each original field; performing a value check on the field value of each original field; and performing a uniqueness check on the field value of each original field.

In this embodiment, the existence check is used to check whether an original field that should exist is present.

In this embodiment, the field type check is used to determine whether the field type of an original field is consistent with the field type specified in the original data verification sub-rule. If the two are inconsistent, a conversion may be attempted first; if the conversion is not possible, the check fails. The field types include, but are not limited to, the types shown in Table 1 below:

Field type	Description
cds_str	String type
cds_num	Numeric type
cds_map	Dictionary type
cds_arr	Array type
cds_bool	Boolean type

(Table 1)
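As a non-limiting illustration, the field type check with its conversion attempt might be sketched in Python as follows. The mapping from the Table 1 type names to Python types, the function name `check_field_type`, and the specific conversions attempted are assumptions made for this sketch.

```python
from typing import Any, Tuple

# Hypothetical mapping from the Table 1 type names to Python types.
FIELD_TYPES = {
    "cds_str": str,
    "cds_num": float,
    "cds_map": dict,
    "cds_arr": list,
    "cds_bool": bool,
}


def check_field_type(value: Any, type_name: str) -> Tuple[bool, Any]:
    """Return (passed, value), where value may have been converted."""
    expected = FIELD_TYPES[type_name]
    if type_name == "cds_num":
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            return True, value
    elif isinstance(value, expected):
        return True, value
    # The types differ: attempt a conversion before failing the check.
    try:
        if type_name == "cds_bool":
            text = str(value).strip().lower()
            if text in ("true", "1"):
                return True, True
            if text in ("false", "0"):
                return True, False
            return False, value
        if type_name in ("cds_map", "cds_arr"):
            return False, value  # no safe conversion in this sketch
        return True, expected(value)
    except (TypeError, ValueError):
        return False, value  # conversion not possible: the check fails
```

For example, `check_field_type("3.5", "cds_num")` passes after converting the string to a number, whereas `check_field_type("abc", "cds_num")` fails because no conversion is possible.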

In this embodiment, the value check is used to check the field value of the original field.

Optionally, the value check may be implemented by a predefined check function.

In this embodiment, the verification functions that can be supported include, but are not limited to, the types shown in table 2:

(Table 2)

In this embodiment, the uniqueness check is used to check whether the field value under a certain original field is unique.
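As a non-limiting illustration, the existence check and the uniqueness check might be sketched in Python as follows. The function names and the representation of the data as a list of dictionaries are assumptions made for this sketch, and field values are assumed to be hashable.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List


def missing_fields(record: Dict[str, Any], required: Iterable[str]) -> List[str]:
    """Existence check: names of required fields absent from the record."""
    return [name for name in required if name not in record]


def duplicate_values(records: List[Dict[str, Any]], field: str) -> List[Any]:
    """Uniqueness check: values that occur more than once under `field`."""
    counts = Counter(rec[field] for rec in records if field in rec)
    return [value for value, n in counts.items() if n > 1]
```

An empty result from either function means that the corresponding check passes.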

Optionally, some additional check rules may be set for each field value corresponding to each original field.

For example, the field value of the original field of "examination date" may be uniformly formatted into a data format of "yyyy-MM-dd".

For another example, an upper or lower limit may be defined for the field value of an original field (e.g., requiring that the field value be between 0 and 100).

In addition, a default substitute value may be defined for cases in which the field value of an original field is missing or erroneous; for example, when the field value of an original field is missing or erroneous, the default value "0" is filled in.
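As a non-limiting illustration, the three additional check rules above (date formatting, value limits and default substitution) might be sketched in Python as follows. The function names and the list of accepted input date formats are assumptions made for this sketch; the "yyyy-MM-dd" pattern corresponds to `%Y-%m-%d` in Python.

```python
from datetime import datetime
from typing import Any, Callable, Optional, Sequence


def normalize_date(value: Any,
                   input_formats: Sequence[str] = ("%Y/%m/%d", "%d.%m.%Y", "%Y-%m-%d"),
                   ) -> Optional[str]:
    """Reformat a date string as yyyy-MM-dd, or return None if unparseable."""
    for fmt in input_formats:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except (TypeError, ValueError):
            continue
    return None


def check_range(value: float, low: float = 0, high: float = 100) -> bool:
    """Value limit check: the field value must lie between low and high."""
    return low <= value <= high


def with_default(value: Any, is_valid: Callable[[Any], bool], default: Any = "0") -> Any:
    """Fill in the default when the field value is missing or erroneous."""
    if value is None or not is_valid(value):
        return default
    return value
```

For instance, `with_default(150, check_range)` would fill in the default because 150 lies outside the assumed 0 to 100 range.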

Step S204, performing mapping on the verified original data structure according to the original data mapping sub-rule in the original data cleaning rule to obtain a target data structure (also referred to as a primary cleaning data structure).

Optionally, the original data mapping sub-rule is used to define a mapping relationship between each original field in the original data structure and each target field in the target data structure.

Optionally, each verified original field may be mapped to its corresponding target field according to the original data mapping sub-rule to obtain the target data structure (i.e., the primary cleaning data structure).
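As a non-limiting illustration, the field mapping might be sketched in Python as follows. Representing the mapping sub-rule as a dictionary from target field names to original field names, and the function name `map_fields`, are assumptions made for this sketch.

```python
from typing import Any, Dict


def map_fields(original: Dict[str, Any], mapping: Dict[str, str]) -> Dict[str, Any]:
    """Build the target data structure from the verified original fields.

    `mapping` maps each target field name to an original field name
    (a hypothetical shape for the mapping sub-rule).
    """
    return {target: original[source]
            for target, source in mapping.items()
            if source in original}


primary = map_fields({"Name": "Li", "Age": 42, "tmp": 1},
                     {"name": "Name", "age": "Age"})
```

Original fields that no target field refers to, such as `tmp` here, are not carried over, which is one way redundant data can be dropped during cleaning.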

Optionally, the original data mapping sub-rule may define storage location information corresponding to original fields in the original data structure.

Optionally, the field name or field value of the original field corresponding to the storage location information may be obtained from the original data structure according to the storage location information, and used to fill in the field name or field value of the corresponding target field in the target data structure (i.e., the primary cleaning data structure).

Optionally, a node selector may be utilized to find corresponding raw data from the raw data structure according to the storage location information.

For example, a specified node may be selected by setting an absolute path or a relative path, and the node selector may be used together with predefined functions such as @fieldName and @get to obtain the field name or field value of the specified node. An absolute path searches for qualifying nodes starting from the root node of the original data structure; a relative path searches for parent or descendant nodes starting from a specified node. Illustrative examples are shown in Table 3 below:

(Table 3)

In this embodiment, the related representation of the predefined function is shown in the following table 4:

(Table 4)
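As a non-limiting illustration, node selection by absolute path and by descendant search might be sketched in Python as follows. The slash-separated path syntax and the function names `select` and `find_descendants` are assumptions made for this sketch; they do not reproduce the selector syntax or the predefined functions of Tables 3 and 4.

```python
from typing import Any, List


def select(root: Any, path: str) -> Any:
    """Absolute path: walk from the root node, e.g. '/patient/exams/1/date'."""
    node = root
    for key in path.strip("/").split("/"):
        if isinstance(node, list):
            node = node[int(key)]   # numeric segment indexes into an array
        else:
            node = node[key]        # other segments are dictionary keys
    return node


def find_descendants(node: Any, name: str) -> List[Any]:
    """Relative lookup: every descendant of `node` stored under key `name`."""
    found: List[Any] = []
    if isinstance(node, dict):
        for key, child in node.items():
            if key == name:
                found.append(child)
            found.extend(find_descendants(child, name))
    elif isinstance(node, list):
        for child in node:
            found.extend(find_descendants(child, name))
    return found


doc = {"patient": {"name": "Li",
                   "exams": [{"date": "2021-07-05"}, {"date": "2021-08-01"}]}}
second_date = select(doc, "/patient/exams/1/date")
all_dates = find_descendants(doc["patient"], "date")
```

The values returned by such a selector could then be used to fill the corresponding target fields.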

Step S206, determining whether target data matching the target data verification sub-rule exists in the target data structure; if so, executing step S2081; if not, executing step S2082.

In this embodiment, the target data cleaning rule includes at least a target data verification sub-rule.

Specifically, it may be determined whether target data matching the target data verification sub-rule exists in the target data structure (e.g., the primary cleaning data structure). If so, an analysis result is obtained that the target data structure needs to be cleaned, that is, the current target data structure still contains target data that has not been cleaned. If not, an analysis result is obtained that the target data structure does not need to be cleaned, that is, the target data in the current target data structure has been completely cleaned.

Step S2081, verifying, according to the target data verification sub-rule, the target fields in the target data structure (e.g., the primary cleaning data structure) that match the sub-rule, and executing step S210.

In this embodiment, the verification step for the target data structure (e.g., the primary cleaning data structure) is the same as the verification step for the original data structure; see step S202 for details, which are not repeated here.

Step S2082, outputting the current target data structure (e.g., the primary cleaning data structure) and ending the target data cleaning step.

Step S210, querying whether a target data mapping sub-rule exists in the target data cleaning rule; if so, executing step S2121; if not, executing step S2122.

Step S2121, performing mapping on the verified target data structure (e.g., the verified primary cleaning data structure) according to the queried target data mapping sub-rule to update the current target data structure (which may also be regarded as mapping the primary cleaning data structure to obtain a secondary cleaning data structure), and returning to step S206.

Specifically, after the mapping that produces the current target data structure (e.g., the secondary cleaning data structure) is completed, the process returns to step S206 to determine again whether target data matching the target data cleaning rule exists in the current target data structure, that is, whether the current target data structure still contains target data that has not been cleaned. The loop continues until no target data matching the target data cleaning rule exists in the target data structure.

In this embodiment, the mapping step for the target data structure (e.g., the primary cleaning data structure) is the same as the mapping step for the original data structure; see step S204 for details, which are not repeated here.

Step S2122, ending the target data cleaning step and outputting the verified target data structure (e.g., the verified primary cleaning data structure).
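As a non-limiting illustration, the branching of steps S206 to S2122 might be sketched in Python as follows. The function `clean_target`, the callable form of the sub-rules, the example rules, and the `max_rounds` safety cap are assumptions made for this sketch.

```python
from typing import Any, Callable, Dict, Optional

Record = Dict[str, Any]


def clean_target(target: Record,
                 needs_cleaning: Callable[[Record], bool],
                 verify: Callable[[Record], Record],
                 mapping: Optional[Callable[[Record], Record]] = None,
                 max_rounds: int = 100) -> Record:
    for _ in range(max_rounds):          # max_rounds is an added safety cap
        if not needs_cleaning(target):   # S206 -> S2082: output as is
            return target
        target = verify(target)          # S2081: verify matching fields
        if mapping is None:              # S210 -> S2122: no mapping sub-rule
            return target
        target = mapping(target)         # S2121: map, then back to S206
    raise RuntimeError("target data cleaning did not converge")


# Hypothetical rules: values must be trimmed, keys must be lower case.
def needs_cleaning(rec: Record) -> bool:
    return any(k != k.lower() or (isinstance(v, str) and v != v.strip())
               for k, v in rec.items())


def strip_values(rec: Record) -> Record:
    return {k: v.strip() if isinstance(v, str) else v for k, v in rec.items()}


def lower_keys(rec: Record) -> Record:
    return {k.lower(): v for k, v in rec.items()}


with_mapping = clean_target({"Score": " 90 "}, needs_cleaning, strip_values, lower_keys)
without_mapping = clean_target({"Score": " 90 "}, needs_cleaning, strip_values)
```

With a mapping sub-rule the sketch loops back to the check and returns the secondary cleaning data structure; without one it returns the verified primary cleaning data structure after a single pass.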

In summary, the data cleaning method provided in the embodiment of the present application can perform cleaning processing including field verification and field mapping on original data structures from different sources according to the data cleaning rule to obtain a target data structure with a uniform format, which not only facilitates uniform management of data, but also removes redundant data information in the original data structure through data cleaning, thereby saving data storage space.

Third embodiment

A third embodiment of the present application provides a computer storage medium having stored therein instructions for executing the steps of the data cleansing method in the first or second embodiment described above.

Fourth embodiment

FIG. 3 is a schematic diagram of a data cleansing apparatus according to a fourth embodiment of the present application. As shown, the data cleansing apparatus 300 of the embodiment of the present application mainly includes a rule obtaining module 302 and a data cleansing module 304.

The rule obtaining module 302 is configured to obtain an original data cleansing rule and a target data cleansing rule.

Optionally, the original data verification sub-rule is used to define the field name and field type of each original field in the original data structure.

Optionally, the original data mapping sub-rule further defines storage location information corresponding to each original field in the original data structure.

Optionally, the target data cleaning rule includes at least a target data verification sub-rule.

Optionally, the target data cleaning rule may further include a target data mapping sub-rule.

The data cleaning module 304 is configured to execute an original data cleaning step, in which cleaning is performed on original data in an original data structure according to the original data cleaning rule to obtain a target data structure including target data. The data cleaning module 304 is further configured to selectively execute a target data cleaning step: if target data matching the target data cleaning rule exists in the target data structure, cleaning is performed on the matching target data according to the target data structure and the target data cleaning rule, and the target data cleaning step is repeated until no target data matching the target data cleaning rule exists in the target data structure.

Optionally, the data cleaning module 304 is further configured to perform at least one of: an existence check on the field name of each original field; a field type check on the field type of each original field; a value check on the field value of each original field; and a uniqueness check on the field value of each original field.

Optionally, the target data structure comprises a primary cleaning data structure, and the data cleaning module 304 is further configured to verify the original data structure according to the original data verification sub-rule, and to perform mapping on the verified original data structure according to the original data mapping sub-rule to obtain the primary cleaning data structure.

Optionally, the data cleaning module 304 is further configured to map each verified original field to its corresponding target field according to the original data mapping sub-rule to obtain the primary cleaning data structure.

Optionally, the data cleaning module 304 is further configured to acquire, according to the storage location information, the field name or field value of the original field corresponding to the storage location information from the original data structure, and to fill it in as the field name or field value of the target field in the primary cleaning data structure.

Optionally, the target data structure further includes a secondary cleaning data structure, and the data cleaning module 304 is further configured to: according to the primary cleaning data structure and the target data cleaning rule, obtain an analysis result that the primary cleaning data structure needs to be cleaned if target data matching the target data cleaning rule exists in the primary cleaning data structure, and obtain an analysis result that the primary cleaning data structure does not need to be cleaned if no such target data exists; if the primary cleaning data structure needs to be cleaned, clean the target data in the primary cleaning data structure that matches the target data cleaning rule according to that rule; and if the primary cleaning data structure does not need to be cleaned, end the target data cleaning step and output the primary cleaning data structure.

Optionally, the data cleaning module 304 is further configured to verify, according to the target data verification sub-rule, the target fields in the primary cleaning data structure that match the sub-rule, wherein the verification step for the primary cleaning data structure is the same as the verification step for the original data structure.

Optionally, the data cleaning module 304 is further configured to: after verifying the matching target fields in the primary cleaning data structure, query the target data cleaning rule for the target data mapping sub-rule; if the target data mapping sub-rule is not found, end the target data cleaning step and output the verified primary cleaning data structure; and if the target data mapping sub-rule is found, perform mapping on the verified primary cleaning data structure according to the target data mapping sub-rule to obtain a secondary cleaning data structure, wherein the mapping step for the primary cleaning data structure is the same as the mapping step for the original data structure.

Optionally, the data cleaning module 304 is further configured to obtain, according to the target data cleaning rule and the secondary cleaning data structure, a judgment result that the secondary cleaning data structure needs or does not need to be cleaned, wherein the step of obtaining this judgment result is the same as the corresponding step for the primary cleaning data structure; and, when the result is that the secondary cleaning data structure needs to be cleaned, the cleaning step for the secondary cleaning data structure is the same as the cleaning step for the primary cleaning data structure.

In addition, the data cleaning apparatus of the embodiment of the present invention may also be used to implement other steps in the foregoing data cleaning method embodiments, and has the beneficial effects of the corresponding method step embodiments, which are not described herein again.

In summary, embodiments of the present application provide a technical solution for cleaning structured data, which can convert, supplement and map original data in an original data structure conforming to a cleaning rule to keep the original data consistent with a predefined target data structure, so as to facilitate uniform management of data, and meanwhile, in a data cleaning process, redundant data in the original data structure can be removed, so that a data storage space can be saved.

Moreover, the data cleaning method provided by the application can define different data cleaning rules based on different application scenes, and has the advantages of wide application range and strong expansibility.

Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the embodiments of the present application, and are not limited thereto; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims

1. A method for data cleansing, comprising:

acquiring an original data cleaning rule and a target data cleaning rule;

an original data cleaning step, wherein cleaning is performed on original data in an original data structure according to the original data cleaning rule to obtain a target data structure comprising target data;

and a target data cleaning step, in which, according to the target data structure and the target data cleaning rule, if target data matching the target data cleaning rule exists in the target data structure, cleaning is performed on the matching target data according to the target data cleaning rule, the target data cleaning step being repeated until no target data matching the target data cleaning rule exists in the target data structure.

2. The data cleaning method of claim 1, wherein the target data structure comprises a primary cleaning data structure, and the original data cleaning rule comprises at least an original data verification sub-rule and an original data mapping sub-rule;

wherein the original data cleaning step comprises:

verifying the original data structure according to the original data verification sub-rule;

and performing mapping on the verified original data structure according to the original data mapping sub-rule to obtain the primary cleaning data structure.

3. The data cleaning method according to claim 2, wherein the original data verification sub-rule is used for defining the field name and field type of each original field in the original data structure;

wherein verifying the original data structure according to the original data verification sub-rule comprises at least one of: performing an existence check on the field name of each original field; performing a field type check on the field type of each original field; performing a value check on the field value of each original field; and performing a uniqueness check on the field value of each original field.

4. The data cleaning method according to claim 3, wherein the original data mapping sub-rule is used to define a mapping relationship between each original field in the original data structure and each target field in the target data structure;

wherein performing mapping on the verified original data structure according to the original data mapping sub-rule to obtain the primary cleaning data structure comprises:

mapping each verified original field to its corresponding target field according to the original data mapping sub-rule to obtain the primary cleaning data structure.

5. The method of claim 4, wherein the raw data mapping sub-rule further defines storage location information corresponding to each raw field in the raw data structure, and the method further comprises:

and according to the storage position information, acquiring the field name or the field value of the original field corresponding to the storage position information from the original data structure, and filling the field name or the field value of the target field in the primary cleaning data structure.

6. The data cleaning method of claim 4, wherein the target data structure further comprises a secondary cleaning data structure, the target data cleaning step further comprising:

according to the primary cleaning data structure and the target data cleaning rule, if target data matching the target data cleaning rule exists in the primary cleaning data structure, obtaining an analysis result that the primary cleaning data structure needs to be cleaned, and if no target data matching the target data cleaning rule exists in the primary cleaning data structure, obtaining an analysis result that the primary cleaning data structure does not need to be cleaned; wherein

if the primary cleaning data structure needs to be cleaned, cleaning, according to the target data cleaning rule, the target data in the primary cleaning data structure that matches the target data cleaning rule;

and if the primary cleaning data structure does not need to be cleaned, ending the target data cleaning step and outputting the primary cleaning data structure.

7. The data cleaning method according to claim 6, wherein the target data cleaning rule comprises at least a target data verification sub-rule;

wherein cleaning the target data in the primary cleaning data structure according to the target data cleaning rule comprises:

verifying, according to the target data verification sub-rule, the target fields in the primary cleaning data structure that match the target data verification sub-rule;

wherein the verification step for the primary cleaning data structure is the same as the verification step for the original data structure.

8. The data cleaning method according to claim 7, wherein the target data cleaning rule optionally further comprises a target data mapping sub-rule;

wherein, after the step of verifying the matching target fields in the primary cleaning data structure, the method further comprises:

querying the target data cleaning rule for the target data mapping sub-rule;

if the target data mapping sub-rule is not found, ending the target data cleaning step and outputting the verified primary cleaning data structure;

if the target data mapping sub-rule is found, performing mapping on the verified primary cleaning data structure according to the target data mapping sub-rule to obtain a secondary cleaning data structure;

wherein the mapping step for the primary cleaning data structure is the same as the mapping step for the original data structure.
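The branch in claim 8, where verification always runs and mapping runs only when a mapping sub-rule is present, can be sketched as below. This is a sketch under stated assumptions: `clean_target_once` is a hypothetical name, the verification sub-rule is modeled as a predicate, failing records are simply dropped, and unmapped fields keep their names. None of these choices is taken from the source.

```python
from typing import Callable, Optional


def clean_target_once(records: list[dict],
                      check: Callable[[dict], bool],
                      mapping: Optional[dict[str, str]] = None) -> list[dict]:
    """Verify records, then map them only if a mapping sub-rule is present."""
    # Verification step (assumption: records that fail the check are dropped).
    verified = [r for r in records if check(r)]
    if mapping is None:
        # No mapping sub-rule found: output the verified structure as-is.
        return verified
    # Mapping sub-rule found: rename fields to build the secondary structure.
    return [{mapping.get(k, k): v for k, v in r.items()} for r in verified]
```

Passing `mapping=None` corresponds to the "sub-rule not found" branch, so the same function covers both outcomes of the query.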

9. The data cleaning method according to claim 8, further comprising:

obtaining, according to the target data cleaning rule and the secondary cleaning data structure, an analysis result that the secondary cleaning data structure needs cleaning or does not need cleaning; wherein

the step of obtaining the analysis result for the secondary cleaning data structure is the same as the step of obtaining the analysis result that the primary cleaning data structure needs cleaning or does not need cleaning; and

when the analysis result is that the secondary cleaning data structure needs cleaning, the cleaning step for the secondary cleaning data structure is the same as the cleaning step for the primary cleaning data structure.

10. A computer storage medium having stored therein instructions for carrying out the steps of the data cleaning method according to any one of claims 1 to 9.

11. A data cleaning device, comprising:

a rule obtaining module configured to obtain an original data cleaning rule and a target data cleaning rule; and

a data cleaning module configured to: execute an original data cleaning step that cleans original data in an original data structure according to the original data cleaning rule to obtain a target data structure comprising target data; selectively execute a target data cleaning step that, if target data matching the target data cleaning rule exists in the target data structure, cleans that target data according to the target data structure and the target data cleaning rule; and repeat the target data cleaning step until no target data matching the target data cleaning rule exists in the target data structure.
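The overall control flow of the data cleaning module (clean the original data once, then repeat target cleaning until nothing matches) might look like the following sketch. The representation of a target rule as a `(matches, apply)` pair, the name `run_cleaning`, and the `max_rounds` safety limit are assumptions added for illustration; the claims only require that the loop stop when no target data matches.

```python
from typing import Callable

Records = list[dict]
# Hypothetical rule shape: (does any data match?, how to clean it).
TargetRule = tuple[Callable[[Records], bool], Callable[[Records], Records]]


def run_cleaning(original: Records,
                 clean_original: Callable[[Records], Records],
                 target_rules: list[TargetRule],
                 max_rounds: int = 10) -> Records:
    """Clean the original data, then repeat target cleaning until no rule matches."""
    data = clean_original(original)              # original data cleaning step
    for _ in range(max_rounds):                  # guard against rules that never settle
        pending = [apply for matches, apply in target_rules if matches(data)]
        if not pending:                          # nothing matches: cleaning is done
            return data
        for apply in pending:                    # target data cleaning step
            data = apply(data)
    raise RuntimeError("target rules still match after max_rounds")
```

Repetition matters because one rule can expose work for another: stripping whitespace from names, for instance, may create duplicates that a de-duplication rule only detects on the next pass.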