CN112199366A

CN112199366A - Data table processing method, device and equipment

Info

Publication number: CN112199366A
Application number: CN202011183142.8A
Authority: CN
Inventors: 张俊鹏; 甘长华; 方薇; 汪发佳
Original assignee: Hangzhou Dt Dream Technology Co Ltd
Current assignee: Hangzhou Dt Dream Technology Co Ltd
Priority date: 2019-04-28
Filing date: 2019-04-28
Publication date: 2021-01-08
Also published as: CN109977110A; CN109977110B

Abstract

Embodiments of the invention provide a data table processing method, a data table processing device and data table processing equipment. The method sends target metadata information of a target data table in a designated database to data cleaning equipment, so that the data cleaning equipment copies the content of the target data table to an intermediate table in a data warehouse according to the target metadata information. After the copying is completed, a metadata modification instruction is sent to the data cleaning equipment, so that the data cleaning equipment modifies the table field specified attribute in the metadata information of the intermediate table according to the metadata modification instruction. After the modification is completed, a metadata confirmation instruction is sent to the data cleaning equipment, so that the data cleaning equipment confirms the metadata information of the intermediate table according to the metadata confirmation instruction. In this way, the primary key column of the intermediate table in the data warehouse can be modified to meet the requirements of the same data table in different service scenarios.

Description

Data table processing method, device and equipment

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a method, an apparatus, and a device for processing a data table.

Background

At present, big data has received widespread attention. Much valuable information can be obtained by analyzing big data. However, because of various influences during data recording and management, the data inevitably suffers from quality problems such as data loss, data duplication, non-compliant data, incomplete data, and expired data. Since the value of big data is positively correlated with its data quality, the data must be processed to improve its quality before high-value information can be extracted from it.

Data cleansing is one of the methods to improve the data quality of big data. In the related art, the data cleansing process is as follows: for each data table in the original database, a cleaning task is configured manually and written, as Structured Query Language (SQL) statements, into a hand-written SQL script corresponding to that data table; the corresponding SQL script is then run for each table to generate a standard table, which is stored in the original database. Because the cleaning task of every data table has to be configured manually, this process is time-consuming and therefore inefficient.

Disclosure of Invention

In order to overcome the problems in the related art, the invention provides a data table processing method, a data table processing device and data table processing equipment.

According to a first aspect of the embodiments of the present invention, there is provided a data table processing method, including:

sending target metadata information of a target data table in a specified database to data cleaning equipment so that the data cleaning equipment can copy the content of the target data table to an intermediate table in a data warehouse according to the target metadata information; the target metadata information includes a table field designation attribute;

after the copying is finished, sending a metadata modification instruction to the data cleaning equipment so that the data cleaning equipment can modify the table field specified attribute in the metadata information of the intermediate table according to the metadata modification instruction;

and after the modification, sending a metadata confirmation instruction to the data cleaning equipment so that the data cleaning equipment confirms the metadata information of the intermediate table according to the metadata confirmation instruction.
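The three messages of the first aspect can be sketched as follows. This is only an illustration under stated assumptions: `CleaningDeviceStub`, its method names, and the dictionary layout of the metadata are hypothetical and stand in for the data cleaning equipment, whose real interface is not specified here.

```python
from dataclasses import dataclass, field


@dataclass
class CleaningDeviceStub:
    """Hypothetical stand-in for the data cleaning equipment; it only records state."""
    intermediate_metadata: dict = field(default_factory=dict)
    confirmed: bool = False
    log: list = field(default_factory=list)

    def copy_table(self, target_metadata: dict) -> None:
        # Message 1: copy the table; the intermediate table starts with the same metadata.
        self.intermediate_metadata = {
            "table": target_metadata["table"],
            "field_attributes": dict(target_metadata["field_attributes"]),
        }
        self.log.append("copy")

    def modify_metadata(self, changes: dict) -> None:
        # Message 2: change specified attributes of the intermediate table only.
        self.intermediate_metadata["field_attributes"].update(changes)
        self.log.append("modify")

    def confirm_metadata(self) -> None:
        # Message 3: confirm the metadata of the intermediate table.
        self.confirmed = True
        self.log.append("confirm")


def process_table(device, target_metadata, changes):
    """Send the three messages in the order described above."""
    device.copy_table(target_metadata)
    device.modify_metadata(changes)
    device.confirm_metadata()
    return device.intermediate_metadata


target = {
    "table": "user_table",
    "field_attributes": {"id": "primary key", "id_card_number": None},
}
device = CleaningDeviceStub()
result = process_table(device, target, {"id": None, "id_card_number": "primary key"})
```

In this sketch the metadata of the target data table is left untouched; only the copy held for the intermediate table changes.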

According to a second aspect of the embodiments of the present invention, there is provided a data table processing apparatus, the apparatus including:

the metadata information sending module is used for sending target metadata information of a target data table in a specified database to the data cleaning equipment so that the data cleaning equipment can copy the content of the target data table to an intermediate table in a data warehouse according to the target metadata information; the target metadata information includes a table field designation attribute;

the metadata modification module is used for sending a metadata modification instruction to the data cleaning equipment after the copying is finished so that the data cleaning equipment can modify the table field specified attribute in the metadata information of the intermediate table according to the metadata modification instruction;

and the metadata confirmation module is used for sending a metadata confirmation instruction to the data cleaning equipment after the modification is finished so that the data cleaning equipment can confirm the metadata information of the intermediate table according to the metadata confirmation instruction.

According to a third aspect of embodiments of the present invention, there is provided a data table processing apparatus, comprising a processor and a memory for storing executable instructions of the processor;

the processor is configured to:

The technical scheme provided by the embodiment of the invention has the following beneficial effects:

the data table processing method provided by the embodiment of the invention sends the target metadata information of the target data table in the specified database to the data cleaning equipment, so that the data cleaning equipment copies the content of the target data table to an intermediate table in a data warehouse according to the target metadata information; after the copying is finished, a metadata modification instruction is sent to the data cleaning equipment, so that the data cleaning equipment modifies the table field specified attribute in the metadata information of the intermediate table according to the metadata modification instruction; and after the modification is finished, a metadata confirmation instruction is sent to the data cleaning equipment, so that the data cleaning equipment confirms the metadata information of the intermediate table according to the metadata confirmation instruction. The primary key column of the intermediate table in the data warehouse can thereby be modified so as to meet the requirements of the same data table in different service scenarios.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the specification.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present specification and together with the description, serve to explain the principles of the specification.

Fig. 1 is a diagram illustrating an application scenario of a data cleansing method according to an embodiment of the present invention.

Fig. 2 is a flowchart illustrating a data cleansing method according to an embodiment of the present invention.

Fig. 3 is a functional block diagram of a data cleansing apparatus according to an embodiment of the present invention.

Fig. 4 is a hardware configuration diagram of a data cleansing apparatus according to an embodiment of the present invention.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of embodiments of the invention, as detailed in the following claims.

The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of embodiments of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, third, etc. may be used to describe various information in embodiments of the present invention, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, the first information may also be referred to as second information, and similarly, the second information may also be referred to as first information, without departing from the scope of embodiments of the present invention. The word "if" as used herein may be interpreted as "at the time of", "when", or "in response to a determination", depending on the context.

Big data has important application value in many areas. Enterprise customers can make decisions based on the results of big data analysis, and these decisions help enterprises adjust their marketing strategies. For ordinary consumers, the demand for big data is mainly reflected in being able to search for information as needed and to receive friendly, credible information recommendations and higher-level services. For society as a whole, big data can bring a brand-new lifestyle, namely the smart city; building a smart city is a complex systems engineering project, and big data can provide the data support for it.

In order to obtain valuable information from the big data, the big data needs to be treated, and the data quality of the big data needs to be improved. Data cleaning is one of the ways of treating big data.

Data cleaning refers to discovering and correcting recognizable data errors in a data table. Data cleaning may include checking data consistency, handling invalid and missing values, and the like. The data in a data warehouse is a collection of data oriented to a certain subject; it is extracted from a plurality of business systems and contains historical data, so it is unavoidable that some data are wrong and some data conflict with each other. Such wrong or conflicting data are unwanted and are called dirty data. The task of data cleaning is to filter out the data that does not meet the requirements and to put the data that meets the requirements into a standard table, which is returned to the original database for storage. Data that does not meet the requirements mainly falls into three categories: incomplete data, erroneous data, and duplicate data.

At present, the data of many governments, organizations and enterprises is still stored in traditional transactional databases, such as MySQL, Oracle, and PostgreSQL databases. In the related art, when data cleansing is performed, a data warehouse is built directly on the original database by means of SQL scripts. On the one hand, this reduces the performance of the original database; on the other hand, it hinders systematic maintenance and tracking of the data. Moreover, a developer has to configure a cleaning task manually for each data table, and in most cases has to experiment repeatedly before it becomes clear which cleaning operations a data table needs, so the processing efficiency is very low. In addition, because the data tables differ and their cleaning tasks differ accordingly, cleaning tasks can only be configured for each table individually and cannot be configured for a plurality of data tables in batch, which further reduces the processing efficiency.

In practical applications, the amount of data to be processed is very huge, the objects to be cleaned are often tens of thousands of tables, hundreds of thousands of partitions and hundreds of TBs, and if the data are processed in a related art manner, a considerable number of developers are needed, so that the labor cost is high.

The embodiment of the invention provides a data cleaning method capable of automatically generating a cleaning task according to metadata information of a data table, the cleaning task does not need to be manually configured, the whole cleaning process is automatically carried out, the time required by the cleaning process is greatly shortened, and the processing efficiency is improved. And because the cleaning process does not need manual participation, the demands on development personnel can be reduced, and the labor cost is saved.

Fig. 1 is a diagram illustrating an application scenario of a data cleansing method according to an embodiment of the present invention. Referring to Fig. 1, server A hosts an original database 1, and server B is configured to create a data warehouse, for example using Hive (a data warehouse tool based on Hadoop), to copy the data tables in the original database 1 into the data warehouse, to perform cleaning operations on the data tables in the data warehouse, and to generate a standard table and/or a problem table. The standard table is used to store data that meets the requirements and/or data that does not meet the requirements (also referred to herein as problem data), and the problem table is used to store problem data.

The server B can perform data cleaning by using the data cleaning method flow provided by the embodiment of the present invention. Before the cleaning, the server B is configured with the following contents in advance:

1. Data source connection information, that is, the address of the device where the data source is located, a user name, a password, and the like. For example, in Fig. 1 the data source is the original database 1, and the data source connection information is the address of server A together with the user name and password for accessing the original database 1.

2. The correspondence between table fields and standard data elements; the correspondence between standard data elements and cleaning rules; the correspondence between table fields and specified attributes; and the correspondence between attributes and cleaning rules.

According to the first configuration, server B can access the original database 1 to extract the metadata information of the data table from the original database 1 and copy the contents of the data table to the data warehouse local to server B.

According to the second configuration content, the server B may automatically generate a cleaning task corresponding to the data table according to the extracted metadata information, so that the server B automatically performs a cleaning operation on the data table in the local data warehouse according to the generated cleaning task.
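The two pieces of pre-configured content could be held in a structure such as the one below. This is a minimal sketch: the key names, the connection values, and the helper `unresolved_entries` are hypothetical, and the sample correspondences are borrowed from the user table example used later in this description.

```python
# All names and values below are hypothetical placeholders.
CONFIG = {
    # First configuration content: how to reach the original database.
    "connection": {
        "address": "server-a.example:3306",
        "user": "reader",
        "password": "<secret>",
    },
    # Second configuration content: the four correspondences.
    "field_to_data_element": {
        "User identity card number": "Identity card number",
        "User name": "Character string length",
        "Age of the user": "Maximum and minimum value",
    },
    "data_element_to_rules": {
        "Identity card number": [
            "15-digit to 18-digit identity card conversion rule",
            "identity card validity filtering rule",
        ],
        "Character string length": ["character string length filtering rule"],
        "Maximum and minimum value": ["value range filtering rule"],
    },
    "field_to_attribute": {
        "User identity card number": "Primary key field",
        "User name": "Whether the value may be null",
    },
    "attribute_to_rules": {
        "Primary key field": ["deduplication rule"],
        "Whether the value may be null": ["null value filtering rule"],
    },
}


def unresolved_entries(config):
    """Return data elements and attributes that have no cleaning rule configured."""
    missing = []
    for element in config["field_to_data_element"].values():
        if element not in config["data_element_to_rules"]:
            missing.append(element)
    for attribute in config["field_to_attribute"].values():
        if attribute not in config["attribute_to_rules"]:
            missing.append(attribute)
    return missing
```

A check such as `unresolved_entries(CONFIG)` can be run before cleaning starts, so that a field whose data element or attribute has no rule is noticed early.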

The data cleaning method provided by the present invention is explained in detail by the following examples.

Fig. 2 is a flowchart illustrating a data cleansing method according to an embodiment of the present invention. As shown in fig. 2, the data cleansing method may include:

S201, extracting target metadata information of the target data table from the specified database, and copying the content of the target data table to an intermediate table in the data warehouse.

S202, determining a target cleaning task of the target data table according to the target metadata information, wherein the target cleaning task comprises at least one cleaning rule.

S203, cleaning the intermediate table according to each cleaning rule in the target cleaning task to obtain a target standard table, wherein the target standard table at least comprises data meeting requirements in the target data table.

S204, sending the target standard table to the equipment where the specified database is located, so that the equipment stores the target standard table in the specified database.
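The order of steps S201 to S204 can be sketched as a small driver function. The function name `run_cleaning`, its parameters, and the toy callables passed to it are assumptions made for illustration; each real step is described in detail later.

```python
def run_cleaning(table_name, extract, determine_task, clean, send_back):
    """Run S201 to S204 in order; each step is passed in as a callable."""
    metadata, intermediate_rows = extract(table_name)   # S201
    task = determine_task(metadata)                     # S202
    standard_rows = clean(intermediate_rows, task)      # S203
    send_back(table_name, standard_rows)                # S204
    return standard_rows


# Toy stand-ins so the skeleton can be exercised without a database.
stored = {}
result = run_cleaning(
    "user_table",
    extract=lambda name: ({"fields": ["age"]}, [{"age": "25"}, {"age": "abc"}]),
    determine_task=lambda meta: [lambda row: row["age"].isdecimal()],
    clean=lambda rows, task: [r for r in rows if all(rule(r) for rule in task)],
    send_back=lambda name, rows: stored.update({name: rows}),
)
```

Passing the steps in as callables keeps the ordering logic separate from database access, which makes the flow easy to exercise in isolation.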

The data cleaning method of the embodiment of the invention can be applied to the server B.

In step S201, the metadata information may include general metadata information, such as a table name, a field description, and the like of the data table, and may also include table field specifying attributes, such as whether a field is a primary key, whether a field may be empty, and the like.

The metadata information of each data table includes general metadata information, but does not necessarily include a table field specifying attribute. If one or more fields in some data tables have the designated attributes, the metadata information of the data tables comprises the designated attributes of the table fields; if all fields in some data tables do not have the specified attributes, the metadata information of the data tables does not include the table field specified attributes.

Here, the metadata information is exemplified. Table 1 is a user table, and the contents of table 1 are as follows:

TABLE 1 user table

User identity card number	User name	Age of the user
xxxxxxxxxxxxxxxxx1	Zhang San	25
xxxxxxxxxxxxxxxxx2	Li Si	36
……	……	……

The metadata information of table 1 is as follows:

Table name: user table;

Field name: User identity card number; type: string; length: 0 to 18;

Field name: User name; type: string; length: 0 to 20;

Field name: Age of the user; type: string; length: 0 to 20.
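As an illustration, the metadata information of Table 1 could be represented with the following data structure. The class and attribute names (`FieldMeta`, `TableMeta`, `specified_attribute`) are hypothetical; the field names, types, and lengths are those listed above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class FieldMeta:
    name: str
    type: str
    min_length: int
    max_length: int
    # Optional table field specified attribute, e.g. "primary key"; None if absent.
    specified_attribute: Optional[str] = None


@dataclass(frozen=True)
class TableMeta:
    table_name: str
    fields: Tuple[FieldMeta, ...]

    def has_specified_attributes(self) -> bool:
        """True if at least one field carries a specified attribute."""
        return any(f.specified_attribute is not None for f in self.fields)


user_table_meta = TableMeta(
    table_name="user table",
    fields=(
        FieldMeta("User identity card number", "string", 0, 18),
        FieldMeta("User name", "string", 0, 20),
        FieldMeta("Age of the user", "string", 0, 20),
    ),
)
```

With only general metadata present, as in the listing above, `has_specified_attributes()` returns `False`.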

In an exemplary implementation, the extracting the target metadata information of the target data table from the specified database in step S201 may include: calling a Java Database Connectivity (JDBC) interface according to the connection information of the designated database configured by the user, and establishing a connection with the designated database; and calling the DatabaseMetaData interface provided by the JDBC driver package, querying the columns of the target data table, and acquiring the table name, the field descriptions and the field specified attributes of the target data table.

For example, the field specifying attribute may be: whether it is a primary key, whether a field may be empty, etc.
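The kind of information obtained in this step can be illustrated with a small runnable sketch. It deliberately swaps in Python's built-in `sqlite3` module and SQLite's `PRAGMA table_info` as a stand-in for JDBC and its DatabaseMetaData interface; the function name `extract_metadata` and the table and column names are hypothetical.

```python
import sqlite3


def extract_metadata(conn, table_name):
    """Return the table name plus per-column type, nullability and primary-key flags."""
    # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
    # table_name is assumed to be a trusted identifier.
    rows = conn.execute(f"PRAGMA table_info({table_name})").fetchall()
    return {
        "table_name": table_name,
        "fields": [
            {
                "name": name,
                "type": col_type,
                "nullable": not notnull,
                "primary_key": pk > 0,
            }
            for _cid, name, col_type, notnull, _default, pk in rows
        ],
    }


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_table ("
    "id_card_number TEXT PRIMARY KEY NOT NULL, user_name TEXT, user_age TEXT)"
)
metadata = extract_metadata(conn, "user_table")
```

Here `nullable` and `primary_key` play the role of the field specified attributes, while the table name, column names, and types correspond to the general metadata information.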

In an exemplary implementation, the copying the content of the target data table to the intermediate table in the data warehouse in step S201 may include:

according to the target metadata information, using a data warehouse tool to locally create an intermediate database that has the same name as the designated database where the target data table is located, and creating, in the intermediate database, an intermediate table that has the same table name and the same table structure as the target data table;

and reading the content data stored in the target data table, and writing the read content data into the intermediate table.

Wherein, the data warehouse tool may be Hive.

It should be noted that only one intermediate database needs to be created for all the data tables in one original database.
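Creating the intermediate database and the intermediate table from metadata can be sketched as generating two HiveQL statements. The function name `hive_ddl`, the type mapping, and the database, table, and column names are assumptions; the sketch only builds the statement text and does not connect to Hive.

```python
# Hypothetical mapping from source column types to Hive column types.
TYPE_MAP = {"string": "STRING", "int": "INT"}


def hive_ddl(database, table, fields):
    """Build HiveQL for an intermediate database and a table with the same name and structure."""
    columns = ", ".join(f"`{name}` {TYPE_MAP[col_type]}" for name, col_type in fields)
    return [
        # IF NOT EXISTS: one intermediate database serves all tables of one original database.
        f"CREATE DATABASE IF NOT EXISTS `{database}`",
        f"CREATE TABLE IF NOT EXISTS `{database}`.`{table}` ({columns})",
    ]


statements = hive_ddl(
    "original_db",
    "user_table",
    [("id_card_number", "string"), ("user_name", "string"), ("user_age", "string")],
)
```

Because the database statement is idempotent, it can be issued for every copied table without creating more than one intermediate database.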

After all the contents of the target data table have been copied to the intermediate table, the metadata information of the intermediate table is also stored locally (e.g., in the aforementioned server B). When the metadata information does not include a table field specified attribute, the metadata information of the intermediate table is identical to the metadata information of the target data table. When the metadata information includes a table field specified attribute, there are two cases: if the user does not modify the table field specified attribute of the intermediate table, the metadata information of the intermediate table is still completely the same as the metadata information of the target data table; if the user modifies the table field specified attribute of the intermediate table, the general metadata information of the intermediate table is the same as the general metadata information of the target data table, but the table field specified attribute in the metadata information of the intermediate table differs from the table field specified attribute in the metadata information of the target data table.

In the data cleaning equipment (e.g., the aforementioned server B), the general metadata information of a table may be stored separately from the table field specified attributes. The data cleaning equipment stores the general metadata information of the target data table, the general metadata information of the intermediate table, the table field specified attributes of the target data table, and the table field specified attributes of the intermediate table. The table field specified attributes are stored in the correspondence between table fields and specified attributes.

In an exemplary implementation, after copying the contents of the target data table to an intermediate table in the data warehouse, the metadata information may also be confirmed according to a metadata confirmation instruction of the user. At this time, the user does not modify the table field specifying attributes of the table.

In an exemplary implementation process, after copying the content of the target data table to an intermediate table in the data warehouse, the table field specifying attribute in the metadata information of the intermediate table may be modified according to a metadata modification instruction of a user; and after the modification, confirming the metadata according to the metadata confirmation instruction of the user.

In many conventional transactional databases, many data tables use auto-increment primary keys. Such primary keys have no business meaning or value, whereas non-primary-key fields in the data tables, such as certificate numbers, taxpayer numbers, and invoice numbers, are the master data of the industry. In this case, the user can modify the primary key column of the intermediate table in the data warehouse according to this example. As a special case, when the same original data table is referenced in different modeling operations, the fields used as the primary key may differ, so a scene identification bit (GroupId) can be added to the primary key configuration to distinguish different application scenarios.
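A primary key configuration that carries a scene identification could look like the sketch below. The dictionary layout, the group identifiers, and the table and column names are hypothetical; only the idea of keying the primary key by (table, GroupId) comes from the paragraph above.

```python
# Hypothetical primary key configuration keyed by (table name, GroupId).
PRIMARY_KEYS = {
    ("invoice_table", "default"): ["id"],                        # auto-increment key
    ("invoice_table", "tax_modeling"): ["taxpayer_number"],
    ("invoice_table", "invoice_modeling"): ["invoice_number"],
}


def primary_key_for(table, group_id="default"):
    """Return the primary key columns for a table in a given application scenario."""
    # Fall back to the table's default key when the scenario has no override.
    return PRIMARY_KEYS.get((table, group_id), PRIMARY_KEYS[(table, "default")])
```

For example, `primary_key_for("invoice_table", "tax_modeling")` selects the taxpayer number, while an unknown scenario falls back to the default key.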

Besides the primary key, the user may also use the metadata modification instruction to set designated columns of the intermediate table as not null.

In an exemplary implementation, reading the content data stored in the target data table and writing the read content data into the intermediate table may include: extracting the data in the target data table through a JDBC data source (JDBC Data Source), and importing the extracted data into the intermediate table through a HiveContext INSERT INTO SQL statement. The data copying function may be implemented by a Spark application (App).

In step S202, each data table corresponds to one cleaning task, and different data tables correspond to different cleaning tasks.

At least one cleaning rule is included in the cleaning task. The cleansing rule is used for indicating an operation mode for cleansing the data table.

For example, the cleansing rules may be non-numeric value filtering rules, date normalization rules, value range filtering rules, and the like.

In an exemplary implementation, step S202 may include:

searching target data elements matched with all fields in the target metadata information from the established corresponding relation between the table fields and the standard data elements;

searching a first cleaning rule matched with the target data element from the established corresponding relation between the standard data element and the cleaning rule;

and generating a cleaning task according to the first cleaning rule.

The data elements can be industry standard data elements specified by the state, or data elements that are proprietary to an enterprise or government body and defined according to historical data and business scenarios. A data element defines the requirements that a field should satisfy, such as data type, data format, and value range; these requirements are ultimately resolved into standardized cleaning rules through the correspondence between standard data elements and cleaning rules.

For example, the above metadata information of Table 1 includes 3 fields: user identity card number, user name, and age of the user. Assume that the correspondence between table fields and standard data elements is as shown in Table 2:

TABLE 2 correspondence of Table fields to Standard data elements

Table field	Standard data element
User identity card number	Identity card number
User name	Character string length
Age of the user	Maximum and minimum value
……	……

Then, it can be found from table 2 that the standard data element corresponding to the field "user identification number" is "identification number", the standard data element corresponding to the field "user name" is "character string length", and the standard data element corresponding to the field "user age" is "maximum and minimum value".

It is assumed that the standard data elements correspond to the cleansing rules as shown in table 3.

TABLE 3 correspondence of standard data elements to cleaning rules

It can be found from Table 3 that the cleaning rules corresponding to the standard data element "identity card number" are the "15-digit to 18-digit identity card conversion rule" and the "identity card validity filtering rule", the cleaning rule corresponding to the standard data element "character string length" is the "character string length filtering rule", and the cleaning rule corresponding to the standard data element "maximum and minimum value" is the "value range filtering rule".

The cleaning task corresponding to Table 1 therefore includes 4 first cleaning rules: the 15-digit to 18-digit identity card conversion rule, the identity card validity filtering rule, the character string length filtering rule, and the value range filtering rule.
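The two-stage lookup described above (table field to standard data element, then data element to cleaning rules) can be sketched as follows. The dictionaries mirror the example correspondences given in the text; the function name `first_cleaning_rules` and the (field, rule) pair representation are assumptions.

```python
FIELD_TO_ELEMENT = {
    "User identity card number": "Identity card number",
    "User name": "Character string length",
    "Age of the user": "Maximum and minimum value",
}

ELEMENT_TO_RULES = {
    "Identity card number": [
        "15-digit to 18-digit identity card conversion rule",
        "identity card validity filtering rule",
    ],
    "Character string length": ["character string length filtering rule"],
    "Maximum and minimum value": ["value range filtering rule"],
}


def first_cleaning_rules(field_names):
    """Resolve each field to its data element, then to that element's cleaning rules."""
    rules = []
    for field_name in field_names:
        element = FIELD_TO_ELEMENT.get(field_name)
        # Fields without a matching data element contribute no rule.
        for rule in ELEMENT_TO_RULES.get(element, []):
            rules.append((field_name, rule))
    return rules


task = first_cleaning_rules(
    ["User identity card number", "User name", "Age of the user"]
)
```

Each rule is kept together with the field it applies to, so that the later cleaning step knows which column to read for which rule.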

The cleansing rules may include two types: filtering rules and normalizing rules. For example, the standardization rule includes a date standardization rule, a synonym standardization rule, an identification card standardization rule, and the like.
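To make the two rule types concrete, the sketch below pairs one standardization rule with one filtering rule for identity card numbers. The text does not spell out how these rules work internally; the sketch assumes the widely used GB 11643-1999 check digit scheme, assumes that 15-digit numbers belong to people born in the 1900s, and checks only length and check digit. The function names are hypothetical.

```python
# Position weights and check characters of the GB 11643-1999 scheme (assumed here).
WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CHARS = "10X98765432"


def check_char(first17):
    """Compute the 18th character from the first 17 digits."""
    total = sum(int(digit) * weight for digit, weight in zip(first17, WEIGHTS))
    return CHECK_CHARS[total % 11]


def to_18_digits(id_number):
    """Standardization rule: convert a 15-digit number; leave anything else unchanged."""
    if len(id_number) == 15 and id_number.isdecimal():
        body = id_number[:6] + "19" + id_number[6:]  # restore the century of the birth year
        return body + check_char(body)
    return id_number


def is_valid_id(id_number):
    """Filtering rule: an 18-character number is legal if its check character matches."""
    if len(id_number) != 18 or not id_number[:17].isdecimal():
        return False
    return id_number[17].upper() == check_char(id_number[:17])
```

A standardization rule rewrites a value into the standard form, whereas a filtering rule only decides whether a value is legal; applying `to_18_digits` before `is_valid_id` lets old 15-digit numbers pass the validity check.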

In an exemplary implementation process, the target metadata information includes a table field specifying attribute, then step S202 may further include:

searching out target attributes matched with all fields in the target metadata information from the established corresponding relation between the table fields and the designated attributes;

searching a second cleaning rule matched with the target attribute from the established corresponding relation between the attribute and the cleaning rule;

in this case, generating the target cleaning task of the target data table according to the first cleaning rule includes: generating the target cleaning task of the target data table according to the first cleaning rule and the second cleaning rule.

For example, assume that the correspondence of table fields to specified attributes is as shown in table 4.

TABLE 4 correspondence of Table fields to specified attributes

Table field	Specified attribute
User identity card number	Primary key field
User name	Whether the value may be null
Age of the user	None
……	……

Then, it can be found from Table 4 that the specified attribute corresponding to the field "user identity card number" in Table 1 is "primary key field", and the specified attribute corresponding to the field "user name" in Table 1 is "whether the value may be null". That is, Table 1 has 2 target attributes: "primary key field" and "whether the value may be null".

It is assumed that the correspondence of the attributes to the cleansing rules is shown in table 5.

TABLE 5 correspondence of attributes to cleaning rules

From Table 5, it can be found that the cleaning rule corresponding to the attribute "primary key field" is the "deduplication rule", and the cleaning rule corresponding to the attribute "whether the value may be null" is the "null value filtering rule".

The cleaning task corresponding to table 1 further includes 2 second cleaning rules: a deduplication rule and a null filtering rule.

Thus, the cleaning task corresponding to Table 1 finally includes 6 cleaning rules: the 15-digit to 18-digit identity card conversion rule, the identity card validity filtering rule, the character string length filtering rule, the value range filtering rule, the deduplication rule, and the null value filtering rule.
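How the second cleaning rules are looked up and merged with the first cleaning rules can be sketched as follows. The dictionaries mirror the example correspondences between fields, specified attributes, and rules; the function names and the list of first rules passed in are assumptions for illustration.

```python
FIELD_TO_ATTRIBUTE = {
    "User identity card number": "Primary key field",
    "User name": "Whether the value may be null",
    "Age of the user": None,  # this field has no specified attribute
}

ATTRIBUTE_TO_RULE = {
    "Primary key field": "deduplication rule",
    "Whether the value may be null": "null value filtering rule",
}


def second_cleaning_rules(field_names):
    """Resolve each field's specified attribute, then the rule for that attribute."""
    rules = []
    for field_name in field_names:
        attribute = FIELD_TO_ATTRIBUTE.get(field_name)
        if attribute in ATTRIBUTE_TO_RULE:
            rules.append((field_name, ATTRIBUTE_TO_RULE[attribute]))
    return rules


def target_cleaning_task(first_rules, field_names):
    """The target cleaning task is the first rules followed by the second rules."""
    return list(first_rules) + second_cleaning_rules(field_names)


first_rules = [
    ("User identity card number", "15-digit to 18-digit identity card conversion rule"),
    ("User identity card number", "identity card validity filtering rule"),
    ("User name", "character string length filtering rule"),
    ("Age of the user", "value range filtering rule"),
]
task = target_cleaning_task(first_rules, list(FIELD_TO_ATTRIBUTE))
```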

It can be seen that, in step S202, the cleaning rules in the cleaning task are obtained automatically from the extracted metadata information. The user does not need to drag cleaning rules around on a human-computer interaction interface or configure the cleaning task through SQL commands, so the whole cleaning process can run automatically and a plurality of data tables can be cleaned in batch, which is very flexible.

It should be noted that the above-mentioned correspondence between table fields and designated attributes is the correspondence between table fields and designated attributes of the intermediate table. And under the condition that the user modifies the table field specified attribute of the intermediate table, the established corresponding relation between the table field and the specified attribute refers to the corresponding relation between the modified table field and the specified attribute.

Therefore, in an exemplary implementation process, before finding out the target attribute matching each field in the target metadata information from the established correspondence between the table field and the specified attribute, the method further includes:

receiving attribute modification information of a specified field in the target metadata information, wherein the attribute modification information is used for indicating that the specified attribute of the specified field is modified into a specified attribute;

modifying the designated attribute of the designated field in the intermediate table according to the attribute modification information;

updating the corresponding relation between the table field corresponding to the intermediate table and the designated attribute according to the modification of the designated attribute of the designated field;

the finding out the target attribute matched with each field in the target metadata information from the established corresponding relation between the table field and the designated attribute comprises: and searching the target attribute matched with each field in the target metadata information from the corresponding relation between the updated table field and the specified attribute.
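The effect of attribute modification information on the subsequent lookup can be sketched as follows. The function names, the (field, new attribute) representation of the modification information, and the "Serial number" field are hypothetical.

```python
def apply_modifications(correspondence, modifications):
    """Return an updated copy of the intermediate table's field-to-attribute correspondence."""
    updated = dict(correspondence)
    for field_name, new_attribute in modifications:
        updated[field_name] = new_attribute
    return updated


def target_attributes(field_names, correspondence):
    """Look up the specified attribute of each field, skipping fields that have none."""
    return {
        name: correspondence[name]
        for name in field_names
        if correspondence.get(name) is not None
    }


# Correspondence as copied from the target data table: an auto-increment key.
copied = {"Serial number": "Primary key field", "User identity card number": None}

# The user moves the primary key to the identity card number.
modifications = [
    ("Serial number", None),
    ("User identity card number", "Primary key field"),
]
updated = apply_modifications(copied, modifications)
attributes = target_attributes(list(updated), updated)
```

The lookup is performed on the updated correspondence, so the deduplication rule that follows from "Primary key field" would apply to the identity card number rather than to the serial number.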

In step S203, the intermediate table in the data warehouse is cleaned, and the content of the intermediate table is the same as that of the target data table, so that performing a cleaning operation on the intermediate table is equivalent to performing a cleaning operation on the target data table, and the cleaning result obtained on the intermediate table is the cleaning result of the target data table.

Cleaning the intermediate table according to each cleaning rule in the cleaning task means cleaning each row of data in the intermediate table with each cleaning rule in the cleaning task in turn. If a row is judged to be dirty data according to any cleaning rule, the row is written into a problem table, or written into the standard table together with a problem identification field; if the row is judged not to be dirty data according to all the cleaning rules, the row is written into the standard table.

In one example, the cleaning process for the target row of the intermediate table corresponding to the target data table may be:

for each cleaning rule in the target cleaning task, reading data of a field corresponding to the cleaning rule from a target row of the intermediate table;

if the cleaning rule is a filtering rule, determining whether the read data is legal or not according to the cleaning rule; if the cleaning rule is a standardization rule, determining whether the read data needs to be standardized according to the cleaning rule;

if all the filtering rules in the target cleaning task determine that the data is legal and the data of the target row is standardized by all the standardized rules in the target cleaning task or determined not to be required to be standardized, determining that the data of the target row is not dirty data and writing the data of the target row into a standard table;

and if any filtering rule in the target cleaning task determines that the data is illegal, determining that the data of the target row is dirty data, and writing the data of the target row into a problem table, or writing the data of the target row carrying a problem identification field into the standard table.
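The per-row process above might be sketched as follows. The `Rule` structure, the field names, and the two example rules are assumptions made for illustration; the sketch uses the problem-table variant and does not model a standardization rule that fails to convert its input.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    """Hypothetical cleaning rule bound to one field of the intermediate table."""
    field: str
    kind: str                     # "filter" or "standardize"
    check: Callable[[str], bool]  # filter: is the value legal? standardize: is conversion needed?
    convert: Optional[Callable[[str], str]] = None


def clean_row(row, rules, standard_table, problem_table):
    """Apply every rule to one row; return True if the row is not dirty data."""
    row = dict(row)  # work on a copy so the intermediate table is left unchanged
    for rule in rules:
        value = row[rule.field]
        if rule.kind == "filter":
            if not rule.check(value):
                # Any filtering rule that rejects the value makes the row dirty data.
                problem_table.append(row)
                return False
        elif rule.kind == "standardize":
            if rule.check(value):
                row[rule.field] = rule.convert(value)
    standard_table.append(row)
    return True


# Example rules (assumed): phone numbers must be digits, names are trimmed.
rules = [
    Rule("phone", "filter", str.isdigit),
    Rule("name", "standardize", lambda v: v != v.strip(), str.strip),
]
standard_table, problem_table = [], []
clean_row({"phone": "12345", "name": " Li "}, rules, standard_table, problem_table)
clean_row({"phone": "12a45", "name": "Wu"}, rules, standard_table, problem_table)
```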

For example, the cleaning task corresponding to table 1 above includes 6 cleaning rules. The cleaning process is described by taking the first row of data in table 1, the identity card validity filtering rule, and the rule of converting a 15-digit identity card number into 18 digits as examples:

reading the data 'xxxxxxxxxxxxxxx 1' in the user identity number field of the first row in table 1, and judging whether 'xxxxxxxxxxxxxxx 1' is legal according to the identity card validity filtering rule; if it is legal, executing the next operation; if it is illegal, writing all data of the first row into the problem table and ending the cleaning of the first row;

reading the data 'xxxxxxxxxxxxxxx 1' in the user identity number field of the first row in table 1, and judging whether 'xxxxxxxxxxxxxxx 1' needs to be converted according to the rule of converting a 15-digit identity card number into 18 digits; if so, converting 'xxxxxxxxxxxxxxx 1' into an 18-digit identity number and executing the next operation after the conversion; if the conversion is not needed, directly executing the next operation;

……
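The two rules used in this example can be illustrated with the following sketch. The embodiment does not define what the validity filtering rule checks, so the criteria here (15 digits, or 18 characters with a matching check character) are an assumption. The conversion follows the commonly used scheme of inserting the century "19" after the sixth digit and appending a check character computed with the GB 11643-1999 weights; the fixed "19" prefix is also an assumption.

```python
import re

# Weights and check characters of the 18-digit citizen identity number.
WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CODES = "10X98765432"


def check_char(first17):
    """Compute the check character for the first 17 digits."""
    total = sum(int(d) * w for d, w in zip(first17, WEIGHTS))
    return CHECK_CODES[total % 11]


def is_legal_id(value):
    """Assumed filtering rule: 15 digits, or 18 characters with a valid check character."""
    if re.fullmatch(r"\d{15}", value):
        return True
    if re.fullmatch(r"\d{17}[\dX]", value):
        return check_char(value[:17]) == value[17]
    return False


def to_18_digits(value):
    """Standardization rule: convert a 15-digit number to 18 digits, else return it unchanged."""
    if len(value) != 15:
        return value  # conversion is not needed
    first17 = value[:6] + "19" + value[6:]  # assumes a birth year in the 1900s
    return first17 + check_char(first17)


converted = to_18_digits("110105491231002")
```

A row would pass through `is_legal_id` first and, only if judged legal, through `to_18_digits`, matching the order of the two operations described above.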

In step S204, the target standard table may be sent to the device where the specified database is located by another Spark App. For example, in one example, step S204 may include:

data in a standard table of the data warehouse is extracted through a HiveContext Select SQL statement, and the extracted data is written into the specified database through an Insert into SQL statement of a JDBC (Java Database Connectivity) data source.
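The select-then-insert pattern of this step can be sketched as follows. To keep the example self-contained, two in-memory SQLite connections stand in for the Hive data warehouse and for the specified database reached over JDBC; this substitution, the table name `user_standard`, its columns, and the function name are all assumptions, and a Spark deployment would issue the equivalent statements through HiveContext and the JDBC data source instead.

```python
import sqlite3


def export_standard_table(warehouse, target_db, table, columns):
    """Copy every row of `table` from the warehouse connection to the target connection."""
    column_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    # Step 1: extract the standard table with a SELECT statement.
    rows = warehouse.execute(f"SELECT {column_list} FROM {table}").fetchall()
    # Step 2: write the extracted rows with an INSERT INTO statement.
    target_db.executemany(
        f"INSERT INTO {table} ({column_list}) VALUES ({placeholders})", rows
    )
    target_db.commit()
    return len(rows)


# Stand-in for the data warehouse holding the standard table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE user_standard (id INTEGER, name TEXT)")
warehouse.executemany("INSERT INTO user_standard VALUES (?, ?)", [(1, "Li"), (2, "Wu")])

# Stand-in for the specified database that receives the standard table.
target_db = sqlite3.connect(":memory:")
target_db.execute("CREATE TABLE user_standard (id INTEGER, name TEXT)")

exported = export_standard_table(warehouse, target_db, "user_standard", ["id", "name"])
```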

After step S204, at least one table is added to the specified database: the target standard table. When the problem data is written into a problem table, the problem table is also sent to the specified database, so that two tables are added to the specified database: the target standard table and the problem table.

In one exemplary implementation, the number of target data tables is at least one.

This example allows a plurality of data tables to be cleaned in batches, which effectively shortens the overall time needed to clean the target data and further improves the processing efficiency of the cleaning process.

In an exemplary implementation, the data cleansing method may further include:

in the process of cleaning the intermediate table, writing problem data in the intermediate table and a problem identification field corresponding to the problem data into a standard table; or,

and writing the problem data in the intermediate table into the problem table in the process of performing the cleaning operation on the intermediate table.

In this example, two processing methods are provided for problem data: first, writing the problem data into a problem table; second, writing the problem data, together with its problem identification fields, into the standard table.

In one example, there may be two problem identification fields: a dirty flag (dirty_flag) and a dirty type (dirty_type).

The value range of dirty_flag is 0 and 1. dirty_flag is set to 1 both for data judged to be problem data by a filtering rule and for data that a standardization rule failed to convert.

The value of dirty_type is a 128-bit binary character string that defaults to all 0s. Each character takes the value 0 or 1, and the characters are indexed 0 to 127 from left to right. Each character of dirty_type corresponds to a rule: for example, a character set to 1 indicates that the corresponding rule determined the value to be illegal, and a character set to 0 indicates that the corresponding rule determined the value to be legal.
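A minimal sketch of how the two problem identification fields could be built is shown below. The assignment of rule indexes to character positions and the function names are assumptions made for illustration.

```python
DIRTY_TYPE_BITS = 128


def mark_rule_failed(dirty_type, rule_index):
    """Set the character at rule_index (0 = leftmost) to '1'."""
    if not 0 <= rule_index < DIRTY_TYPE_BITS:
        raise ValueError("rule index must be between 0 and 127")
    return dirty_type[:rule_index] + "1" + dirty_type[rule_index + 1:]


def build_problem_fields(failed_rule_indexes):
    """Return (dirty_flag, dirty_type) for the rules that rejected a row."""
    dirty_type = "0" * DIRTY_TYPE_BITS  # default: all characters are 0
    for index in failed_rule_indexes:
        dirty_type = mark_rule_failed(dirty_type, index)
    dirty_flag = 1 if failed_rule_indexes else 0
    return dirty_flag, dirty_type


# Example: the rules assumed to occupy positions 0 and 5 judged the row illegal.
flag, dtype = build_problem_fields([0, 5])
```

Keeping dirty_type as a fixed-length string means the failing rules of any row can be read back by position without consulting the cleaning task again.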

The processing of the problem data is not performed in the data warehouse.

When a user needs to process problem data, the problem data can be queried in the following ways:

in the first mode, when the problem data is stored in the problem table, the data in the problem table may be queried through a "Select * from problem table" command;

in the second mode, when the problem data is stored in the standard table, the problem data in the standard table may be queried through a "Select * from standard table where dirty_flag = '1'" command.
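The two query modes can be tried out with the following sketch, which uses an in-memory SQLite database purely as a stand-in. The table names `problem_table` and `standard_table`, their columns, and the sample rows are assumptions made for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE problem_table (id INTEGER, phone TEXT)")
db.execute("CREATE TABLE standard_table (id INTEGER, phone TEXT, dirty_flag TEXT)")
db.execute("INSERT INTO problem_table VALUES (3, '12a45')")
db.executemany(
    "INSERT INTO standard_table VALUES (?, ?, ?)",
    [(1, "12345", "0"), (2, "9x", "1")],
)

# First mode: the problem data is stored in its own problem table.
problem_rows = db.execute("SELECT * FROM problem_table").fetchall()

# Second mode: the problem data is stored in the standard table and is
# distinguished from clean rows by the dirty flag.
flagged_rows = db.execute(
    "SELECT * FROM standard_table WHERE dirty_flag = '1'"
).fetchall()
```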

In an exemplary implementation, the data cleansing method may further include:

in the process of cleaning the intermediate table, sending the problem data in the intermediate table to a problem data backtracking device so as to modify the problem data by the problem data backtracking device to obtain corrected data;

and writing the corrected data into the target standard table.

In this example, the corrected data is data that meets the requirements, obtained by modifying the problem data. For example, if the identity card number in a certain record has 15 digits, while the identity card number in standard data meeting the requirements has 18 digits, the record is modified by changing the identity card number to 18 digits to obtain the corrected data.

The data cleaning method provided by the embodiment of the invention extracts the target metadata information of the target data table from the specified database and copies the content of the target data table to the intermediate table in the data warehouse. It determines the target cleaning task of the target data table according to the target metadata information, the target cleaning task comprising at least one cleaning rule. In the data warehouse, it performs a cleaning operation on the intermediate table according to each cleaning rule in the target cleaning task to obtain the target standard table corresponding to the target data table, and sends the target standard table to the equipment where the specified database is located, so that the equipment stores the target standard table in the specified database. The cleaning task of a data table can thus be generated automatically from the extracted metadata information, and the cleaning of the data table can be completed automatically according to the cleaning task, so that the whole cleaning process is performed without manual intervention. All data tables in a database or a system can also be cleaned in batch, which effectively shortens the time required by the cleaning process and improves the processing efficiency of data cleaning.

Based on the above method embodiment, the embodiment of the present invention further provides corresponding apparatus, device, and storage medium embodiments. For detailed implementation of the embodiments of the apparatus, device and storage medium of the embodiments of the present invention, please refer to the corresponding descriptions in the foregoing method embodiments.

FIG. 3 is a functional block diagram of a data cleansing apparatus according to an embodiment of the present invention. As shown in fig. 3, in this embodiment, the data cleansing apparatus may include:

an extraction and copy module 310, configured to extract target metadata information of a target data table from a specified database, and copy contents of the target data table to an intermediate table in a data warehouse;

a task determining module 320, configured to determine a target cleaning task of the target data table according to the target metadata information, where the target cleaning task includes at least one cleaning rule;

a cleaning module 330, configured to perform a cleaning operation on the intermediate table according to each cleaning rule in the target cleaning task to obtain a target standard table;

a sending module 340, configured to send the target standard table to the device where the specified database is located, so that the device stores the target standard table in the specified database.
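How the four modules cooperate can be sketched as follows. The class name, the callable-based wiring, and the stub implementations are assumptions made for illustration; the embodiment does not prescribe how the modules are implemented.

```python
class DataCleansingApparatus:
    """Chains the four modules in the order 310, 320, 330, 340."""

    def __init__(self, extract_and_copy, determine_task, clean, send):
        self.extract_and_copy = extract_and_copy  # extraction and copy module 310
        self.determine_task = determine_task      # task determining module 320
        self.clean = clean                        # cleaning module 330
        self.send = send                          # sending module 340

    def run(self, table_name):
        metadata, intermediate = self.extract_and_copy(table_name)
        task = self.determine_task(metadata)
        standard = self.clean(intermediate, task)
        self.send(table_name, standard)
        return standard


# Stub modules (hypothetical) so that the flow can be executed end to end.
sent = {}


def extract_and_copy(table_name):
    metadata = {"fields": ["name"]}
    intermediate = [{"name": " Li "}, {"name": "Wu"}]
    return metadata, intermediate


def determine_task(metadata):
    # One trimming rule per field stands in for the real rule lookup.
    return [(field, str.strip) for field in metadata["fields"]]


def clean(intermediate, task):
    standard = []
    for row in intermediate:
        cleaned = dict(row)
        for field, rule in task:
            cleaned[field] = rule(cleaned[field])
        standard.append(cleaned)
    return standard


def send(table_name, standard):
    sent[table_name] = standard


apparatus = DataCleansingApparatus(extract_and_copy, determine_task, clean, send)
result = apparatus.run("user")
```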

In an exemplary implementation, the task determination module 320 is specifically configured to:

searching out a first cleaning rule matched with the target data element from the established corresponding relation between the standard data element and the cleaning rule;

and generating a target cleaning task of the target data table according to the first cleaning rule.

In an exemplary implementation, the target metadata information includes a table field specifying attribute, and the task determination module 320 is further configured to:

searching out a second cleaning rule matched with the target attribute from the established corresponding relation between the attribute and the cleaning rule;

when the task determining module 320 is configured to generate the target cleaning task of the target data table according to the first cleaning rule, specifically: and generating a target cleaning task of the target data table according to the first cleaning rule and the second cleaning rule.

In one exemplary implementation, the apparatus further comprises:

a modification information receiving module, configured to receive attribute modification information for a specified field in the target metadata information, where the attribute modification information is used to indicate that the specified attribute of the specified field is to be modified into a new specified attribute;

the attribute modification module is used for modifying the specified attribute of the specified field in the intermediate table according to the attribute modification information;

the relation updating module is used for updating the corresponding relation between the table field corresponding to the intermediate table and the specified attribute according to the modification of the specified attribute of the specified field;

the task determining module 320 is specifically configured to, when the task determining module is configured to find out a target attribute matching each field in the target metadata information from the corresponding relationship between the table field and the specified attribute that is already established: and searching the target attribute matched with each field in the target metadata information from the corresponding relation between the updated table field and the specified attribute.

In one exemplary implementation, the apparatus further comprises:

the data writing module is used for writing the problem data in the intermediate table and the problem identification field corresponding to the problem data into the standard table in the process of executing cleaning operation on the intermediate table; or, the method is used for writing the problem data in the intermediate table into the problem table in the process of executing the cleaning operation on the intermediate table.

In one exemplary implementation, the apparatus further comprises:

and the problem table sending module is used for sending the problem table to the equipment where the specified database is located so that the equipment stores the problem table to the specified database.

In one exemplary implementation, the apparatus further comprises:

in the process of cleaning the intermediate table, sending the problem data in the intermediate table to a problem data backtracking device so that the problem data is modified by the problem data backtracking device to obtain corrected data;

and writing the corrected data into the target standard table.

The embodiment of the invention also provides data cleaning equipment. Fig. 4 is a hardware configuration diagram of a data cleansing apparatus according to an embodiment of the present invention. As shown in fig. 4, the data cleansing apparatus includes: an internal bus 401, and a memory 402, a processor 403, and an external interface 404 connected through the internal bus.

The processor 403 is configured to read the machine-readable instructions in the memory 402 and execute the instructions to implement the following operations:

extracting target metadata information of a target data table from a specified database, and copying the content of the target data table to an intermediate table in a data warehouse;

determining a target cleaning task of the target data table according to the target metadata information, wherein the target cleaning task comprises at least one cleaning rule;

in the data warehouse, cleaning operation is carried out on the intermediate table according to each cleaning rule in the target cleaning task to obtain a target standard table;

and sending the target standard table to the equipment where the specified database is located, so that the equipment stores the target standard table to the specified database.

In an exemplary implementation, processor 403 further executes the instructions to perform the following operations:

and generating a target cleaning task of the target data table according to a first cleaning rule.

In an exemplary implementation, the target metadata information includes a table field specifying attribute, and the processor 403 further executes the instructions to:

generating the target cleaning task of the target data table according to a first cleaning rule specifically comprises: generating a target cleaning task of the target data table according to the first cleaning rule and the second cleaning rule.

the finding out the target attribute matched with each field in the target metadata information from the established corresponding relationship between the table field and the designated attribute specifically comprises: and searching the target attribute matched with each field in the target metadata information from the corresponding relation between the updated table field and the specified attribute.

in the process of executing cleaning operation on the intermediate table, writing problem data in the intermediate table and a problem identification field corresponding to the problem data into the standard table; or,

and writing the problem data in the intermediate table into a problem table in the process of executing the cleaning operation on the intermediate table.

and sending the problem table to the equipment where the specified database is located, so that the equipment stores the problem table to the specified database.

and writing the corrected data into the target standard table.

An embodiment of the present invention further provides a computer-readable storage medium, where a plurality of computer instructions are stored on the computer-readable storage medium, and when executed, the computer instructions perform the following processing:

In one exemplary implementation, the computer instructions when executed further perform the following:

In one exemplary implementation, the target metadata information includes table field specifying attributes, and the computer instructions, when executed, further perform the following:

and writing the corrected data into the target standard table.

For the device and apparatus embodiments, as they correspond substantially to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, wherein the modules described as separate parts may or may not be physically separate, and the parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution in the specification. One of ordinary skill in the art can understand and implement it without inventive effort.

The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

Other embodiments of the present description will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This specification is intended to cover any variations, uses, or adaptations of the specification following, in general, the principles of the specification and including such departures from the present disclosure as come within known or customary practice within the art to which the specification pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the specification being indicated by the following claims.

It will be understood that the present description is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present description is limited only by the appended claims.

The above description is only a preferred embodiment of the present disclosure, and should not be taken as limiting the present disclosure, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims

1. A method for processing a data table, the method comprising:

2. The method according to claim 1, wherein the data cleansing device further determines a target cleansing task of the target data table according to the target metadata information, wherein the target cleansing task comprises at least one cleansing rule, and performs a cleansing operation on the intermediate table according to each cleansing rule in the target cleansing task to obtain a target standard table;

the method further comprises the following steps:

and receiving the target standard table sent by the data cleaning equipment, and storing the target standard table to the specified database.

3. The method of claim 2, wherein determining the target cleaning task for the target data table according to the target metadata information comprises:

the data cleaning equipment searches out target data elements matched with all fields in the target metadata information from the established corresponding relation between the table fields and the standard data elements;

the data cleaning equipment searches a first cleaning rule matched with the target data element from the established corresponding relation between the standard data element and the cleaning rule;

and the data cleaning equipment generates a target cleaning task of the target data table according to a first cleaning rule.

4. The method of claim 3, wherein after the data cleansing device finds the first cleansing rule matching the target data element from the established correspondence between the standard data element and the cleansing rule, the method further comprises:

the data cleaning equipment searches out target attributes matched with all fields in the target metadata information from the established corresponding relation between the table fields and the designated attributes;

the data cleaning equipment searches a second cleaning rule matched with the target attribute from the established corresponding relation between the attribute and the cleaning rule;

the data cleaning equipment generates a target cleaning task of the target data table according to a first cleaning rule, and the target cleaning task comprises the following steps: and the data cleaning equipment generates a target cleaning task of the target data table according to the first cleaning rule and the second cleaning rule.

5. The method according to claim 2, wherein the data cleansing device writes the problem data in the intermediate table and the problem identification field corresponding to the problem data into the target standard table during the cleansing operation performed on the intermediate table; or writing the problem data in the intermediate table into a problem table in the process of executing cleaning operation on the intermediate table;

the method further comprises the following steps:

and receiving the problem table sent by the data cleaning equipment, and storing the problem table to the specified database.

6. The method of claim 1, wherein the data cleansing device copying contents of the target data table to an intermediate table in a data warehouse, comprising:

the data cleaning equipment uses a data warehouse tool to locally create a middle database which has the same name as a designated database where the target data table is located and create a middle table which has the same table name and the same table structure as the target data table in the middle database according to the target metadata information;

and the data cleaning equipment reads the content data stored in the target data table and writes the read content data into the intermediate table.

7. The method of claim 1, wherein the metadata information includes general metadata information and table field specifying attributes; and the data cleaning equipment respectively stores the general metadata information and the table field designated attributes of the target data table, and respectively stores the general metadata information and the table field designated attributes of the intermediate table.

8. The method of claim 2, wherein the data cleansing process of the data cleansing device on the target row of the intermediate table comprises:

9. The method of claim 1, further comprising:

after confirming the metadata information of the intermediate table according to the metadata confirmation instruction, adding a scene identification bit in the configuration of the main key of the intermediate table by the data cleaning equipment, wherein the scene identification bit is used for distinguishing different application scenes.

10. A data table processing apparatus, the apparatus comprising:

11. A data table processing apparatus comprising a processor and a memory for storing executable instructions of the processor;

the processor is configured to: