CN111177200A - Data processing system and method

Info

Publication number: CN111177200A
Application number: CN201911421978.4A
Authority: CN
Inventors: 方磊; 王清臣; 武华亭
Original assignee: Beijing Zetyun Tech Co ltd
Current assignee: Beijing Zetyun Tech Co ltd
Priority date: 2019-12-31
Filing date: 2019-12-31
Publication date: 2020-05-19
Anticipated expiration: 2039-12-31
Also published as: CN111177200B

Abstract

The invention provides a data processing system and method. The data processing system comprises: an interface module for receiving a first input operation of a user on a data set interface to obtain a data set to be processed; an inference module for inferring type information of the data set to be processed; a first determining module for determining a target data processing strategy based on the type information of the data set to be processed; and a processing module for processing the data set to be processed using the target data processing strategy. Embodiments of the invention can simplify the data processing process and improve the applicability of data preparation.

Description

Data processing system and method

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a data processing system and a data processing method.

Background

The success of big data mining and analysis often depends on data preparation. At present, data preparation is often carried out manually. Because manual processing is limited, this conventional approach to data preparation has poor applicability and low efficiency.

Disclosure of Invention

The embodiment of the invention provides a data processing system and a data processing method, which aim to solve the problems of poor applicability and low efficiency of the conventional data preparation mode.

In order to solve the technical problem, the invention is realized as follows:

in a first aspect, an embodiment of the present invention provides a data processing system, including:

the interface module is used for receiving a first input operation of a user on a data set interface to obtain a data set to be processed;

an inference module for inferring type information of the data set to be processed;

the first determining module is used for determining a target data processing strategy based on the type information of the data set to be processed;

and the processing module is used for processing the data of the data set to be processed by utilizing the target data processing strategy.

Optionally, the system further includes:

a second determining module, configured to determine a target data set from the to-be-processed data set;

the first determining module is specifically configured to: determining the target data processing policy for the target data set based on the type information of the to-be-processed data set.

Optionally, the first determining module includes:

the first recommending unit is used for recommending a plurality of data processing strategies aiming at the target data set based on the type information of the data set to be processed;

a first determining unit, configured to determine an optimal policy of the multiple data processing policies as the target data processing policy.

Optionally, the interface module is further configured to: receiving a second input operation of the user on the strategy interface;

the first determining module includes:

the second recommending unit is used for recommending a plurality of data processing strategies aiming at the target data set based on the type information of the data set to be processed;

a second determination unit configured to determine a policy selected by a user from the plurality of data processing policies as the target data processing policy in response to the second input operation; or, in response to the second input operation, determining a second policy as the target data processing policy, where the second policy is obtained by a user adjusting a first policy of the multiple data processing policies.

Optionally, the interface module is further configured to receive a second input operation of the user on the strategy interface, and the first determining module is specifically configured to: in response to the second input operation, determine a third policy for the target data set, defined by the user based on the type information of the data set to be processed, as the target data processing policy.

Optionally, the system further includes:

a display module for displaying at least one of: the type information of the data set to be processed, the field information in the recommended data processing strategy, and the data lineage in the recommended data processing strategy.

Optionally, the type information includes at least one of:

the service type of each data set in the data set to be processed;

and the data base type and/or the service type of each column of data in the data set to be processed.

Optionally, the inference module is specifically configured to:

and deducing the service type of each data set in the data set to be processed based on a preset domain model.

Optionally, the inference module comprises:

the calling unit is used for sequentially calling the type inference functions corresponding to the pre-constructed data basic types based on the preset sequence;

and the first inference unit is used for inferring a data base type of each column of data in the data set to be processed based on the called type inference function.

Optionally, the inference module comprises:

and the second inference unit is used for inferring the service type of each column of data in the data set to be processed based on a multi-classification model established in advance.

Optionally, the second inference unit includes:

the processing subunit is configured to, after obtaining a data base type of each column of data in the to-be-processed data set, process the column data in the to-be-processed data set and the corresponding data base type into a feature vector;

and the inference subunit is used for inputting the feature vector into the pre-established multi-classification model and inferring the service type of each column of data in the data set to be processed.

Optionally, the domain model is a structural rule or a template of a data set in a corresponding domain.

Optionally, the target data processing policy includes at least one of:

a data quality processing strategy;

a data derivation processing strategy;

merging the data sets;

removing redundant columns.

Optionally, the execution sequence of the target data processing policy includes any one of:

the first step is a data quality processing strategy, the second step is merging the data sets, and the third step is a data derivation processing strategy;

the first step is a data quality processing strategy, and the second step is merging the data sets while simultaneously performing a data derivation processing strategy;

the first step is a data quality processing strategy, the second step is a data derivation processing strategy, and the third step is merging the data sets;

the first step is merging the data sets, the second step is a data quality processing strategy, and the third step is a data derivation processing strategy;

the first step is a data derivation processing strategy, the second step is merging the data sets, and the third step is a data quality processing strategy.

Optionally, the data set to be processed includes a target data set and a plurality of first data sets, and the target data processing policy includes at least one of:

a data quality handling policy for the target data set and/or the first data set;

a data derivation processing policy for the target data set and/or the first data set;

selecting at least two data sets from the data sets to be processed for merging;

and taking the target field in the target data set as a primary key field, and extracting related data from the data set to be processed and/or from the data set to be processed after processing, to form a merged data set.

Optionally, the data quality processing policy includes at least one of:

deleting a null value column, filling a missing value, removing duplication, sorting, filtering, deleting an abnormal row, setting an abnormal value to be null, rounding a numerical value and processing a date format.
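A few of the data quality operations listed above can be illustrated with a minimal sketch. It assumes a data set is held as a list of row dictionaries with hashable values; the function names (`drop_null_columns`, `fill_missing`, `deduplicate`) are hypothetical and are not taken from the embodiment.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]  # assumption: one row = column name -> value


def drop_null_columns(rows: List[Row]) -> List[Row]:
    """Delete columns whose value is None in every row."""
    if not rows:
        return []
    keep = [c for c in rows[0] if any(r.get(c) is not None for r in rows)]
    return [{c: r.get(c) for c in keep} for r in rows]


def fill_missing(rows: List[Row], column: str, default: Any) -> List[Row]:
    """Fill missing (None) values of one column with a default value."""
    return [
        {**r, column: default if r.get(column) is None else r[column]}
        for r in rows
    ]


def deduplicate(rows: List[Row]) -> List[Row]:
    """Remove exact duplicate rows, keeping the first occurrence."""
    seen = set()
    unique = []
    for r in rows:
        key = tuple(sorted(r.items()))  # column names are unique, so this sorts by name
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique
```

The three functions are independent, so they can be chained in whatever order a given data quality processing policy prescribes.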

Optionally, the data derivation processing policy includes at least one of:

numerical range marking, date extraction, and data aggregation.
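The three derivation operations named above might look as follows in a minimal sketch. The helper names, the bucket boundaries, and the choice of a sum as the aggregate are illustrative assumptions; the embodiment does not prescribe them.

```python
from collections import defaultdict
from datetime import date


def mark_range(value, bounds, labels):
    """Numerical range marking: label a value by the first upper bound it is below.

    `labels` must have one more entry than `bounds`; the last label is the
    catch-all for values at or above the largest bound.
    """
    for upper, label in zip(bounds, labels):
        if value < upper:
            return label
    return labels[-1]


def extract_date_parts(iso_day):
    """Date extraction: derive year and month columns from an ISO date string."""
    d = date.fromisoformat(iso_day)
    return {"year": d.year, "month": d.month}


def aggregate_sum(rows, key, value):
    """Data aggregation: total of `value` per distinct `key`."""
    totals = defaultdict(float)
    for r in rows:
        totals[r[key]] += r[value]
    return dict(totals)
```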

Optionally, the manner of merging the data sets includes any one of the following:

a join mode and a union (concatenation) mode.
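The difference between the two merging modes can be shown with a minimal sketch over lists of row dictionaries. The inner-join semantics and the handling of missing columns in the union are assumptions made for illustration; the embodiment does not fix them.

```python
def inner_join(left, right, key):
    """Join mode: combine rows from both data sets that share the same key value."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def union_rows(left, right):
    """Union mode: stack the rows of both data sets; absent columns become None."""
    all_rows = left + right
    columns = list(dict.fromkeys(c for r in all_rows for c in r))
    return [{c: r.get(c) for c in columns} for r in all_rows]
```

A join widens the table by adding columns from the second data set, whereas a union lengthens it by adding rows.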

Optionally, the manner of removing the redundant column includes any one of the following:

calculating correlation coefficients among different columns, and retaining only one of any columns whose correlation is higher than a preset value;

and performing principal component analysis and dimension reduction on the column data to obtain a number of columns within a preset range.
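The correlation-based option can be sketched as follows (the principal component analysis option is not shown). The sketch assumes numeric columns of equal length, uses the Pearson coefficient as the correlation measure, and takes 0.95 as the preset value; all three choices are assumptions made for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / (sx * sy)


def drop_redundant_columns(columns, threshold=0.95):
    """Keep a column only if it is not highly correlated with a column already kept."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept
```

Because columns are visited in order, the first column of each highly correlated group is the one that is retained.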

In a second aspect, an embodiment of the present invention provides a data processing method, including:

receiving a first input operation of a user on a data set interface to obtain a data set to be processed;

inferring type information of the dataset to be processed;

determining a target data processing strategy based on the type information of the data set to be processed;

and processing the data of the data set to be processed by using the target data processing strategy.

Optionally, before determining the target data processing policy based on the type information of the to-be-processed data set, the method further includes:

determining a target data set from the data set to be processed;

the determining a target data processing strategy based on the type information of the data set to be processed comprises:

determining the target data processing policy for the target data set based on the type information of the to-be-processed data set.

Optionally, the determining the target data processing policy for the target data set based on the type information of the data set to be processed includes:

recommending a plurality of data processing strategies for the target data set based on the type information of the data set to be processed;

determining an optimal policy of a plurality of data processing policies as the target data processing policy.

Optionally, the method further includes:

receiving a second input operation of the user on the strategy interface;

the determining the target data processing policy for the target data set based on the type information of the to-be-processed data set includes:

determining a policy selected by a user from the plurality of data processing policies as the target data processing policy in response to the second input operation;

or, in response to the second input operation, determining a second policy as the target data processing policy, where the second policy is obtained by a user adjusting a first policy of the multiple data processing policies.

Optionally, the method further includes:

receiving a second input operation of the user on the strategy interface;

in response to the second input operation, determining a third policy for the target data set, which is defined by a user based on the type information of the data set to be processed, as the target data processing policy.

Optionally, the method further includes:

displaying at least one of: the type information of the data set to be processed, the field information in the recommended data processing strategy, and the data lineage in the recommended data processing strategy.

Optionally, the type information includes at least one of:

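the service type of each data set in the data set to be processed;

and the data base type and/or the service type of each column of data in the data set to be processed.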

Optionally, the inferring type information of the to-be-processed data set includes:

sequentially calling type inference functions corresponding to all pre-constructed data basic types based on a preset sequence;

and deducing the data base type of each column of data in the data set to be processed based on the called type inference function.

Optionally, the inferring type information of the to-be-processed data set includes:

inferring the service type of each column of data in the data set to be processed based on a pre-established multi-classification model.

Optionally, the inferring a service type of each column of data in the to-be-processed data set based on a pre-established multi-classification model includes:

after the data base type of each column of data in the data set to be processed is obtained, processing the column data of the data set to be processed and the corresponding data base type into a feature vector;

and inputting the feature vector into the pre-established multi-classification model, and inferring the service type of each column of data in the data set to be processed.

Optionally, the target data processing policy includes at least one of:

a data quality processing strategy;

a data derivation processing strategy;

merging the data sets;

removing redundant columns.

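Optionally, the data quality processing policy includes at least one of:

deleting a null value column, filling a missing value, removing duplication, sorting, filtering, deleting an abnormal row, setting an abnormal value to be null, rounding a numerical value and processing a date format.

Optionally, the data derivation processing policy includes at least one of:

numerical range marking, date extraction, and data aggregation.

Optionally, the manner of merging the data sets includes any one of the following:

a join mode and a union (concatenation) mode.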

In a third aspect, an embodiment of the present invention provides a data processing system, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the computer program, when executed by the processor, implements the steps of the data processing method.

In a fourth aspect, the present invention provides a computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, can implement the steps of the data processing method.

In the embodiment of the invention, the type information of the data set to be processed is deduced, the target data processing strategy is determined based on the type information, and the data processing is carried out by utilizing the target data processing strategy, so that the data processing process can be simplified, and compared with the method for carrying out data preparation manually, the applicability and the efficiency of data preparation can be improved, and the method is favorable for subsequent model training (such as machine learning), business analysis, data mining and the like.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a block diagram of a data processing system according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a data set interface according to an embodiment of the present invention;

FIG. 3 is a block diagram of another data processing system according to an embodiment of the present invention;

FIG. 4 is a block diagram of another data processing system according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a policy interface according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of a selection interface for a target data table in accordance with an embodiment of the present invention;

FIG. 7 is a schematic illustration of another policy interface according to an embodiment of the present invention;

FIG. 8 is a flowchart of a data processing method according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, fig. 1 is a schematic structural diagram of a data processing system according to an embodiment of the present invention, and as shown in fig. 1, the data processing system 10 may include:

the interface module 11 is configured to receive a first input operation of a user on a data set interface, and obtain a data set to be processed.

In this embodiment, the data in the to-be-processed data set may be structured data or semi-structured data. It should be noted that the present embodiment mainly relates to structured data, and a data set in this case may be referred to as a data table.

Optionally, the structured or semi-structured data may be derived from a file type data source or from a database type data source. A file type data source is one whose data resides in a distributed file system such as HDFS (Hadoop Distributed File System) and/or in a stand-alone file system. File formats in the distributed file system may include, but are not limited to: CSV, TSV, TXT, Parquet, Excel, ORC, etc.; file formats in the stand-alone file system may include, but are not limited to: CSV, TSV, TXT, Parquet, Excel, etc. A database type data source is one whose data resides in a DBMS (Database Management System), including but not limited to at least one of the following databases: Oracle, DB2, SQL Server, MySQL, PostgreSQL, Hive, Teradata, Greenplum, GaussDB.

In one embodiment, the data set to be processed may be a plurality of data sets selected based on user input and/or automatically by the system. The plurality of data sets may be some or all of the created data sets, or some or all of the data sets in a certain domain. For example, referring to fig. 2, the data set to be processed may be obtained as follows: a list of created data sets is displayed in a designated area of the data set interface (such as the left column area of fig. 2); a selection operation performed by the user on that list is received; and the data sets indicated by the selection operation are screened out of the list to obtain the data set to be processed.

In another embodiment, the to-be-processed data set may specifically refer to a target data set. For example, the target data table is pre-designated by the user, or the target data table is recommended by default by the system.

Optionally, a data set may be created as follows. The data set interface includes a creation control for creating a data set. When an operation on the creation control is detected, a data set creation interface is displayed in a designated area of the data set interface in response, and the user creates a new data set by operating the data set creation interface. Through that interface, the user may either upload data from a data source into the data processing system to create the new data set, or add the access address of the data source so that the data processing system accesses the data source through the access address to create the new data set. It should be noted that, in the upload case, the data in the file type data source and/or database type data source must first be exported as a file; the user can then drag the file onto the data set creation interface to upload it to the data processing system, which completes the creation of the new data set. The file format of the exported file may include, but is not limited to, at least one of: CSV, TSV, TXT, XLS, ZIP, TAR.

an inference module 12, configured to infer the type information of the data set to be processed.

a first determining module 13, configured to determine a target data processing policy based on the type information of the to-be-processed data set.

a processing module 14, configured to perform data processing on the data set to be processed by using the target data processing policy.

In an embodiment of the present invention, the data processing system may infer an overall business type of each data set in the data set to be processed, and/or a data type of each column of data in the data set to be processed, where the data type includes, but is not limited to, a data base type, a business type, and the like. The type information inferred by the inference module 12 may include at least one of:

the service type of each data set in the data set to be processed;

the data base type and/or the business type of each column of data in the data set to be processed.

The service type (also called the overall service type) of each data set in the data set to be processed can be identified through a domain model. The inference module 12 is specifically configured to infer the service type of each data set in the data set to be processed based on a preset domain model. It is noted that data sets may be categorized by domain, where the domain is recommended by default by the system (here and hereinafter, the data processing system of this embodiment) or selected by the user. The user can customize and adjust the service type of a data set. The system may preset default domain models for different domains (e.g., models of data tables). For example, the model of the banking domain includes at least one of: a customer information table, a transaction flow table, a certificate table, a verification code table, and the like; the model of the e-commerce domain includes at least one of: a user table, a session table, a transaction table, a log table, etc.; the model of the traffic domain includes at least one of: a vehicle table, a violation table, a personnel table, and the like. In practical applications, the domains and domain models can be continuously extended, for example based on user settings; that is, a domain model is constructed from a domain newly set by the user and the tables in it.

In one embodiment, the domain model may be a structural rule or template of a data set (i.e., a data table) in the corresponding domain. For example, the structural rule of the user table may specify: which field names the table contains (such as a user identification ID, a user name, a user registration date, etc.), the data base type of each field, the service type of each field, and the like.

In another embodiment, when the business type of the data set to be processed is inferred based on the preset domain model, the candidate business types may be sorted by matching degree, and the best-matching business type is preferably presented.
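Sorting candidate business types by matching degree can be sketched as below. The sketch assumes that a domain model is reduced to a mapping from table type to expected field names, and that the matching degree is the Jaccard overlap of field names; both are illustrative assumptions, since the embodiment does not define how the matching degree is computed.

```python
def rank_business_types(field_names, domain_model):
    """Rank the table templates of a domain model by field-name overlap.

    `domain_model` maps a table type to the field names its template expects
    (a hypothetical, simplified form of a structural rule).
    Returns (table_type, score) pairs, best match first.
    """
    fields = {f.lower() for f in field_names}
    scored = []
    for table_type, template_fields in domain_model.items():
        template = {f.lower() for f in template_fields}
        union = fields | template
        score = len(fields & template) / len(union) if union else 0.0
        scored.append((table_type, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical e-commerce domain model with two table templates.
ECOMMERCE_MODEL = {
    "user table": ["user_id", "user_name", "register_date"],
    "transaction table": ["order_id", "user_id", "amount", "pay_time"],
}
ranking = rank_business_types(
    ["user_id", "user_name", "register_date", "gender"], ECOMMERCE_MODEL
)
best_type, best_score = ranking[0]
```

A fuller structural rule would also compare the data base type and service type of each field, not only the field names.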

Optionally, the data base type may include at least one of: integer type, long integer type, floating point type, double precision type, time type (for example, in a year-month-day hour-minute-second format such as YYYY-MM-DD HH:MM:SS), character string type, Boolean type, etc. The service type of the column data may include at least one of: a telephone number (e.g., a mobile phone number), an identification number, a zip code, a length of time, a date, an amount, a point in geographic coordinates, a geographical line in WKT format, a polygon, an English country name or ISO country code, an e-mail address, a temperature, a gender, a size, a weight, a user-defined business type, etc.

In the embodiment of the invention, the data base type of column data is an attribute of the data itself, whereas the service type of column data is an attribute with actual business meaning, so the data processing system can apply a targeted processing method to the data based on both. Optionally, the values of column data of each data base type satisfy a certain value condition. For example, for the integer type, the values are integers; for the floating point type, the values are decimals; for the Boolean type, the values are 0 or 1; and so on. A type inference function can therefore be constructed for each data base type from its value condition, and when the data base type of column data in the data set to be processed is to be inferred, the data base type of each column can be determined using the type inference functions corresponding to the data base types. As shown in fig. 3, the inference module 12 may include:

the calling unit 121 is configured to sequentially call a type inference function corresponding to each pre-constructed data base type based on a preset sequence;

a first inference unit 122, configured to infer a data base type of each column of data in the to-be-processed data set based on the called type inference function.

The preset sequence for calling the inference functions of each type may be a default calling sequence of the data processing system, or may also be a calling sequence set by the user based on user requirements.
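Calling type inference functions one after another in a preset sequence can be sketched as follows. The sketch assumes that column values arrive as strings, that an empty string marks a missing value, and that a column falls back to the character string type when no function accepts it; the function names and the default order are hypothetical.

```python
from datetime import datetime


def _all_parse(values, parser):
    """True if every value is accepted by `parser` (which raises ValueError otherwise)."""
    try:
        for v in values:
            parser(v)
    except ValueError:
        return False
    return True


def is_boolean(values):
    return all(v in ("0", "1") for v in values)


def is_integer(values):
    return _all_parse(values, int)


def is_float(values):
    return _all_parse(values, float)


def is_time(values):
    return _all_parse(values, lambda v: datetime.strptime(v, "%Y-%m-%d %H:%M:%S"))


# Assumed default calling sequence: narrower types are tried before wider ones.
DEFAULT_ORDER = [
    ("boolean", is_boolean),
    ("integer", is_integer),
    ("float", is_float),
    ("time", is_time),
]


def infer_base_type(values, order=None):
    """Return the first data base type whose inference function accepts the column."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"
    for type_name, accepts in (order or DEFAULT_ORDER):
        if accepts(non_empty):
            return type_name
    return "string"
```

The sequence matters: a column containing only "0" and "1" is reported as Boolean under the default order, but as integer if the user moves the integer function to the front.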

Optionally, since the service type of column data has an actual business meaning, the values of each service type follow a certain value rule that reflects that meaning. The application therefore provides a feasible way of inferring the service type: when the service type of column data is to be inferred, it is inferred from the value rules of the service types. For some complex data sets, in order to infer the service types with reasonable accuracy, the inference can also be combined with machine learning, using a machine learning model to infer the service type of the column data; that is, the service type of each column of data in the data set to be processed is inferred based on a multi-classification model established in advance. As shown in fig. 3, the inference module 12 may further include:

a second inference unit 123, configured to infer a service type of each column of data in the to-be-processed data set based on a multi-classification model established in advance.
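Before turning to the model-based route, the value-rule route mentioned above can be illustrated with a minimal sketch. The regular expressions (a simplified e-mail pattern, an 11-digit mobile number starting with 1, a 6-digit zip code) and the 90% match ratio are assumptions chosen for illustration only; they are not rules stated in the embodiment.

```python
import re

# Hypothetical value rules: service type -> pattern that a whole value must match.
VALUE_RULES = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "mobile_number": re.compile(r"1\d{10}"),
    "zip_code": re.compile(r"\d{6}"),
}


def infer_service_type(values, rules=VALUE_RULES, min_ratio=0.9):
    """Return the first service type whose rule matches enough of the column.

    Returns None when no rule reaches `min_ratio`, in which case a caller
    could fall back to a classification model.
    """
    non_empty = [v for v in values if v]
    if not non_empty:
        return None
    for service_type, pattern in rules.items():
        hits = sum(1 for v in non_empty if pattern.fullmatch(v))
        if hits / len(non_empty) >= min_ratio:
            return service_type
    return None
```

Requiring a ratio rather than a match on every value lets the rule tolerate a few dirty values in an otherwise well-formed column.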

The pre-established multi-classification model can be obtained through a machine learning model training process, which mainly comprises two stages: a data preparation stage and a model training stage. In the data preparation stage, a large amount of column data is obtained and labeled with business type labels, and the labeled column data forms a sample set. The sample set can be divided into two parts: one is used as a training sample set for training the model, and the other is used as a test sample set for testing the trained model. In the model training stage, the model is trained using the training sample set.
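The division of the labeled sample set into a training part and a test part can be sketched as follows. The 80/20 ratio, the fixed random seed, and the representation of a sample as a (column value, business type label) pair are assumptions made for illustration.

```python
import random


def split_samples(samples, test_ratio=0.2, seed=0):
    """Shuffle labeled samples and split them into (training set, test set)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical labeled samples: (example column value, business type label).
labeled = [(f"user{i}@example.com", "email") for i in range(5)]
labeled += [(f"1380013800{i}", "mobile_number") for i in range(5)]
train_set, test_set = split_samples(labeled)
```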

Further, since there is a certain relationship between the data base type and the service type of column data, the data base type can provide prior information for inferring the service type. To improve the speed and accuracy with which the multi-classification model infers the service type of column data, inferring the service type based on the pre-established multi-classification model specifically includes: after the data base type of each column of data in the to-be-processed data set is obtained, processing the column data and the corresponding data base type into a feature vector (the manner of processing column data into a feature vector can adopt the prior art and is not limited in this embodiment); and then inputting the feature vector into the pre-established multi-classification model to infer the service type of each column of data in the data set to be processed.
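One way to turn column data plus its data base type into a feature vector, and to classify that vector, is sketched below. The embodiment leaves both the feature construction and the model open, so everything here is an assumption: the one-hot encoding of the base type, the three hand-picked statistics, and the nearest-centroid classifier standing in for the multi-classification model.

```python
import math

BASE_TYPES = ["integer", "float", "time", "string"]  # assumed, simplified list


def to_feature_vector(values, base_type):
    """One-hot base type followed by simple statistics of the column values."""
    one_hot = [1.0 if base_type == t else 0.0 for t in BASE_TYPES]
    count = max(len(values), 1)
    text = "".join(values)
    avg_len = sum(len(v) for v in values) / count
    digit_ratio = sum(ch.isdigit() for ch in text) / max(len(text), 1)
    at_ratio = sum("@" in v for v in values) / count
    return one_hot + [avg_len / 20.0, digit_ratio, at_ratio]


class NearestCentroid:
    """Stand-in multi-class model: predicts the label with the closest mean vector."""

    def fit(self, vectors, labels):
        sums, counts = {}, {}
        for vec, label in zip(vectors, labels):
            acc = sums.setdefault(label, [0.0] * len(vec))
            for i, x in enumerate(vec):
                acc[i] += x
            counts[label] = counts.get(label, 0) + 1
        self.centroids = {
            label: [x / counts[label] for x in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, vec):
        return min(self.centroids, key=lambda label: math.dist(vec, self.centroids[label]))


# Hypothetical training columns: (values, data base type, service type label).
TRAINING = [
    (["13800138000", "13912345678"], "integer", "mobile_number"),
    (["a@x.com", "bob@y.org"], "string", "email"),
    (["100080", "200030"], "integer", "zip_code"),
]
model = NearestCentroid().fit(
    [to_feature_vector(v, t) for v, t, _ in TRAINING],
    [label for _, _, label in TRAINING],
)
```

Placing the base type in the vector is what gives the model the prior information described above: two integer columns are already close to each other before any value statistics are compared.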

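That is, the second inference unit 123 described above may include:

a processing subunit, configured to, after the data base type of each column of data in the to-be-processed data set is obtained, process the column data and the corresponding data base type into a feature vector;

an inference subunit, configured to input the feature vector into the pre-established multi-classification model and infer the service type of each column of data in the data set to be processed.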

In the embodiment of the present invention, in order to make the processing policy definite, when the data set to be processed comprises a plurality of data sets, a target data set may first be determined from among them, and a target data processing policy may be determined for the target data set. In addition, the target data set may also be recommended by default by the data processing system or pre-specified by the user. There may be a plurality of target data sets, and one or more corresponding data processing policies may be recommended for each target data set.

Optionally, as shown in fig. 4, the data processing system further includes:

a second determining module 15, configured to determine a target data set from the to-be-processed data set.

Further, the first determining module 13 is specifically configured to: determining the target data processing policy for the target data set based on the type information of the to-be-processed data set.

In one embodiment, the data processing system may select the target data set based on the overall business type of the data set to be processed.

In another embodiment, the data processing system may select a target data set based on statistical analysis of the domain model of the data set to be processed in the corresponding domain; for example, rules for selecting a target data set may be derived from statistical analysis of data set names, the fields in the data sets, and the like.

In another embodiment, the data processing system may select the target data set based on semantic analysis. Specifically, the name of the data set may be semantically analyzed by using a preset semantic analysis rule to determine the target data set. For example, a data set named "transaction table" may be determined as a target data table and a data set named "transaction detail table" may be determined as a non-target data table (e.g., as an auxiliary table) using semantic analysis rules.
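A name-based selection rule of this kind can be sketched with simple keyword matching, which is only a rough stand-in for semantic analysis. The subject keyword and the list of auxiliary markers are hypothetical; the embodiment does not state the actual rules.

```python
# Hypothetical markers indicating that a table plays an auxiliary role.
AUXILIARY_MARKERS = ("detail", "log", "code")


def pick_target_tables(table_names, subject="transaction"):
    """Split table names into (target tables, auxiliary tables) by keyword rules."""
    targets, auxiliary = [], []
    for name in table_names:
        lowered = name.lower()
        is_subject = subject in lowered
        is_auxiliary = any(marker in lowered for marker in AUXILIARY_MARKERS)
        if is_subject and not is_auxiliary:
            targets.append(name)
        else:
            auxiliary.append(name)
    return targets, auxiliary
```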

In another embodiment, the data processing system may select the target data set based on an input operation by a user.

As for the order in which the data set to be processed is obtained and its data types are inferred: the data set to be processed may be obtained first and the data types inferred afterwards; the data types may be inferred first and the data set to be processed obtained afterwards; or the two may be performed simultaneously. This embodiment does not limit the order. In this embodiment, it is preferable to perform data type inference on the created data sets first, and then obtain the data set to be processed.

It should be noted that the data processing system in this embodiment may automatically recommend a plurality of data processing policies and select a target data processing policy (such as an optimal policy) based on user selection or system default. Generally, if the user clicks execute directly without making a selection, the system executes its default optimal policy. In one possible implementation, the following steps are also performed: displaying at least one data processing policy in a recommended policy management area of a policy interface; in response to a policy selection operation of the user in the recommended policy management area, presenting the selected data processing policy in a target policy management area of the policy interface; and determining the data processing policy presented in the target policy management area as the data processing policy for processing the data set to be processed. For example, in the policy interface shown in fig. 5, the recommended policy management area on the left contains the system-recommended policies and other policies, and the target policy management area on the right contains the policy selected by the user, here the system's single default optimal policy. When the user clicks to expand this optimal policy, the displayed policy comprises a plurality of steps: abnormal row deletion, automatic filling of null values, and automatic correction of date formats.

In addition, the user may also adjust the policies and/or customize the policies and save the policies for next use or for use by other users. Specifically, the system may provide some basic processing methods, such as the specific steps in fig. 5, and the user may select specific steps to form a new policy to customize the policy (specifically, may be a customized policy for the target data set), or adjust specific steps in the policy to adjust the policy.
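As a minimal sketch of how a policy could be assembled from basic steps, the following Python models a policy as an ordered list of step functions applied to a table held as a list of row dictionaries. The step names mirror the steps shown in fig. 5; the function names, the `amount` and `date` field names, the rule for what counts as an abnormal row, and the accepted date formats are assumptions made for illustration and are not taken from the embodiment.

```python
from datetime import datetime
from statistics import median

def delete_abnormal_rows(rows):
    # Assumed rule: a row is abnormal when its amount is present but negative.
    return [r for r in rows if r["amount"] is None or r["amount"] >= 0]

def fill_null_values(rows):
    # Fill missing amounts with the median of the known amounts.
    known = [r["amount"] for r in rows if r["amount"] is not None]
    fill = median(known)
    return [{**r, "amount": fill if r["amount"] is None else r["amount"]} for r in rows]

def correct_date_format(rows):
    # Normalize the assumed input formats to YYYY-MM-DD.
    def normalize(text):
        for fmt in ("%Y-%m-%d", "%Y/%m/%d"):
            try:
                return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
            except ValueError:
                continue
        return text  # leave unrecognized values untouched
    return [{**r, "date": normalize(r["date"])} for r in rows]

def apply_policy(rows, steps):
    # A policy is an ordered list of steps; each step maps rows to rows.
    for step in steps:
        rows = step(rows)
    return rows

optimal_policy = [delete_abnormal_rows, fill_null_values, correct_date_format]
# A user-customized policy can reuse, drop, or reorder the same basic steps.
custom_policy = [fill_null_values, correct_date_format]

table = [
    {"amount": 10.0, "date": "2019/12/31"},
    {"amount": None, "date": "2019-12-30"},
    {"amount": -5.0, "date": "2019/12/29"},
    {"amount": 30.0, "date": "2019-12-28"},
]
cleaned = apply_policy(table, optimal_policy)
customized = apply_policy(table, custom_policy)
```

Because each step takes and returns a list of rows, saving a policy for later use or for other users amounts to storing the ordered list of step names.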

Optionally, as shown in fig. 4, the first determining module 13 may include:

a first recommending unit 131, configured to recommend a plurality of data processing strategies for the target data set based on the type information of the data set to be processed;

a first determining unit 132, configured to determine an optimal policy of the multiple data processing policies as the target data processing policy.

In this way, the optimal strategy can be automatically recommended based on the system, so that the data can be optimally processed.

Optionally, the interface module 11 may further be configured to: and receiving a second input operation of the user on the strategy interface. As shown in fig. 4, the first determining module 13 may further include:

a second recommending unit 133, configured to recommend a plurality of data processing strategies for the target data set based on the type information of the data set to be processed;

a second determining unit 134 for determining a policy selected by a user from the plurality of data processing policies as the target data processing policy in response to the second input operation; or, in response to the second input operation, determining a second policy as the target data processing policy, where the second policy is obtained by a user adjusting a first policy of the multiple data processing policies.

In addition, when there is a second input operation of the user on the policy interface, the first determining module 13 may be further specifically configured to: in response to the second input operation, determining a third policy for the target data set, which is defined by a user based on the type information of the data set to be processed, as the target data processing policy.

It is understood that, in a specific implementation, the first recommending unit 131 and the second recommending unit 133 may be the same unit. In this way, selecting a target data processing policy in conjunction with user input operations may improve the effectiveness of the selected policy.
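The three ways of determining the target data processing policy described above (system default, user selection, user adjustment) can be sketched as follows. This is an illustrative sketch only: the function name `determine_target_policy`, the dictionary layout of a policy, and the assumption that the recommended list is ordered best-first are all hypothetical.

```python
def determine_target_policy(recommended, selected_name=None, adjusted_steps=None):
    """Pick the target policy from the recommended list.

    recommended: list of policy dicts, assumed to be sorted best-first.
    selected_name: name chosen by the user, or None for the system default.
    adjusted_steps: optional replacement steps supplied by the user.
    """
    if selected_name is None:
        chosen = recommended[0]  # system default: the optimal policy
    else:
        matches = [p for p in recommended if p["name"] == selected_name]
        if not matches:
            raise ValueError("unknown policy: %r" % selected_name)
        chosen = matches[0]
    if adjusted_steps is not None:
        # The user adjusted the chosen (first) policy, yielding a second policy.
        chosen = {"name": chosen["name"] + " (adjusted)", "steps": list(adjusted_steps)}
    return chosen

recommended = [
    {"name": "optimal", "steps": ["delete_abnormal_rows", "fill_nulls", "fix_dates"]},
    {"name": "suboptimal", "steps": ["fill_nulls"]},
]
default_policy = determine_target_policy(recommended)
user_policy = determine_target_policy(recommended, "suboptimal")
adjusted_policy = determine_target_policy(recommended, "optimal", ["fill_nulls", "fix_dates"])
```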

In one embodiment, the data processing system when recommending data processing strategies relies on content including, but not limited to, at least one of: the overall business type of the target data set, the business type of a unique column in the target data set (the value of each row in the unique column is different and can be automatically identified by the system), the data base type and/or the business type of each column of data in each data set to be processed.

Further, as shown in fig. 4, the data processing system 10 may further include:

a display module 16 for displaying at least one of: the type information of the data set to be processed, the field information in the recommended data processing policy, the data lineage in the recommended data processing policy, and the like.

The field information in the recommended data processing policy includes but is not limited to: field name, field meaning, field calculation method, etc. The data lineage in the recommended data processing policy includes, but is not limited to: data lineage at the data set level, data lineage at the field level, and so on. In this way, it is convenient for the user to understand the required policy and related information.

In the embodiment of the present invention, the data processing policy recommended by the system may include at least two types: one is used for improving data quality (data quality processing), that is, simply performing data processing, for example, filling missing values in a certain column based on a median, discarding rows where blank values in a certain column are located, and the like; the other type is used for deriving data, that is, performing data derivation processing, for example, splitting a date column, aggregating values of a certain column, and the like. Further, merging data sets, removing redundant columns, and the like may also be included.

Optionally, the target data processing policy includes at least one of:

a data quality processing strategy;

a data derivation processing strategy;

merging the data sets;

redundant columns are removed.

Wherein, in the case that the target data processing policy includes a data quality processing policy, a data derivation processing policy and a merged data set, the execution order of the target data processing policy may include, but is not limited to, any one of:

Scheme (1): data quality processing is performed on each data set to be processed, the processed data sets are then merged, and data derivation processing is then performed (the data derivation processing may be performed synchronously with the merging of the processed data sets); scheme (2): the data sets are merged first, and data quality processing and data derivation processing are then performed; scheme (3): data quality processing and data derivation processing are performed first, and the processed data sets are then merged. Scheme (1) is preferred in this embodiment.

In one embodiment, policy steps for improving data quality may be recommended based on the data type and data quality of the column data, and then policy steps for derivation may be recommended based on the data type and optionally an aggregation function of the column data. The data type of the column data in this manner may include only a data base type or a service type, or may include both a data base type and a service type.

Optionally, if the to-be-processed data set includes a target data set and a plurality of first data sets, the target data processing policy may further include at least one of the following:

taking a target field in the target data set as a primary key field, and extracting related data from the data set to be processed and/or the processed data set to be processed to form a merged data set. The target field used as the primary key field is selectable; it may be a unique column field in the target data set, where the value of each row in the unique column field uniquely identifies a record in the corresponding data set. The primary key field serves as the connection field (a field shared by the data sets being merged) for generating an intermediate merged data set or a final merged data set, and the related data to extract is determined according to the target data processing policy.

taking a target field in the target data set as a primary key field, extracting related data from the data set to be processed and/or the processed data set to be processed to form a merged data set, and synchronously applying a data derivation processing policy to the target data set and/or the first data sets.

Further, the data quality handling policy may include, but is not limited to, at least one of: null column delete, missing value fill, deduplication, sorting, filtering, exception row delete, setting of exception value to null, numeric rounding processing (e.g., round up, round down), date format processing (e.g., format unification processing), and so on. And the data derivation processing policy may include, but is not limited to, at least one of: a numerical range labeling process, a date extraction process (such as extracting a year, month, day, etc.), a data aggregation process, and the like.

The data aggregation processing may include, but is not limited to, at least one of the following: Mean (the average value of a column of data), Sum (the sum of a column of data), Count (the number of values in a column of data), Max (the maximum value of a column of data), Min (the minimum value of a column of data), Variance (the variance of a column of data), Standard Deviation (the standard deviation of a column of data), Mode (the mode of a column of data), Median (the median of a column of data), Distinct (the number of distinct values in a column of data), Interquartile Range (IQR, a descriptive statistic equal to the difference between the third quartile and the first quartile of a column of data), and the like.
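For illustration, the aggregation functions listed above can be computed for a single numeric column with Python's standard `statistics` module. The function name `aggregate_column` is hypothetical. Two choices here are assumptions rather than anything stated in the embodiment: the variance and standard deviation are the sample versions, and the quartiles use the module's default ("exclusive") method, so the IQR may differ slightly under other quartile conventions.

```python
import statistics

def aggregate_column(values):
    """Return the listed aggregates for one numeric column."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(values),
        "sum": sum(values),
        "count": len(values),
        "max": max(values),
        "min": min(values),
        "variance": statistics.variance(values),  # sample variance (assumed)
        "stddev": statistics.stdev(values),       # sample standard deviation
        "mode": statistics.mode(values),
        "median": statistics.median(values),
        "distinct": len(set(values)),
        "iqr": q3 - q1,
    }

stats = aggregate_column([2, 4, 4, 5, 7, 9, 10, 12])
```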

Further, the manner of merging the data sets may include any one of the following: a join mode and a union mode. A merged data set (such as a wide table) can be generated by join or union, that is, a single data set is generated by expanding the data sets horizontally or splicing them vertically, depending on the type of the data sets. In the join mode, a plurality of data sets are combined into one data set according to a connection field (a field having the same meaning in each data set, such as a user ID), that is, the data set is expanded horizontally (columns are added). In the union mode, fields of the same type in a plurality of data sets are selected, and the data sets are spliced vertically (rows are added) to form one data set. It can be understood that a join is appropriate if several tables share some fields but also have a large number of differing fields, and a union is appropriate if the fields of the two tables are in an inclusion relationship (the fields of one table contain those of the other).
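The difference between the two merging modes can be sketched in a few lines of Python over tables held as lists of row dictionaries. The helper names `join_tables` and `union_tables` and the sample fields are hypothetical; the join shown keeps every left-hand row (a left join), which is one possible reading of the join mode and not necessarily the one used by the system.

```python
def join_tables(left, right, key):
    """Horizontal expansion: add right-hand columns to matching left rows."""
    index = {row[key]: row for row in right}  # assumes key is unique in `right`
    merged = []
    for row in left:
        match = index.get(row[key], {})
        merged.append({**match, **row})
    return merged

def union_tables(*tables):
    """Vertical splicing: keep the fields common to all tables, stack the rows."""
    common = set(tables[0][0])  # assumes every table has at least one row
    for table in tables[1:]:
        common &= set(table[0])
    fields = sorted(common)
    return [{f: row[f] for f in fields} for table in tables for row in table]

users = [{"user_id": 1, "name": "A"}, {"user_id": 2, "name": "B"}]
profiles = [{"user_id": 1, "zip_code": "60091"}]
joined = join_tables(users, profiles, "user_id")   # columns are added

january = [{"user_id": 1, "amount": 5.0, "coupon": "X"}]
february = [{"user_id": 2, "amount": 7.0}]
stacked = union_tables(january, february)          # rows are added
```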

Further, the manner of removing redundant columns may include any one of the following: 1) calculating correlation coefficients between different columns and, among columns whose correlation is higher than a preset value, retaining only one; 2) performing Principal Component Analysis (PCA) dimensionality reduction on the column data to obtain a preset range of columns. For the PCA dimensionality reduction method and the preset range of columns, reference may be made to existing methods, which are not limited in this embodiment. In this way, by calculating the relationships between columns (fields) to remove redundant columns, the problem of dimension explosion can be avoided.
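A minimal sketch of the first manner, correlation-based removal, is given below. It uses the Pearson correlation coefficient and a threshold of 0.95; both are assumptions, since the embodiment does not specify which correlation coefficient or preset value is used. The function and column names are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)  # assumes neither column is constant

def drop_redundant_columns(columns, threshold=0.95):
    """Keep the first column of every highly correlated group."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) <= threshold for other in kept.values()):
            kept[name] = values
    return kept

columns = {
    "amount_usd": [10.0, 20.0, 30.0, 40.0],
    "amount_cents": [1000.0, 2000.0, 3000.0, 4000.0],  # perfectly correlated
    "quantity": [3.0, 1.0, 4.0, 1.0],
}
reduced = drop_redundant_columns(columns)
```

Here `amount_cents` is dropped because it carries the same information as `amount_usd`, while `quantity` is retained.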

In one embodiment, taking a transaction table as an example, the transaction table includes the following fields: transaction ID, session ID, transaction time, product ID, transaction amount. The system-recommended policy for improving data quality may then include: 1) deleting records whose transaction amount is empty; 2) converting the transaction time into a uniform time format; 3) converting the transaction amount into a uniform currency unit. And/or, the system-recommended data derivation policy may include: 1) extracting the day of the week from the transaction time as new columns (e.g., adding 7 columns for Monday to Sunday, whose values 0 and 1 identify whether the transaction occurred on that day of the week); 2) extracting the hour of the day from the transaction time as new columns (e.g., adding 24 columns for hours 0 to 23, whose values 0 and 1 identify the time period in which the transaction occurred); 3) using the SUM aggregation function to calculate the transaction amount for each day, for each day of the week, and for each time period of the day.
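The derivation steps for the transaction table can be sketched as follows. The field names (`transaction_time`, `amount`), the timestamp format, and the generated column names (`is_mon`, `hour_9`, and so on) are hypothetical; the sketch only illustrates the idea of 0/1 indicator columns and SUM aggregation.

```python
from collections import defaultdict
from datetime import datetime

WEEKDAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]

def derive_time_columns(row):
    """Add 7 weekday flags and 24 hour flags derived from the transaction time."""
    ts = datetime.strptime(row["transaction_time"], "%Y-%m-%d %H:%M:%S")
    derived = dict(row)
    for i, day in enumerate(WEEKDAYS):
        derived["is_" + day] = 1 if ts.weekday() == i else 0
    for hour in range(24):
        derived["hour_%d" % hour] = 1 if ts.hour == hour else 0
    return derived

def sum_by(rows, key_func):
    """SUM aggregation of the transaction amount grouped by key_func(row)."""
    totals = defaultdict(float)
    for row in rows:
        totals[key_func(row)] += row["amount"]
    return dict(totals)

transactions = [
    {"transaction_id": 1, "transaction_time": "2019-12-30 09:15:00", "amount": 20.0},
    {"transaction_id": 2, "transaction_time": "2019-12-30 18:40:00", "amount": 5.5},
    {"transaction_id": 3, "transaction_time": "2019-12-31 09:05:00", "amount": 10.0},
]
derived = [derive_time_columns(t) for t in transactions]
daily_total = sum_by(transactions, lambda r: r["transaction_time"][:10])
hourly_total = sum_by(transactions, lambda r: int(r["transaction_time"][11:13]))
```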

It should be noted that, in the present embodiment, the manner of data quality processing and the manner of data derivation processing may differ for different data types. Data quality processing only changes the quality of the data and adds no new data (such as new rows or columns); for example, if a numerical column contains a blank value, the blank value can be filled with the average or median of the other values in that column. Data derivation adds new columns, changes the number of rows, or adds additional tables based on the original data set; for example, if an original column is a date in the format YYYY-MM-dd, new columns such as year, month, day, and weekday can be derived from it.

In addition, the specific data processing mode may be different for different data types. Take missing value filling as an example: the filling method differs between the numerical type, the category type, and so on. The data base type may include a numerical type and a category type. The numerical type includes one of the following base types: integer, long integer, floating point, double precision. The category type may include one of the following base types: time type, character string type, boolean type, etc., and its values are drawn from a finite set.
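As a hedged illustration of type-dependent missing value filling, the sketch below fills numeric columns with the median and category columns with the most frequent value. These two filling rules, the function name `fill_missing`, and the type labels `"numeric"` and `"category"` are assumptions chosen for the example; the embodiment only states that the filling method differs by type.

```python
from statistics import median, mode

def fill_missing(values, base_type):
    """Fill None entries using a rule chosen by the column's data base type."""
    known = [v for v in values if v is not None]
    if base_type == "numeric":
        fill = median(known)   # numeric columns: median of the known values
    elif base_type == "category":
        fill = mode(known)     # category columns: most frequent value
    else:
        raise ValueError("unsupported base type: %r" % base_type)
    return [fill if v is None else v for v in values]

ages = fill_missing([30, None, 50, 40], "numeric")
devices = fill_missing(["Mobile", None, "Desktop", "Mobile"], "category")
```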

For the merged data set and the data quality and/or derivation process, the data set may be merged first and then the data quality and/or derivation process is performed, or the data quality and/or derivation process may be performed first and then the data set is merged; it is also possible to merge data sets and synchronize the data quality and/or derivation process.

In one embodiment, data quality processing may be performed on each data table separately, for example, null column deletion, missing value padding, deduplication, sorting, filtering, exception row deletion, setting the exception value to null, etc.; then merging the processed data tables; and finally, performing further data processing, such as data derivation processing, according to the strategy.

Optionally, the data derivation processing policy may include: performing aggregation processing based on the data set to be processed or the processed merged data set to generate an auxiliary data set, where the aggregation processing may specifically be data statistics. That is, an auxiliary table is generated by performing aggregation processing (for example, statistical operations) on the plurality of data tables or on the processed merged data table. For example, records may be counted per region based on zip code or per person based on identification number, and the sum and average of monetary amounts may be calculated.

For example, generating the auxiliary table based on the aggregation process (e.g., statistical operations) may be as shown in tables 1 and 2 below. Table 1 is part of the example data of the employee table, and its fields indicate, in order, the employee ID, the department ID, the date of job entry, and the date of departure. Through the aggregation process, the number of persons currently in each department can be counted, giving the statistical result (part of the example data) shown in table 2.

TABLE 1

Employee ID	Department ID	Date of job entry	Date of departure
10001	d001	2010/6/7	2015/6/7
10002	d001	2010/6/8	9999/1/1
10003	d001	2010/6/9	9999/1/1
10004	d001	2010/6/10	9999/1/1
10005	d002	2010/6/11	9999/1/1
10006	d006	2010/6/12	9999/1/1
10007	d007	2010/6/13	9999/1/1
10008	d008	2010/6/14	9999/1/1
10009	d002	2010/6/7	9999/1/1
10010	d006	2010/6/8	9999/1/1
10011	d007	2010/6/9	9999/1/1
10012	d008	2010/6/10	9999/1/1
10013	d002	2010/6/11	9999/1/1
10014	d006	2010/6/12	9999/1/1
10015	d007	2010/6/13	9999/1/1
10016	d008	2010/6/14	9999/1/1
10017	d002	2010/6/7	9999/1/1
10018	d006	2010/6/8	9999/1/1
10019	d007	2010/6/9	2019/10/21
10020	d008	2010/6/10	2019/10/22
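The per-department headcount aggregation over the employee table can be sketched as below, using a few rows copied from Table 1. The sketch assumes that a departure date of 9999/1/1 is a sentinel meaning the employee has not left; this reading is an assumption, and the function name `headcount_by_department` is hypothetical. The result covers only the sample rows and is not a reproduction of Table 2.

```python
from collections import Counter

STILL_EMPLOYED = "9999/1/1"  # assumed sentinel for "has not left"

# (employee ID, department ID, date of job entry, date of departure)
employees = [
    ("10001", "d001", "2010/6/7", "2015/6/7"),
    ("10002", "d001", "2010/6/8", "9999/1/1"),
    ("10003", "d001", "2010/6/9", "9999/1/1"),
    ("10005", "d002", "2010/6/11", "9999/1/1"),
    ("10009", "d002", "2010/6/7", "9999/1/1"),
    ("10007", "d007", "2010/6/13", "9999/1/1"),
    ("10019", "d007", "2010/6/9", "2019/10/21"),
]

def headcount_by_department(rows):
    """Count employees per department whose departure date is the sentinel."""
    current = (dept for _emp, dept, _start, end in rows if end == STILL_EMPLOYED)
    return dict(Counter(current))

headcount = headcount_by_department(employees)
```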

TABLE 2

The data processing procedure of the embodiment of the present invention is described below with reference to tables 3 to 11.

In the embodiment of the present invention, the corresponding data processing process may mainly include the following steps:

s1: selecting a plurality of data tables to be processed; tables 3 to 7 (only a part of each table is exemplified) are shown below.

Table 3 (session table)

Table 4 (user table)

Table 5 (transaction table)

Table 6 (log table)

Table 7 (product detail table)

S2: the data processing system (hereinafter referred to as the system) performs data type inference on the selected data tables, including the overall business type of each data table and the data base type and business type of each column of data in each data table. Specifically, the system may infer, or the user may select, the e-commerce domain, and the system then infers the overall business type of each data table: table 3 is a session table, table 4 is a user table, table 5 is a transaction table, table 6 is a log table, and table 7 is a product detail table.

S3: the system recommends a target data table. Optionally, the system may select one or more target data tables based on the overall business type of the data tables; alternatively, the system may select the recommended target data table based on statistical analysis of the corresponding domain. For example, in the e-commerce field, statistical analysis may indicate that the user table is more important and meaningful and should be a target data table, whereas the log table should not.

S4: based on the result of the data type inference in S2, the system recommends a corresponding data processing policy for each target data table.

S5: based on the data processing policies recommended in S4, the target data processing policy is selected, and data processing is then performed. In one embodiment, data quality processing may first be performed on each data table to improve its quality; for example, if data is missing from a data table, the missing data may be filled, and if the data quality of a data table is already good, no data quality processing is needed. The data tables are then merged into a wide table. Merging requires a connection field: the system automatically identifies a unique column as the connection field, and a wide table can be generated for each unique column. Data derivation is performed synchronously while the data tables are merged, that is, the generated wide table has already undergone data derivation.

In addition, in conjunction with the interface operations, the data processing process of the embodiments of the present invention may include the steps of:

Step one: the system selects a target data table. Optionally, among the multiple data tables to be processed selected by the user, the system may detect a target data table based on the overall business type of each table, and present the detected target data table. The user can also customize the target data table or adjust the target data table recommended (automatically selected) by the system.

In one embodiment, the user interface for selecting a target data table may be as shown in FIG. 6 (the target table is the target data table in the figure). If the user characteristics are considered, the user table can be selected as a target data table; if the session characteristics are considered, a session table can be selected as a target data table; if the transaction characteristics are considered, the transaction table can be selected as the target data table. The target data table may also be customized by the user.

Step two: for the selected target data table, the system recommends corresponding data processing policies, and the user can then select the target data processing policy (such as the optimal policy).

In one embodiment, taking the user table in table 4 above as the target data table, the corresponding interface may be as shown in fig. 7, where the left side shows the optimal policy, the suboptimal policy and other policies recommended by the system, and the right side shows the policy selected by the user, here the system's default optimal policy.

It will be appreciated that the strategy shown in FIG. 7 above is a simple example only. For the user table in table 4 as the target data table, merging with other data tables, performing data processing, and the like, the specific strategy may further include the following steps:

firstly, the following processing is executed for the zip code in the data table (such as the zip code of the user table):

1. checking and deleting abnormal values in the zip code;

2. the zip code is replaced with the actual address.

Secondly, the following processing is executed for the dates in each data table:

1. checking for unreasonable dates;

2. and deleting the records corresponding to unreasonable dates.

Thirdly, the following processing is executed for the commodity ID in the data table (for example, the commodity ID of the transaction table):

1. unreasonable IDs are deleted (an ID that does not exist in the product detail table is considered unreasonable);

2. the commodity ID is replaced with the actual product name.

Fourthly, the data tables are merged and data derivation processing is performed. For example, using the user ID field (a unique column of the target data table) in the user table (the target data table) as the primary key field, relevant information is extracted from the other data tables, and table merging and data derivation processing are performed; the resulting data table may be as shown in table 8 (which shows some of the fields of the data table). Specifically, the user table and the session table are merged into an intermediate merged data table (namely an intermediate merged data set) using the user ID as the connection field; the intermediate merged data table is then merged with the transaction table using the session ID, with data derivation processing performed synchronously, to obtain the final merged data table.
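The two-stage merge described above (user table to session table on user ID, then to the transaction table on session ID, with derivation performed in the same pass) can be sketched as follows. All field names and the derived columns `session_count` and `total_amount` are hypothetical, since the contents of tables 3 to 5 and table 8 are not reproduced here.

```python
from collections import defaultdict

users = [{"user_id": 1, "zip_code": "60091"}, {"user_id": 2, "zip_code": "13244"}]
sessions = [
    {"session_id": 10, "user_id": 1, "device": "Mobile"},
    {"session_id": 11, "user_id": 1, "device": "Desktop"},
    {"session_id": 12, "user_id": 2, "device": "Tablet"},
]
transactions = [
    {"transaction_id": 100, "session_id": 10, "amount": 20.0},
    {"transaction_id": 101, "session_id": 11, "amount": 5.0},
    {"transaction_id": 102, "session_id": 12, "amount": 7.5},
]

def group_by(rows, key):
    """Index rows by the value of one field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return groups

# Stage 1: join users and sessions on user_id to form the intermediate table.
sessions_by_user = group_by(sessions, "user_id")
intermediate = [
    {**user, "session_ids": [s["session_id"] for s in sessions_by_user[user["user_id"]]]}
    for user in users
]

# Stage 2: join the intermediate table with transactions on session_id and
# derive aggregate columns in the same pass.
tx_by_session = group_by(transactions, "session_id")
final = []
for row in intermediate:
    amounts = [t["amount"] for sid in row["session_ids"] for t in tx_by_session[sid]]
    final.append({
        "user_id": row["user_id"],
        "zip_code": row["zip_code"],
        "session_count": len(row["session_ids"]),
        "total_amount": sum(amounts),
    })
```

The final table keeps one row per value of the primary key field (the user ID), which is what makes it usable as a wide table for the target data table.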

TABLE 8

Further, in the embodiment of the present invention, an auxiliary table may be generated by performing aggregation processing. For example, with respect to the transaction table in table 5 above as the target data table, the sales amount of each product can be counted based on the aggregation process, and the statistical result can be shown in table 9 below:

TABLE 9

product_id	Amount
1	307.14
2	159.42
3	135.05
4	43.59
5	44.11

For another example, for the session table in table 3 above as the target data table, the number of users of each device may be counted based on the aggregation process, and the statistical result may be shown in table 10 below:

TABLE 10

Device	count_customer
Tablet	3
Mobile	3
Desktop	4

For another example, for the user table in table 4 above as the target data table, the number of users per zip code (zip_code) may be counted based on the aggregation process, and the statistical result may be shown in the following table 11:

TABLE 11

Zip_code	count_customer
60091	3
13244	2

The data processing system of the present invention is explained in the above embodiments, and the data processing method of the present invention will be explained in conjunction with the embodiments and the drawings.

Referring to fig. 8, an embodiment of the present invention further provides a data processing method, where the method includes the following steps:

Step 801: receiving a first input operation of a user on a data set interface to obtain a data set to be processed.

Step 802: inferring the type information of the data set to be processed.

Step 803: determining a target data processing policy based on the type information of the data set to be processed.

Step 804: processing the data of the data set to be processed by using the target data processing policy.

In the embodiment of the invention, the type information of the data set to be processed is deduced, the target data processing strategy is determined based on the type information, and the data processing is carried out by utilizing the target data processing strategy, so that the data processing process can be simplified, and compared with the data preparation by manually defining the strategy, the applicability and the efficiency of the data preparation can be improved, and the method is favorable for subsequent model training (such as machine learning), business analysis, data mining and the like.

Optionally, before the step 803, the method further includes:

and determining a target data set from the data set to be processed.

And said step 803 comprises: determining the target data processing policy for the target data set based on the type information of the to-be-processed data set.

Further, the determining the target data processing policy for the target data set based on the type information of the data set to be processed includes:

Optionally, the method further includes:

receiving a second input operation of the user on the strategy interface;

Optionally, the method further includes:

receiving a second input operation of the user on the strategy interface;

Optionally, the method further includes:

Optionally, the type information includes at least one of:

the service type of each data set in the data set to be processed;

Optionally, step 802 includes: and deducing the service type of each data set in the data set to be processed based on a preset domain model.

Optionally, step 802 specifically includes:

sequentially calling type inference functions corresponding to all pre-constructed data basic types based on a preset sequence; and deducing the data base type of each column of data in the data set to be processed based on the called type inference function.
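Calling type inference functions in a preset order can be sketched as follows. The particular set of base types, their order (most specific first, with string as the fallback), the accepted date format, and all function names are assumptions for illustration; the embodiment does not specify them.

```python
from datetime import datetime

def is_boolean(v):
    return v.lower() in ("true", "false")

def is_integer(v):
    try:
        int(v)
        return True
    except ValueError:
        return False

def is_float(v):
    try:
        float(v)
        return True
    except ValueError:
        return False

def is_date(v):
    try:
        datetime.strptime(v, "%Y-%m-%d")
        return True
    except ValueError:
        return False

# Preset order: more specific types are tried first; "string" is the fallback.
INFERENCE_ORDER = [("boolean", is_boolean), ("integer", is_integer),
                   ("float", is_float), ("date", is_date)]

def infer_base_type(column):
    """Return the first base type whose inference function accepts every value."""
    values = [v for v in column if v != ""]  # ignore empty cells
    for type_name, matches in INFERENCE_ORDER:
        if values and all(matches(v) for v in values):
            return type_name
    return "string"
```

The order matters: every integer string also parses as a float, so the integer check must run before the float check for an all-integer column to be labeled correctly.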

Optionally, step 802 specifically includes:

Optionally, the target data processing policy includes at least one of:

a data quality processing strategy;

a data derivation processing strategy;

merging the data sets;

redundant columns are removed.

Optionally, the data quality processing policy includes at least one of:

Optionally, the data derivation processing policy includes at least one of:

a join mode and a union mode.

In addition, an embodiment of the present invention further provides a data processing system, which includes a memory, a processor, and a computer program that is stored in the memory and can be run on the processor, where the computer program, when executed by the processor, can implement each process of the data processing method embodiment, and can achieve the same technical effect, and details are not repeated here to avoid repetition.

The embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements each process of the data processing method embodiment, and can achieve the same technical effect, and is not described herein again to avoid repetition.

Computer-readable media, which include both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for causing a data processing system (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims

1. A data processing system, comprising:

2. The system of claim 1, further comprising:

3. The system of claim 2, wherein the first determining module comprises:

4. The system of claim 1, wherein the type information comprises at least one of:

the service type of each data set in the data set to be processed;

5. The system of claim 1, wherein the target data processing policy comprises at least one of:

a data quality processing strategy;

a data derivation processing strategy;

merging the data sets;

redundant columns are removed.

6. A data processing method, comprising:

inferring type information of the dataset to be processed;

7. The method of claim 6, wherein before determining a target data processing policy based on the type information of the dataset to be processed, the method further comprises:

determining a target data set from the data set to be processed;

8. The method of claim 7, wherein determining the target data processing policy for the target data set based on the type information of the pending data set comprises:

9. The method of claim 6, wherein the type information comprises at least one of:

the service type of each data set in the data set to be processed;

10. The method of claim 6, wherein the target data processing policy comprises at least one of:

a data quality processing strategy;

a data derivation processing strategy;

merging the data sets;

redundant columns are removed.