WO2020140662A1

WO2020140662A1 - Data table filling method, apparatus, computer device, and storage medium

Info

Publication number: WO2020140662A1
Application number: PCT/CN2019/122323
Authority: WO
Inventors: 蔡健; 杨镭; 黄北辰; 郭凌峰; 付晓
Original assignee: 深圳壹账通智能科技有限公司
Priority date: 2019-01-02
Filing date: 2019-12-02
Publication date: 2020-07-09
Also published as: CN109783788A

Abstract

Provided is a data table filling method, comprising: obtaining a data table uploaded by a user; determining the name of an incomplete field in the data table, the incomplete field name missing a data value; according to the degree of association between the incomplete field name and other field names in the data table, determining a missing type of incomplete field name; according to the data values already in the data table, calculating the missing value according to a filling method corresponding to the missing type; according to the missing value, filling in the missing data value of the incomplete field name.

Description

Data table filling method, device, computer equipment and storage medium

This application claims priority to Chinese patent application No. 201910001784.2, filed with the China Patent Office on January 2, 2019 and titled "Data Table Filling Method, Device, Computer Equipment, and Storage Medium", the entire content of which is incorporated herein by reference.

Technical field

The present application relates to a data table filling method, device, computer equipment and storage medium.

Background technique

Report data is the data held in data tables and is one of the most common forms of data in practical applications, for example loan business data, human resources data, and insurance business data. It can be used for data analysis or to generate reports for users. However, improper operation, system failures, human factors, and similar causes inevitably leave some data values missing from report data.

The inventors realized that existing commercial data reporting platforms usually either leave the missing data values in a data table unprocessed or directly delete the samples that have missing data values. Doing so often distorts the distribution of the report data across the whole data table and reduces the accuracy of data analysis.

Summary of the invention

According to various embodiments disclosed in the present application, a data table filling method, device, computer device, and storage medium are provided.

A data table filling method includes:

obtaining a data table uploaded by a user;

determining an incomplete field name in the data table, the incomplete field name being a field name for which a data value is missing;

determining the missing type of the incomplete field name according to the correlation between the incomplete field name and the other field names in the data table;

calculating a missing value from the existing data values in the data table, using the filling method corresponding to the missing type; and

filling in the missing data value of the incomplete field name with the missing value.

A data table filling device includes:

a data table acquisition module, configured to obtain a data table uploaded by a user;

an incomplete field name determination module, configured to determine an incomplete field name in the data table, the incomplete field name being a field name for which a data value is missing;

a missing type determination module, configured to determine the missing type of the incomplete field name according to the correlation between the incomplete field name and the other field names in the data table;

a missing value calculation module, configured to calculate a missing value from the existing data values in the data table, using the filling method corresponding to the missing type; and

a filling module, configured to fill in the missing data value of the incomplete field name with the missing value.

A computer device includes a memory and one or more processors. The memory stores computer-readable instructions. When the computer-readable instructions are executed by the one or more processors, the one or more processors perform the following steps:

Obtain the data table uploaded by the user;

One or more non-volatile computer-readable storage media storing computer-readable instructions. When the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:

Obtain the data table uploaded by the user;

The details of one or more embodiments of the application are set forth in the drawings and description below. Other features and advantages of this application will become apparent from the description, drawings, and claims.

Brief description of the drawings

In order to more clearly explain the technical solutions in the embodiments of the present application, the following will briefly introduce the drawings required in the embodiments. Obviously, the drawings in the following description are only some embodiments of the present application. Those of ordinary skill in the art can obtain other drawings based on these drawings without creative efforts.

FIG. 1 is an application scenario diagram of a data table filling method according to one or more embodiments.

FIG. 2 is a schematic flowchart of a data table filling method according to one or more embodiments.

FIG. 3 is a schematic flowchart of the step of calculating the missing value from the existing data values in the data table, using the filling method corresponding to the missing type, according to one or more embodiments.

FIG. 4 is a schematic flowchart of the step of calculating the missing value from the existing data values in the data table, using the filling method corresponding to the missing type, according to one or more other embodiments.

FIG. 5 is a schematic flowchart of the step of calculating the missing value from the existing data values in the data table, using the filling method corresponding to the missing type, according to yet another embodiment.

FIG. 6 is a block diagram of a data table filling device according to one or more embodiments.

FIG. 7 is a block diagram of a computer device according to one or more embodiments.

Detailed description

In order to make the technical solutions and advantages of the present application more clear, the following describes the present application in further detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present application, and are not used to limit the present application.

The data table filling method provided by this application can be applied in the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 through a network. The terminal 102 can obtain the data table uploaded by the user and send the data table to the server 104. The server 104 calculates the correlation between the field names included in the data table and feeds back the correlation between any two field names to the terminal 102. After determining the incomplete field names for which data values are missing in the data table, the terminal 102 determines the missing type of each incomplete field name according to the correlation between that incomplete field name and the other field names in the data table. The terminal 102 may then calculate the missing value from the existing data values in the data table, using the filling method corresponding to the missing type of the incomplete field name, fill in the missing data value of the incomplete field name with the missing value, and send the filled data table to the server 104. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, or portable wearable device. The server 104 may be implemented as an independent server or as a server cluster composed of multiple servers.

In one embodiment, as shown in FIG. 2, a data table filling method is provided. Taking the application of the method to the terminal in FIG. 1 as an example, the method includes the following steps:

Step 202: Obtain the data table uploaded by the user.

The data table is a structured data table, for example in CSV (Comma-Separated Values) format. A CSV data table stores table data in plain text, and the stored table data includes numeric and character types. Specifically, a web interface may be provided through which the user uploads the data table, so that the terminal obtains the data table uploaded by the user. In one embodiment, each user needs to generate the data table containing report data according to a preset file format or table template, so that the terminal can parse the table structure information of the uploaded data table.

Table 1 below shows an example of a CSV-format data table uploaded in one embodiment.

Table 1

As can be seen from Table 1 above, the elements in each row of the data table are separated by commas. The elements in the first row are the column names, also called the headers or field names of the data table. The remaining elements in each column are the data values corresponding to that column's field name, so one field name corresponds to multiple data values. From the second row onward, each row represents one sample in the data table; Table 1 shows four samples.

Step 204: Determine the incomplete field name in the data table. The incomplete field name is missing a data value.

The incomplete field name is the field name where the data value is missing in the data table, and accordingly, the complete field name is the field name where the data value is not missing in the data table. For example, in Table 1 above, field names that belong to incomplete field names include: education, loan amount, and field names that belong to complete field names include: name, gender, age, region, loan time, and ID number.

Specifically, after acquiring the data table uploaded by the user, the terminal may identify every field name in the data table for which a data value is missing, that is, every incomplete field name.

In one embodiment, determining the incomplete field names in the data table includes: counting the number of data values corresponding to each field name in the data table; determining the total number of samples in the data table; and, when the number of data values is less than the total number of samples, determining that the field name is an incomplete field name.

Specifically, for each field name included in the data table, the terminal may count the number of data values corresponding to that field name and count the total number of samples included in the data table. When the number of data values corresponding to a field name is less than the total number of samples, the field name is missing a data value, and the terminal determines that it is an incomplete field name.

For example, in Table 1 above, the terminal traverses the data values corresponding to the field name "education" and increases the count by 1 each time it finds a data value that is not "NULL". After all the samples in the data table have been traversed, the count of data values for the field name "education" is 3, while the total number of samples is 4, so the field name "education" is determined to be an incomplete field name. Similarly, the field name "loan amount" can also be determined to be an incomplete field name.
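The counting procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch: the sample rows, the column names, and the set of markers treated as "missing" (an empty cell or the string "NULL") are assumptions, since the contents of Table 1 are not reproduced in this text.

```python
import csv
import io

# Hypothetical sample data; the rows and column names are illustrative only.
SAMPLE_CSV = """name,age,education,loan_amount
A,30,Master,200000
B,41,NULL,
C,25,Undergraduate,80000
D,36,Junior College,120000
"""

# Assumed encodings of a missing data value.
MISSING_MARKERS = {"", "NULL"}


def find_incomplete_fields(csv_text):
    """Return field names whose count of present data values is below the sample count."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    total_samples = len(rows)
    incomplete = []
    for field in reader.fieldnames or []:
        # Count the data values that are actually present under this field name.
        present = sum(
            1 for row in rows if (row[field] or "").strip() not in MISSING_MARKERS
        )
        if present < total_samples:
            incomplete.append(field)
    return incomplete


incomplete_fields = find_incomplete_fields(SAMPLE_CSV)
print(incomplete_fields)
```

With the sample data above, "education" has 3 present values and "loan_amount" has 3 present values out of 4 samples, so both are reported as incomplete.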

Step 206: Determine the missing type of the incomplete field name according to the correlation between the incomplete field name and other field names in the data table.

The correlation represents an implicit connection between two field names. The greater the correlation between two field names, the stronger the connection between them; conversely, the smaller the correlation, the weaker the connection. For example, in Table 1 above, house prices in Beijing, Shenzhen, and Shanghai are generally higher than in other regions, so when the "region" where the lender is located is one of these cities, the "loan amount" is also generally higher. This indicates an implicit connection between the field names "region" and "loan amount".

The missing type describes the possible connection between the field name for which a data value is missing and the other field names. Determining the missing type of an incomplete field name makes it possible to fill in the missing data values with a corresponding filling method. The missing types are completely random missing, random missing, and non-random missing. It should be noted that the missing type of an incomplete field name may be both random missing and non-random missing, in which case the terminal may calculate the missing value for that incomplete field name with whichever corresponding filling method is needed.

Specifically, after determining the incomplete field names in the data table, the terminal may calculate the correlation between the incomplete field name to be filled and each of the other field names in the data table, and determine the missing type of that incomplete field name according to the correlation.

In one embodiment, step 206, determining the missing type of the incomplete field name according to the correlation between the incomplete field name and the other field names in the data table, includes: when the correlations between the incomplete field name and all the other field names in the data table are less than a first preset value, determining that the missing type of the incomplete field name is completely random missing; when the correlation between the incomplete field name and at least one complete field name in the data table is greater than a second preset value, determining that the missing type of the incomplete field name is random missing; and when the correlation between the incomplete field name and at least one other incomplete field name in the data table is greater than a third preset value, determining that the missing type of the incomplete field name is non-random missing.

Specifically, the terminal may set corresponding thresholds, calculate the correlation between the incomplete field name to be filled and each of the other field names in the data table, and determine the missing type of the incomplete field name by comparing the correlations with the thresholds.

If the correlations between the incomplete field name to be filled and all the other field names are less than the first preset value, there is no implicit connection between the incomplete field name and the other field names, so the missing data value of the incomplete field name is not associated with the data values of the other field names. The missing value can then be calculated without referring to the data values of the other field names, and the missing type of the incomplete field name is determined to be completely random missing.

If the correlation between the incomplete field name to be filled and at least one complete field name in the data table is greater than the second preset value, there is an implicit connection between the incomplete field name and that complete field name, so the missing data value of the incomplete field name is associated with the data values of that complete field name. Calculating the missing value then requires referring to the data values of the at least one complete field name, and the missing type of the incomplete field name is determined to be random missing.

If the correlation between the incomplete field name to be filled and at least one other incomplete field name in the data table is greater than the third preset value, there is an implicit connection between the two incomplete field names, so the missing data value of the incomplete field name to be filled is associated with the data values of the other incomplete field name. Calculating the missing value then requires referring to the data values of the at least one other incomplete field name, and the missing type of the incomplete field name is determined to be non-random missing.
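The threshold rules above can be illustrated with a small Python sketch. The function name, the threshold values (0.2, 0.5, 0.5), and the use of the absolute value of the correlation are all assumptions made for illustration; the text only states that three preset values exist. The function returns a set of labels because, as noted earlier, a field name may be both random missing and non-random missing.

```python
def classify_missing_type(correlations, complete_fields, incomplete_fields,
                          first=0.2, second=0.5, third=0.5):
    """Classify the missing type of one incomplete field name.

    correlations: dict mapping each other field name to its correlation
        with the incomplete field name being filled.
    complete_fields / incomplete_fields: the other field names, split by
        whether they themselves have missing data values.
    first, second, third: hypothetical preset values (assumptions).
    """
    # Assumption: the strength of a connection is the magnitude of the correlation.
    strength = {name: abs(value) for name, value in correlations.items()}

    if all(s < first for s in strength.values()):
        return {"completely random missing"}

    types = set()
    if any(strength.get(name, 0.0) > second for name in complete_fields):
        types.add("random missing")
    if any(strength.get(name, 0.0) > third for name in incomplete_fields):
        types.add("non-random missing")
    # An empty set means the correlations fall between the preset values,
    # a case the rules above do not cover.
    return types


print(classify_missing_type({"region": 0.7, "education": 0.1},
                            complete_fields=["region"],
                            incomplete_fields=["education"]))
```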

In one of the embodiments, the data table filling method further includes a step of calculating the correlation: computing the mean and standard deviation corresponding to each field name in the data table; and, according to the mean and standard deviation, calculating the correlation between any two field names based on the following formula:

$$\rho_{(X,Y)} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$$

where ρ_{(X,Y)} represents the correlation between the field name X and the field name Y; μ_X represents the mean corresponding to the field name X; μ_Y represents the mean corresponding to the field name Y; σ_X represents the standard deviation corresponding to the field name X; σ_Y represents the standard deviation corresponding to the field name Y; and E[(X - μ_X)(Y - μ_Y)] is the expected value of Z, where Z_i = (X_i - μ_X)(Y_i - μ_Y).

Specifically, the terminal may obtain the data values of the incomplete field name X and compute the mean μ_X of all of them and, correspondingly, obtain the data values of another field name Y and compute the mean μ_Y of all of them. The standard deviations corresponding to the field names X and Y are then calculated from the relationship between the standard deviation and the mean, using the following formula (shown for X; σ_Y is calculated in the same way):

$$\sigma_X = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(X_i - \mu_X)^2}$$

Here the field name X has a total of N data values, and X_i represents the i-th data value corresponding to the field name X. The terminal can then calculate each data value of Z from the calculated means and the data values of the field names, the i-th data value of Z being (X_i - μ_X)(Y_i - μ_Y), and take the mean of the data values of Z as the expected value.
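A minimal Python sketch of the correlation calculation described above (means, standard deviations, then the expected value of Z divided by the product of the standard deviations). It assumes population statistics (division by N) and assumes that the two lists contain only the samples in which both field names have a data value; the function name is hypothetical.

```python
import math


def pearson_correlation(xs, ys):
    """Correlation between two numeric fields from paired data values."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    mu_x = sum(xs) / n
    mu_y = sum(ys) / n
    # Standard deviations computed from the means (division by N is an assumption).
    sigma_x = math.sqrt(sum((x - mu_x) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - mu_y) ** 2 for y in ys) / n)
    if sigma_x == 0 or sigma_y == 0:
        raise ValueError("correlation is undefined for a constant field")
    # Expected value of Z, where Z_i = (X_i - mu_x) * (Y_i - mu_y).
    expected_z = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / n
    return expected_z / (sigma_x * sigma_y)


print(pearson_correlation([1, 2, 3, 4], [2, 4, 6, 8]))
```

A perfectly linear increasing relationship gives a value of approximately 1, and a perfectly linear decreasing relationship gives approximately -1.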

In one embodiment, when calculating the correlation between an incomplete field name and another field name, if the data value types of both field names are numeric, the data values of the two field names can be used directly to calculate the correlation. If the data value type of at least one of the two field names is character, the terminal can first collect the enumeration values of that field name and assign a corresponding numeric data value to each enumeration value, thereby converting the character data values to numeric data values, and then calculate the correlation based on the converted data values.

For example, for the field name "education" in the data table, the enumeration values collected for the field name are: Ph.D., Master, Undergraduate, Junior College, Technical Secondary School, Junior High School, and unknown. These can be converted in turn into corresponding data values such as 6, 5, 4, 3, 2, 1, and 0, or 100, 80, 70, 60, 50, 20, and 0, and the correlation is then calculated based on the converted data values. The ordering of the converted data values should be consistent with the ordering of the character data values before conversion.
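The conversion of character data values to numeric data values might look like the following Python sketch. The two scales are the ones listed in the example above; the dictionary and function names are hypothetical. The final check illustrates the requirement that both scales preserve the same ordering of the enumeration values.

```python
# Numeric scale taken from the example above.
EDUCATION_SCALE = {
    "Ph.D.": 6,
    "Master": 5,
    "Undergraduate": 4,
    "Junior College": 3,
    "Technical Secondary School": 2,
    "Junior High School": 1,
    "unknown": 0,
}

# Alternative scale from the same example, assigned in the same order.
ALTERNATIVE_SCALE = dict(zip(EDUCATION_SCALE, [100, 80, 70, 60, 50, 20, 0]))


def encode(values, scale):
    """Convert character data values to numeric data values using a scale."""
    return [scale[value] for value in values]


def same_ordering(scale_a, scale_b):
    """Check that two scales rank the enumeration values identically."""
    return sorted(scale_a, key=scale_a.get) == sorted(scale_b, key=scale_b.get)


print(encode(["Master", "unknown", "Ph.D."], EDUCATION_SCALE))
print(same_ordering(EDUCATION_SCALE, ALTERNATIVE_SCALE))
```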

Step 208: Calculate the missing value from the existing data values in the data table, using the filling method corresponding to the missing type.

Specifically, after determining the missing type of the incomplete field name to be filled, the terminal can calculate the missing value for that incomplete field name from the existing data values in the data table, using the filling method corresponding to the missing type. The existing data values in the data table fall roughly into two categories: the data values corresponding to the incomplete field name itself, and the data values corresponding to the field names related to the incomplete field name.

In one of the embodiments, the missing type is completely random missing, and step 208, calculating the missing value from the existing data values in the data table using the filling method corresponding to the missing type, includes: when the data value type corresponding to the incomplete field name is character, computing the median of the existing data values of the incomplete field name and using the median as the missing value, or computing the mode of the existing data values of the incomplete field name and using the mode as the missing value; and when the data value type corresponding to the incomplete field name is numeric, computing the mean of the existing data values of the incomplete field name and using the mean as the missing value.

Specifically, when the missing type of the incomplete field name is completely random missing, there is little connection between the missing data value of the incomplete field name and the existing data values of the other field names in the data table, so the terminal can calculate the missing value from the existing data values of the incomplete field name itself.

A data value type of character means that the data values corresponding to the field name are character strings, and a data value type of numeric means that the data values corresponding to the field name are purely numeric. For example, in Table 1 above, the data value type of the complete field name "age" is numeric, that of the incomplete field name "education" is character, and that of the incomplete field name "loan amount" is numeric.

When the terminal determines that the missing type of the incomplete field name to be filled is completely random missing and the data value type corresponding to the incomplete field name is character, the terminal can compute the median of the existing data values of the incomplete field name and use the median as the missing value; alternatively, the terminal can compute the mode of the existing data values of the incomplete field name and use the mode as the missing value.

When the terminal determines that the missing type of the incomplete field name to be filled is completely random missing and the data value type corresponding to the incomplete field name is numeric, the terminal can compute the mean of the existing data values of the incomplete field name and use the mean as the missing value.
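The choice between mean, mode, and median described above can be sketched as follows in Python. The function name and the `rank` parameter are hypothetical. Because character values have no numeric order of their own, this sketch assumes that a median is taken only when a ranking of the enumeration values is supplied (for example a scale like the one used for "education"), and it takes the lower of the two middle values when the count is even; otherwise the mode is used.

```python
import statistics


def missing_value_completely_random(existing_values, value_type, rank=None):
    """Compute the missing value for a completely random missing field.

    existing_values: the data values already present under the field name.
    value_type: "numeric" or "character".
    rank: optional dict giving an order to character values (assumption);
        when given, the median by rank is used instead of the mode.
    """
    if not existing_values:
        raise ValueError("no existing data values to compute a missing value from")
    if value_type == "numeric":
        return statistics.mean(existing_values)
    if value_type == "character":
        if rank is not None:
            ordered = sorted(existing_values, key=lambda value: rank[value])
            # Lower median, so the result is always an existing data value.
            return ordered[(len(ordered) - 1) // 2]
        return statistics.mode(existing_values)
    raise ValueError("value_type must be 'numeric' or 'character'")


print(missing_value_completely_random([100, 200, 300], "numeric"))
print(missing_value_completely_random(["Master", "Undergraduate", "Master"], "character"))
```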

Step 210: Fill in the missing data value of the incomplete field name according to the missing value.

Specifically, after calculating the missing value corresponding to each incomplete field name in the data table according to steps 202 to 208 above, the terminal can fill in the missing data value of each incomplete field name with its respective missing value. The filled data table no longer has missing data values, which makes data analysis or statistics based on the filled data table more convenient.

In the above data table filling method, after the data table uploaded by the user is obtained, the incomplete field names for which data values are missing in the data table are determined, and the missing type of each incomplete field name is determined according to the correlation between that incomplete field name and the other field names in the data table. The missing value for the incomplete field name is then calculated from the existing data values in the data table, using the filling method corresponding to the missing type, and the missing data value of the incomplete field name is filled in with the missing value. Following these steps, the missing data value of every incomplete field name in the data table can be filled in, so the data table is effectively completed and the accuracy of data analysis based on the filled data table is significantly improved.

As shown in FIG. 3, in one of the embodiments, the missing type is a completely random missing; step 208, according to the existing data value in the data table, calculating the missing value according to the filling method corresponding to the missing type includes:

Step 302: Determine the first type of samples in the data table, that is, the samples for which the data value corresponding to the incomplete field name is missing;

Step 304: Determine the second type of samples in the data table, that is, the samples for which the data value corresponding to the incomplete field name exists;

Samples are the data entries recorded in the data table, and each sample has its own data value under each field name. A sample of the first type is a sample for which the data value corresponding to the incomplete field name to be filled is missing, and a sample of the second type is a sample for which that data value exists. For example, in Table 1 above, for the incomplete field name "loan amount" to be filled, the second sample belongs to the first type, and the first, third, and fourth samples belong to the second type; for the incomplete field name "region" to be filled, the fourth sample belongs to the first type, and the first, second, and third samples belong to the second type.

Step 306: Count the number of samples of the first type;

Step 308: Calculate the ratio of that number of samples to the total number of samples;

Specifically, the missing type of the incomplete field name to be filled is completely random missing, which means that there is little connection between the incomplete field name to be filled and the other field names in the data table. The terminal can count the number of samples of the first type and calculate the ratio of that number to the total number of samples in the data table.

Step 310, when the ratio is greater than the threshold, the data value of the first type of sample under the incomplete field name is replaced with the first value; the data value of the second type of sample under the incomplete field name is replaced with the second value.

When the ratio is greater than the threshold, many samples in the data table are missing the data value corresponding to the incomplete field name to be filled. For example, the threshold can be set to 50%. If more than half of the samples are missing the data value under the incomplete field name, data analysis and data statistics will inevitably be affected. Because the incomplete field name has little connection with the other field names, the terminal can binarize the data values corresponding to the incomplete field name: the data value of each sample of the first type under the incomplete field name is replaced with the first value, and the data value of each sample of the second type under the incomplete field name is replaced with the second value.

For example, suppose the terminal determines that the missing type of the incomplete field name "ID number" in the data table is completely random missing and finds that more than half of the samples belong to the first type, that is, more than half of the samples are missing the data value under the field name "ID number". The terminal can then replace the data value under the field name "ID number" with "1" for the samples in which the data value exists, and with "0" for the samples in which the data value is missing. Although a large number of data values are missing and the data values are unrelated to the other existing data values in the data table, replacing the original data values in this way retains some information compared with directly deleting all the data values under the incomplete field name.
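Steps 302 to 310 can be sketched in Python as follows. The function name, the default 50% threshold, and the markers treated as missing are assumptions for illustration; the values 0 (missing) and 1 (present) follow the "ID number" example above.

```python
def to_presence_indicator(column, threshold=0.5, missing_markers=("", "NULL", None)):
    """Replace a mostly missing column by a presence indicator.

    column: the data values of all samples under one incomplete field name.
    threshold: hypothetical ratio above which the column is binarized.
    Returns the column unchanged when the missing ratio is not above the threshold.
    """
    if not column:
        return []
    # Samples of the first type are those whose data value is missing.
    is_missing = [value in missing_markers for value in column]
    ratio = sum(is_missing) / len(column)
    if ratio > threshold:
        # Missing data values become 0, existing data values become 1.
        return [0 if missing else 1 for missing in is_missing]
    return list(column)


# Three of four samples are missing the value, so the column is binarized.
print(to_presence_indicator(["ID-001", "NULL", "", "NULL"]))
```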

As shown in FIG. 4, in one of the embodiments, the missing type is random missing; in step 208, calculating the missing value according to the filling method corresponding to the missing type according to the existing data value in the data table includes:

Step 402: Determine the complete field name related to the incomplete field name;

Specifically, when the missing type of the incomplete field name is random missing, the incomplete field name is related to at least one complete field name in the data table, and the terminal may determine the complete field names related to the incomplete field name to be filled according to step 206.

Step 404: Cluster the samples in the data table according to the data values of the complete field name to obtain clusters;

Specifically, after determining the at least one complete field name in the data table that is related to the incomplete field name to be filled, the terminal may cluster all samples in the data table according to the data values corresponding to the at least one complete field name to obtain clusters.

In one embodiment, the terminal may cluster all samples according to the similarity between their data values under the at least one complete field name, or the terminal may map the data values corresponding to the complete field name to multiple categories and then cluster the samples according to the category of each data value.

For example, for the complete field name "working years", which is related to the incomplete field name "year-end bonus", the terminal can cluster all the samples in the data table according to "working years": samples with 1 or 2 working years are grouped into one category, samples with 3 to 5 working years into another, samples with 6 to 8 working years into another, and samples with more than 8 working years into another. When multiple complete field names are related to "year-end bonus", the data values corresponding to those complete field names can be combined to cluster the samples in the data table and obtain the clusters.

Step 406: Determine the third-type samples in the data table, that is, the samples for which the data value corresponding to the incomplete field name is missing;

Further, the terminal identifies the third-type samples in the data table, for which the data value under the incomplete field name currently to be filled is missing, and determines which of the clusters obtained in step 404 each of these third-type samples belongs to.

Step 408: Calculate the average value, under the incomplete field name, of the samples included in the cluster to which each third-type sample belongs, and use the calculated average value as the missing value to be filled.

Specifically, after determining the cluster to which a third-type sample belongs, the terminal may calculate the average value of all samples in that cluster under the incomplete field name to be filled, and use the calculated average value as the missing value, under that incomplete field name, of the third-type samples falling within the cluster.

In this embodiment, when the missing type is random missing, the samples are clustered first, and a corresponding missing value is then calculated for the samples with missing data values in each cluster. Compared with filling the data values of all samples under the incomplete field name with the same missing value, the filled data values are more accurate.
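The cluster-based filling of steps 402 to 408 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the keys `working_years` and `year_end_award`, the function names, and the sample values are hypothetical, the clusters are simply the working-year categories from the example above, and the cluster average is taken over the samples that already have a value.

```python
from statistics import mean


def bin_working_years(years):
    """Map a working-year value to the illustrative categories of the example."""
    if years <= 2:
        return "1-2"
    if years <= 5:
        return "3-5"
    if years <= 8:
        return "6-8"
    return ">8"


def fill_by_cluster_mean(samples, complete_field, incomplete_field, cluster_of):
    """Return copies of the samples with gaps filled by their cluster's mean."""
    # Collect the existing values of the incomplete field for each cluster.
    observed = {}
    for s in samples:
        if s[incomplete_field] is not None:
            key = cluster_of(s[complete_field])
            observed.setdefault(key, []).append(s[incomplete_field])
    cluster_means = {key: mean(values) for key, values in observed.items()}

    filled = []
    for s in samples:
        s = dict(s)  # do not modify the caller's data
        if s[incomplete_field] is None:
            # Assumption: the gap stays None if the cluster has no observed value.
            s[incomplete_field] = cluster_means.get(cluster_of(s[complete_field]))
        filled.append(s)
    return filled


samples = [
    {"working_years": 1, "year_end_award": 1000.0},
    {"working_years": 2, "year_end_award": None},
    {"working_years": 2, "year_end_award": 2000.0},
    {"working_years": 4, "year_end_award": 5000.0},
    {"working_years": 5, "year_end_award": None},
]
filled = fill_by_cluster_mean(
    samples, "working_years", "year_end_award", bin_working_years
)
```

In this sketch each missing value depends on the cluster of the sample, so two samples with different working years can receive different filled values.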

As shown in FIG. 5, in one of the embodiments, the missing type is random missing; in step 208, calculating the missing value from the existing data values in the data table according to the filling method corresponding to the missing type includes:

Step 502: Determine a first sample set where the data value corresponding to the incomplete field name exists in the data table and a second sample set where the data value corresponding to the incomplete field name is missing;

In this embodiment, when the missing type of the incomplete field name to be filled is random missing, the terminal may also construct a prediction model according to the data values corresponding to the complete field names related to the incomplete field name in the data table, and use the prediction model to predict the data values missing under the incomplete field name. Specifically, the terminal may first divide all the samples in the data table into two types. One type consists of the samples for which the data value corresponding to the incomplete field name currently to be filled exists; the set formed by these samples is called the first sample set. The other type consists of the samples for which that data value is missing; the set formed by these samples is called the second sample set.

Step 504: Construct a prediction model according to the data values corresponding to the complete field names in the first sample set related to the incomplete field names;

Further, the terminal may determine the complete field names related to the incomplete field name currently to be filled, then obtain the data values of all samples in the first sample set under the determined complete field names, and establish a prediction relationship between these data values and the data values corresponding to the incomplete field name.

Step 506: Input the data value corresponding to the complete field name of each sample in the second sample set into the prediction model, and output the predicted value of each sample in the second sample set under the incomplete field name through the prediction model;

Step 508: Use the predicted value as the missing value to be filled.

For example, all samples in the data table are divided into two categories according to whether a data value exists under the incomplete field name m. The first sample set obtained is X = (001, 002, 003, 005, ...), where 001 represents the first sample, 002 represents the second sample, and so on; the second sample set is X' = (004, 006, ...). The set of data values of the samples in the first sample set X under the incomplete field name m is m = (m1, m2, m3, m5, ...); the data value of each sample in the second sample set X' under the incomplete field name m is missing. The complete field names related to the incomplete field name m are determined to include n, p, and q. The data values of each sample in the first sample set X under the complete field names n, p, and q are obtained, and a prediction model is built according to the hidden relationship between n = (n1, n2, n3, n5, ...), p = (p1, p2, p3, p5, ...), q = (q1, q2, q3, q5, ...) and the set m = (m1, m2, m3, m5, ...):

$m = n w_1 + p w_2 + q w_3 + b$, where $w_1$, $w_2$, $w_3$ and $b$ are trainable model parameters.

The model here is only an example, used to indicate that the input of the prediction model is n, p, and q, and the output is m. When constructing the prediction model, the parameters of the model can be adjusted by gradient descent, so that the constructed prediction model fits the samples in the first sample set.

After the prediction model is obtained, the data values of each sample in the second sample set under the complete field names n, p, and q can be input into the prediction model, which outputs the corresponding data value of each sample under the incomplete field name m; the output predicted value can then be filled in as the missing data value. In this way, the missing values of the samples under the incomplete field name m are not all the same, but are closely tied to the data values of the related complete field names, which can improve the accuracy of the missing values to be filled.
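Steps 502 to 508 can be sketched in Python as follows. This is only an illustration under stated assumptions: the linear form m = n·w1 + p·w2 + q·w3 + b is the example model from the text, the loss is assumed to be the mean squared error, and the function names, learning rate, number of iterations, and data values are all hypothetical.

```python
def fit_linear_model(features, targets, learning_rate=0.1, epochs=2000):
    """Fit m = n*w1 + p*w2 + q*w3 + b by batch gradient descent on squared error."""
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    count = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, targets):
            error = sum(w * xi for w, xi in zip(weights, x)) + bias - y
            for j in range(n_features):
                grad_w[j] += 2 * error * x[j] / count
            grad_b += 2 * error / count
        # Move every parameter against its gradient.
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


def predict(weights, bias, x):
    """Apply the fitted linear model to one sample's complete-field values."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


# First sample set: (n, p, q) values and the existing value of m.
# The toy data was generated from m = 2n + p + 3q + 1.
first_set_features = [
    (0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1),
    (1, 1, 0), (0, 1, 1), (1, 0, 1), (1, 1, 1),
]
first_set_targets = [1, 3, 2, 4, 4, 5, 6, 7]
weights, bias = fit_linear_model(first_set_features, first_set_targets)

# Second sample set: m is missing, so it is predicted from n, p and q.
second_set_features = [(2, 1, 0)]
predicted = [predict(weights, bias, x) for x in second_set_features]
```

Any other regression model with the same inputs and output could stand in for the linear form; the text itself presents the formula only as an example.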

In a specific embodiment, the data table filling method specifically includes the following steps:

Get the data table uploaded by the user.

Determine the incomplete field name where the data value is missing in the data table.

Count the mean and standard deviation corresponding to each field name in the data table.

Based on the mean and standard deviation, the correlation between any two field names is calculated according to the following formula:

$$\rho_{(X,Y)} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$$

where $\rho_{(X,Y)}$ represents the correlation between the field name X and the field name Y; $\mu_X$ represents the mean value corresponding to the field name X; $\mu_Y$ represents the mean value corresponding to the field name Y; $\sigma_X$ represents the standard deviation corresponding to the field name X; $\sigma_Y$ represents the standard deviation corresponding to the field name Y; and $E[(X-\mu_X)(Y-\mu_Y)]$ is the expected value of Z, with $Z=(X-\mu_X)(Y-\mu_Y)$.
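The quantities defined here (the two means, the two standard deviations, and the expected value of Z) are the ingredients of the Pearson correlation coefficient. A minimal Python sketch of that calculation follows. The function name is hypothetical, and two choices are assumptions not stated in the text: only samples in which both field names have a value are used, and a field with zero standard deviation is given a correlation of 0.

```python
from math import sqrt


def pearson_correlation(xs, ys):
    """Correlation of two columns; None marks a missing data value."""
    # Assumption: keep only the samples where both fields have a value.
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    count = len(pairs)
    mu_x = sum(x for x, _ in pairs) / count
    mu_y = sum(y for _, y in pairs) / count
    # E[(X - mu_X)(Y - mu_Y)], the expected value of Z.
    expected_z = sum((x - mu_x) * (y - mu_y) for x, y in pairs) / count
    sigma_x = sqrt(sum((x - mu_x) ** 2 for x, _ in pairs) / count)
    sigma_y = sqrt(sum((y - mu_y) ** 2 for _, y in pairs) / count)
    if sigma_x == 0 or sigma_y == 0:
        return 0.0  # assumption: a constant column is treated as uncorrelated
    return expected_z / (sigma_x * sigma_y)


rho_up = pearson_correlation([1, 2, 3, 4], [2, 4, 6, 8])
rho_down = pearson_correlation([1, 2, 3], [3, 2, 1])
rho_gap = pearson_correlation([1, 2, None, 4], [2, 4, 6, 8])
```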

When the correlation between the incomplete field name and other field names in the data table is less than the first preset value, it is determined that the missing type of the incomplete field name is completely random missing.

When the correlation between the incomplete field name and at least one complete field name in the data table is greater than the second preset value, it is determined that the missing type of the incomplete field name is random missing.

When the correlation between the incomplete field name and at least one incomplete field name in the data table is greater than the third preset value, it is determined that the missing type of the incomplete field name is non-random missing.

When the missing type is completely random missing and the data value type corresponding to the incomplete field name is character type, the median is calculated from the existing data values of the incomplete field name and used as the missing value corresponding to the incomplete field name; or, the mode is calculated from the existing data values of the incomplete field name and used as the missing value corresponding to the incomplete field name.

When the missing type is completely random missing and the data value type corresponding to the incomplete field name is numeric, the average is calculated from the existing data values of the incomplete field name and used as the missing value corresponding to the incomplete field name; or,

When the missing type is completely random missing, determine the first-type samples in the data table, for which the data value corresponding to the incomplete field name is missing; determine the second-type samples in the data table, for which the data value corresponding to the incomplete field name exists; count the number of first-type samples; calculate the ratio of this number to the total number of samples; when the ratio is greater than the threshold, replace the data value of the first-type samples under the incomplete field name with the first value, and replace the data value of the second-type samples under the incomplete field name with the second value.

When the missing type is random missing, determine the complete field name related to the incomplete field name; cluster the samples in the data table according to the data values of the complete field name to obtain clusters; determine the third-type samples in the data table, for which the data value corresponding to the incomplete field name is missing; calculate the average value, under the incomplete field name, of the samples included in the cluster to which each third-type sample belongs, and use the calculated average value as the missing value to be filled; or,

When the missing type is random missing, determine a first sample set in the data table, in which the data value corresponding to the incomplete field name exists, and a second sample set, in which that data value is missing; construct a prediction model according to the data values, in the first sample set, corresponding to the complete field names related to the incomplete field name; input the data values corresponding to the complete field names of each sample in the second sample set into the prediction model, and output, through the prediction model, the predicted value of each sample in the second sample set under the incomplete field name; use the predicted value as the missing value to be filled.

Fill in missing data values for incomplete field names based on missing values.

It should be understood that although the steps in the flowcharts of FIGS. 2 to 5 are displayed in the order indicated by the arrows, the steps are not necessarily executed in that order. Unless clearly stated herein, the execution of these steps is not strictly limited in order, and these steps can be executed in other orders. Moreover, at least some of the steps in FIGS. 2 to 5 may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily executed at the same time, but may be executed at different times. Nor are these sub-steps or stages necessarily executed sequentially; they may be executed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.

In one embodiment, as shown in FIG. 6, a data table filling device 600 is provided, including: a data table acquisition module 602, an incomplete field name determination module 604, a missing type determination module 606, a missing value calculation module 608, and a filling module 610, where:

The data table obtaining module 602 is used to obtain the data table uploaded by the user;

The incomplete field name determination module 604 is used to determine the incomplete field name in the data table, and the incomplete field name lacks the data value;

The missing type determination module 606 is used to determine the missing type of the incomplete field name according to the correlation between the incomplete field name and other field names in the data table;

The missing value calculation module 608 is used to calculate the missing value according to the filling method corresponding to the missing type according to the existing data value in the data table;

The filling module 610 is used to fill in the missing data value of the incomplete field name according to the missing value.
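How the five modules hand data to one another can be pictured with the following Python sketch. The class, its method, and the toy stand-in functions are hypothetical; in particular, the stand-ins treat every gap as completely random missing and compute a single mean per field, which is a simplification of the filling methods described in this application.

```python
from statistics import mean


class DataTableFillingDevice:
    """Chains five callables that stand in for modules 602 to 610."""

    def __init__(self, acquire, find_incomplete, determine_type, calculate, fill):
        self.acquire = acquire                  # data table acquisition module 602
        self.find_incomplete = find_incomplete  # incomplete field name determination module 604
        self.determine_type = determine_type    # missing type determination module 606
        self.calculate = calculate              # missing value calculation module 608
        self.fill = fill                        # filling module 610

    def run(self, uploaded):
        table = self.acquire(uploaded)
        for field in self.find_incomplete(table):
            missing_type = self.determine_type(table, field)
            missing_value = self.calculate(table, field, missing_type)
            table = self.fill(table, field, missing_value)
        return table


# Toy stand-ins (hypothetical): rows are dicts and None marks a missing value.
device = DataTableFillingDevice(
    acquire=lambda rows: [dict(r) for r in rows],
    find_incomplete=lambda t: [k for k in t[0] if any(r[k] is None for r in t)],
    determine_type=lambda t, f: "completely random missing",
    calculate=lambda t, f, kind: mean(r[f] for r in t if r[f] is not None),
    fill=lambda t, f, v: [{**r, f: v if r[f] is None else r[f]} for r in t],
)
result = device.run([{"age": 30, "bonus": 1000.0}, {"age": 40, "bonus": None}])
```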

In one of the embodiments, the missing type determination module 606 is also used to count the number of data values corresponding to each field name in the data table; determine the total number of samples corresponding to the data table; and, when the number is less than the total number of samples, determine the field name to be an incomplete field name.
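The counting rule in this embodiment can be sketched in a few lines of Python. The representation is an assumption: each sample is a dict keyed by field name, and a missing data value is either `None` or an absent key; the function name and sample data are hypothetical.

```python
def find_incomplete_field_names(samples):
    """Return the field names whose count of data values is below the sample total."""
    total = len(samples)
    counts = {}
    for sample in samples:
        for name, value in sample.items():
            counts.setdefault(name, 0)
            if value is not None:
                counts[name] += 1
    # A field name is incomplete when it has fewer data values than samples.
    return [name for name, number in counts.items() if number < total]


samples = [
    {"age": 30, "bonus": None, "city": "A"},
    {"age": 41, "bonus": 2000.0, "city": "B"},
    {"age": 25, "bonus": 1500.0},
]
incomplete = find_incomplete_field_names(samples)
```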

In one of the embodiments, the missing type determination module 606 is further used to: when the correlation between the incomplete field name and other field names in the data table is less than the first preset value, determine that the missing type of the incomplete field name is completely random missing; when the correlation between the incomplete field name and at least one complete field name in the data table is greater than the second preset value, determine that the missing type of the incomplete field name is random missing; and when the correlation between the incomplete field name and at least one incomplete field name in the data table is greater than the third preset value, determine that the missing type of the incomplete field name is non-random missing.
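The three threshold rules can be sketched as a small Python function. Several points are assumptions rather than statements of the application: the preset values shown are placeholders, the absolute value of the correlation is compared, the rules are checked in the order in which the text lists them, and a field that satisfies none of them is reported as "undetermined".

```python
def classify_missing_type(corr_with_complete, corr_with_incomplete,
                          first_preset=0.2, second_preset=0.5, third_preset=0.5):
    """Classify one incomplete field name from its correlations with other fields.

    Both arguments map a field name to its correlation with the incomplete field.
    """
    complete = [abs(c) for c in corr_with_complete.values()]
    incomplete = [abs(c) for c in corr_with_incomplete.values()]
    if all(c < first_preset for c in complete + incomplete):
        return "completely random missing"
    if any(c > second_preset for c in complete):
        return "random missing"
    if any(c > third_preset for c in incomplete):
        return "non-random missing"
    return "undetermined"  # assumption: no rule of the text applies


kind_a = classify_missing_type({"age": 0.05}, {"overtime": 0.1})
kind_b = classify_missing_type({"working_years": 0.8}, {})
kind_c = classify_missing_type({"age": 0.3}, {"overtime": -0.7})
kind_d = classify_missing_type({"age": 0.3}, {})
```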

In one of the embodiments, the missing type is completely random missing; the missing value calculation module 608 is also used to: when the data value type corresponding to the incomplete field name is character type, calculate the median from the existing data values of the incomplete field name and use it as the missing value corresponding to the incomplete field name, or calculate the mode from the existing data values of the incomplete field name and use it as the missing value corresponding to the incomplete field name; and when the data value type corresponding to the incomplete field name is numeric, calculate the average from the existing data values of the incomplete field name and use it as the missing value corresponding to the incomplete field name.

In one of the embodiments, the missing type is completely random missing; the missing value calculation module 608 is also used to determine the first-type samples in the data table, for which the data value corresponding to the incomplete field name is missing; determine the second-type samples in the data table, for which the data value corresponding to the incomplete field name exists; count the number of first-type samples; calculate the ratio of this number to the total number of samples; and, when the ratio is greater than the threshold, replace the data value of the first-type samples under the incomplete field name with the first value and replace the data value of the second-type samples under the incomplete field name with the second value.
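The two filling methods for completely random missing described above can be sketched in Python as follows. The function names, the threshold, and the first and second values are hypothetical placeholders. For character-type data only the mode option is shown, since a median would additionally require an ordering of the values, and the first function assumes that at least one data value exists.

```python
from collections import Counter
from statistics import mean


def fill_with_statistic(values):
    """Fill None entries with the mode (character type) or the average (numeric)."""
    observed = [v for v in values if v is not None]
    if all(isinstance(v, str) for v in observed):
        fill = Counter(observed).most_common(1)[0][0]  # mode of the existing values
    else:
        fill = mean(observed)
    return [fill if v is None else v for v in values]


def fill_with_indicator(values, threshold=0.5, first_value=0, second_value=1):
    """Replace the whole column by two values when too many entries are missing."""
    missing_count = sum(1 for v in values if v is None)  # first-type samples
    ratio = missing_count / len(values)
    if ratio <= threshold:
        return list(values)  # rule does not apply; the column is left as it is
    # First-type samples get first_value, second-type samples get second_value.
    return [first_value if v is None else second_value for v in values]


grades = fill_with_statistic(["A", None, "B", "A"])
salaries = fill_with_statistic([10.0, None, 20.0])
flags = fill_with_indicator([None, None, None, 5])
unchanged = fill_with_indicator([None, 1, 2, 3])
```

When the ratio exceeds the threshold, the column effectively becomes a two-valued indicator of whether a data value was originally present.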

In one of the embodiments, the missing type is random missing; the missing value calculation module 608 is also used to determine the complete field name related to the incomplete field name; cluster the samples in the data table according to the data values of the complete field name to obtain clusters; determine the third-type samples in the data table, for which the data value corresponding to the incomplete field name is missing; and calculate the average value, under the incomplete field name, of the samples included in the cluster to which each third-type sample belongs, and use the calculated average value as the missing value to be filled.

In one of the embodiments, the missing type is random missing; the missing value calculation module 608 is further used to determine a first sample set in the data table, in which the data value corresponding to the incomplete field name exists, and a second sample set, in which that data value is missing; construct a prediction model according to the data values, in the first sample set, corresponding to the complete field names related to the incomplete field name; input the data values corresponding to the complete field names of each sample in the second sample set into the prediction model, and output, through the prediction model, the predicted value of each sample in the second sample set under the incomplete field name; and use the predicted value as the missing value to be filled.

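In one of the embodiments, the data table filling device 600 further includes a correlation calculation module; the correlation calculation module is used to count the mean and standard deviation corresponding to each field name in the data table and, according to the mean and standard deviation, calculate the correlation between any two field names according to the following formula:

$$\rho_{(X,Y)} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$$

where $\rho_{(X,Y)}$ represents the correlation between the field name X and the field name Y; $\mu_X$ and $\mu_Y$ represent the mean values corresponding to the field names X and Y; $\sigma_X$ and $\sigma_Y$ represent the standard deviations corresponding to the field names X and Y; and $E[(X-\mu_X)(Y-\mu_Y)]$ is the expected value of $Z=(X-\mu_X)(Y-\mu_Y)$.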

The above data table filling device 600, upon acquiring the data table uploaded by the user, determines the incomplete field names in the data table for which data values are missing, determines the missing type of each incomplete field name according to the correlation between the incomplete field name and other field names in the data table, and then calculates the missing value corresponding to the incomplete field name from the existing data values in the data table, according to the filling method corresponding to that missing type. The missing value is used to fill in the missing data values of the incomplete field name. Following these steps, the missing data values of every incomplete field name in the data table can be filled, so that the data table is effectively completed and the accuracy of analysis based on it is also significantly improved.

For the specific limitations of the data table filling device 600, reference may be made to the limitations of the data table filling method above, which are not repeated here. Each module in the above data table filling device 600 may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, the processor of the computer device in hardware form, or may be stored in the memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to the above modules.

In one embodiment, a computer device is provided. The computer device may be a terminal, and an internal structure diagram thereof may be as shown in FIG. 7. The computer equipment includes a processor, a memory, a network interface, and an input device connected through a system bus. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and computer-readable instructions. The internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium. The network interface of the computer device is used to communicate with external terminals through a network connection. The computer-readable instructions are executed by the processor to implement a data table filling method. The input device of the computer device may be a touch layer covered on the display screen, or may be a button, a trackball, or a touch pad provided on the computer device shell, or an external keyboard, touch pad, or mouse.

Those skilled in the art can understand that the structure shown in FIG. 7 is only a block diagram of part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.

In one embodiment, the data table filling device provided by the present application may be implemented in the form of computer-readable instructions, and the computer-readable instructions may run on a computer device as shown in FIG. 7. The memory of the computer device may store the program modules constituting the data table filling device 600, for example, the data table acquisition module 602, the incomplete field name determination module 604, the missing type determination module 606, the missing value calculation module 608, and the filling module 610 shown in FIG. 6. The computer-readable instructions formed by these program modules cause the processor to execute the steps of the data table filling method of each embodiment described in this specification.

For example, the computer device shown in FIG. 7 may execute step S202 through the data table acquisition module in the data table filling apparatus 600 shown in FIG. 6. The computer device may execute step S204 through the incomplete field name determination module. The computer device may execute step S206 through the missing type determination module. The computer device may execute step S208 through the missing value calculation module. The computer device may execute step S210 through the filling module.

In one embodiment, a computer device is provided, which includes a memory and one or more processors. The memory stores computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the steps of the above data table filling method. Here, the steps of the data table filling method may be the steps in the data table filling methods of the foregoing embodiments.

In one embodiment, one or more non-volatile computer-readable storage media storing computer-readable instructions are provided. When the computer-readable instructions are executed by one or more processors, they cause the one or more processors to perform the steps of the above data table filling method. Here, the steps of the data table filling method may be the steps in the data table filling methods of the foregoing embodiments.

A person of ordinary skill in the art may understand that all or part of the processes in the methods of the above embodiments may be completed by a computer program instructing relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the foregoing method embodiments. Any reference to memory, storage, database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the above embodiments can be combined arbitrarily. To keep the description concise, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in a combination of these technical features, it should be considered within the scope described in this specification.

The above-mentioned embodiments only express several implementations of the present application, and their descriptions are more specific and detailed, but they should not be construed as limiting the scope of the invention patent. It should be noted that, for those of ordinary skill in the art, without departing from the concept of the present application, a number of modifications and improvements can also be made, which all fall within the protection scope of the present application. Therefore, the protection scope of the patent of this application shall be subject to the appended claims.

Claims

A data table filling method, including:

Obtaining a data table uploaded by a user;

Determining an incomplete field name in the data table, the incomplete field name being a field name for which a data value is missing;

Determining the missing type of the incomplete field name according to the correlation between the incomplete field name and other field names in the data table;

Calculating a missing value from the existing data values in the data table according to the filling method corresponding to the missing type; and filling in the missing data value of the incomplete field name according to the missing value.
The method according to claim 1, wherein the determining the missing type of the incomplete field name according to the correlation between the incomplete field name and other field names in the data table includes:

When the correlation between the incomplete field name and other field names in the data table is less than the first preset value, it is determined that the missing type of the incomplete field name is completely random missing;

When the degree of correlation between the incomplete field name and at least one complete field name in the data table is greater than a second preset value, it is determined that the missing type of the incomplete field name is random missing; and

When the correlation between the incomplete field name and at least one incomplete field name in the data table is greater than a third preset value, it is determined that the missing type of the incomplete field name is non-random missing.
The method according to claim 1, wherein the missing type is completely random missing; the calculating the missing value according to the existing data value in the data table according to the filling method corresponding to the missing type includes:

When the type of the data value corresponding to the incomplete field name is character type, the median is calculated from the existing data values of the incomplete field name and used as the missing value corresponding to the incomplete field name; or, the mode is calculated from the existing data values of the incomplete field name and used as the missing value corresponding to the incomplete field name; and

When the type of the data value corresponding to the incomplete field name is numeric, the average is calculated from the existing data values of the incomplete field name and used as the missing value corresponding to the incomplete field name.
The method according to claim 1, wherein the missing type is completely random missing; the calculating the missing value according to the existing data value in the data table according to the filling method corresponding to the missing type includes:

Determining first-type samples in the data table, for which the data value corresponding to the incomplete field name is missing;

Determining second-type samples in the data table, for which the data value corresponding to the incomplete field name exists;

Counting the number of the first-type samples;

Calculating the ratio of the number of the first-type samples to the total number of samples; and

When the ratio is greater than the threshold, replacing the data value of the first-type samples under the incomplete field name with a first value, and replacing the data value of the second-type samples under the incomplete field name with a second value.
The method according to claim 1, wherein the missing type is random missing; and the calculating the missing value according to the existing data value in the data table according to the filling method corresponding to the missing type includes:

Determining the complete field name related to the incomplete field name;

Clustering the samples in the data table according to the data values of the complete field name to obtain clusters;

Determining third-type samples in the data table, for which the data value corresponding to the incomplete field name is missing; and

Calculating the average value, under the incomplete field name, of the samples included in the cluster to which each third-type sample belongs, and using the calculated average value as the missing value to be filled.
The method according to claim 1, wherein the missing type is random missing; and the calculating the missing value according to the existing data value in the data table according to the filling method corresponding to the missing type includes:

Determining a first sample set in the data table, in which the data value corresponding to the incomplete field name exists, and a second sample set, in which the data value corresponding to the incomplete field name is missing;

Constructing a prediction model according to the data values, in the first sample set, corresponding to the complete field names related to the incomplete field name;

Inputting the data values corresponding to the complete field names of each sample in the second sample set into the prediction model, and outputting, through the prediction model, the predicted value of each sample in the second sample set under the incomplete field name; and

Use the predicted value as the missing value to be filled.
The method according to any one of claims 1 to 6, further comprising:

Count the mean and standard deviation corresponding to each field name in the data table; and

According to the mean and standard deviation, the correlation between any two field names is calculated according to the following formula:

$$\rho_{(X,Y)} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$$

Where $\rho_{(X,Y)}$ represents the correlation between the field name X and the field name Y; $\mu_X$ represents the mean value corresponding to the field name X; $\mu_Y$ represents the mean value corresponding to the field name Y; $\sigma_X$ represents the standard deviation corresponding to the field name X; $\sigma_Y$ represents the standard deviation corresponding to the field name Y; and $E[(X-\mu_X)(Y-\mu_Y)]$ is the expected value of Z, with $Z=(X-\mu_X)(Y-\mu_Y)$.
A data table filling device, including:

The data table acquisition module is used to obtain the data table uploaded by the user;

An incomplete field name determination module, configured to determine an incomplete field name in the data table, the incomplete field name is missing a data value;

A missing type determining module, configured to determine the missing type of the incomplete field name according to the correlation between the incomplete field name and other field names in the data table;

A missing value calculation module, configured to calculate the missing value according to the existing data value in the data table and according to the filling method corresponding to the missing type; and

The padding module is used to fill in the missing data value of the incomplete field name according to the missing value.
The apparatus according to claim 8, wherein the missing type determination module is further used when the correlation between the incomplete field name and other field names in the data table is less than the first preset Value, it is determined that the missing type of the incomplete field name is completely random missing; when the correlation between the incomplete field name and at least one complete field name in the data table is greater than the second preset value, Determining that the missing type of the incomplete field name is random missing; and when the correlation between the incomplete field name and at least one incomplete field name in the data table is greater than a third preset value, it is determined The missing type of the incomplete field name is non-random missing.
The apparatus according to claim 8, wherein the missing value calculation module is further configured to, when the data value type corresponding to the incomplete field name is character type, based on the incomplete field name The corresponding median of the data value of is counted, and the median is counted as the missing value corresponding to the incomplete field name; or, the corresponding mode is counted according to the existing data value of the incomplete field name, Using the counted mode as the missing value corresponding to the incomplete field name; and when the data value type corresponding to the incomplete field name is numeric, the existing data value according to the incomplete field name is used Count the corresponding average number, and use the averaged number as the missing value corresponding to the incomplete field name.
The device according to claim 8, wherein the missing type is completely random missing; the missing value calculation module is further used to determine that the data value corresponding to the incomplete field name is missing from the data table Samples of the first type; determine the samples of the second type that exist in the data values corresponding to the incomplete field names in the data table; count the number of samples of the first type of samples; The ratio of; and when the ratio is greater than the threshold, the data value of the first type of sample under the incomplete field name is replaced by the first value; the second type of sample is in the incomplete field The data value under the name is replaced with the second value.
The apparatus according to claim 8, wherein the missing type is random missing, and the missing value calculation module is further configured to: determine a complete field name related to the incomplete field name; cluster the samples in the data table according to the data values of the complete field name to obtain clusters; determine third-type samples in the data table, for which the data value corresponding to the incomplete field name is missing; and calculate the average value, under the incomplete field name, of the samples included in the cluster to which each third-type sample belongs, and use the calculated average value as the missing value to be filled.
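The cluster-based filling can be illustrated with the sketch below. The claim does not name a clustering algorithm; a basic one-dimensional k-means over a single related complete field is assumed here purely for illustration, and the function names and the default number of clusters are hypothetical.

```python
from statistics import mean
from typing import List, Optional


def kmeans_1d(values: List[float], k: int = 2, iterations: int = 20) -> List[int]:
    """Assign each value to one of k clusters with a basic 1-D k-means."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers across the sorted values.
    centers = [ordered[(i * (len(ordered) - 1)) // max(k - 1, 1)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assign every value to its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = mean(members)
    return labels


def cluster_fill(complete: List[float],
                 incomplete: List[Optional[float]],
                 k: int = 2) -> List[float]:
    """Fill None entries with the mean of existing values in the same cluster."""
    labels = kmeans_1d(complete, k)
    filled = []
    for value, label in zip(incomplete, labels):
        if value is not None:
            filled.append(value)
            continue
        # Existing values of the samples that share this sample's cluster.
        peers = [v for v, lab in zip(incomplete, labels)
                 if lab == label and v is not None]
        filled.append(mean(peers))  # raises StatisticsError if the cluster has none
    return filled
```

Here `complete` holds the data values of the related complete field and `incomplete` holds the data values of the incomplete field for the same samples, in the same order.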
The apparatus according to claim 8, wherein the missing type is random missing, and the missing value calculation module is further configured to: determine, in the data table, a first sample set in which the data value corresponding to the incomplete field name exists and a second sample set in which the data value corresponding to the incomplete field name is missing; construct a prediction model based on the data values in the first sample set corresponding to the complete field names related to the incomplete field name; input the data values corresponding to the complete field names of each sample in the second sample set into the prediction model, and output, through the prediction model, the predicted value of each sample in the second sample set under the incomplete field name; and use the predicted value as the missing value to be filled.
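The prediction-based filling can be sketched as below. The claim does not specify the form of the prediction model; ordinary least squares on a single related complete field is assumed here only as an illustration, and the function names are hypothetical.

```python
from statistics import mean
from typing import List, Optional, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx  # ZeroDivisionError if all x values are identical
    return slope, y_bar - slope * x_bar


def predict_fill(complete: List[float],
                 incomplete: List[Optional[float]]) -> List[float]:
    """Fit on samples whose value exists, then predict the missing ones."""
    # First sample set: the incomplete field has a data value.
    first = [(x, y) for x, y in zip(complete, incomplete) if y is not None]
    slope, intercept = fit_line([x for x, _ in first], [y for _, y in first])
    # Second sample set: the value is missing, so use the model's prediction.
    return [slope * x + intercept if y is None else y
            for x, y in zip(complete, incomplete)]
```

With several related complete field names, the same two-set structure would apply with a multivariate model in place of the single-variable line.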
The apparatus according to any one of claims 8 to 13, wherein the apparatus further comprises a correlation calculation module configured to: compute the mean and standard deviation corresponding to each field name in the data table; and, from the means and standard deviations, calculate the correlation between any two field names according to the following formula:

$$\rho(X, Y) = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$$

where $\rho(X, Y)$ represents the correlation between field name X and field name Y; $\mu_X$ represents the mean corresponding to field name X; $\mu_Y$ represents the mean corresponding to field name Y; $\sigma_X$ represents the standard deviation corresponding to field name X; $\sigma_Y$ represents the standard deviation corresponding to field name Y; and $E[(X - \mu_X)(Y - \mu_Y)]$ is the expected value of $Z$, where $Z = (X - \mu_X)(Y - \mu_Y)$.
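The quantities defined in this claim (the two means, the two standard deviations, and the expected value of Z) combine into the Pearson correlation coefficient, which divides E[(X-μX)(Y-μY)] by the product σX·σY. A minimal sketch follows, assuming both fields are numeric lists of equal length with no missing entries and that population (not sample) statistics are intended; the function name is hypothetical.

```python
from math import sqrt
from typing import List


def correlation(xs: List[float], ys: List[float]) -> float:
    """Correlation between two fields from their means and standard deviations."""
    n = len(xs)
    mu_x = sum(xs) / n
    mu_y = sum(ys) / n
    # Population standard deviations of each field.
    sigma_x = sqrt(sum((x - mu_x) ** 2 for x in xs) / n)
    sigma_y = sqrt(sum((y - mu_y) ** 2 for y in ys) / n)
    # Expected value of Z = (X - mu_x)(Y - mu_y).
    expected_z = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / n
    # ZeroDivisionError if either field is constant.
    return expected_z / (sigma_x * sigma_y)
```

The result lies between -1 and 1, so preset values for classifying the missing type would be chosen within that range.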
A computer device, comprising a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps:

obtaining a data table uploaded by a user;

determining an incomplete field name in the data table, the incomplete field name missing a data value;

determining the missing type of the incomplete field name according to the correlation between the incomplete field name and other field names in the data table;

calculating a missing value from the existing data values in the data table, using the filling method corresponding to the missing type; and

filling the missing data value of the incomplete field name according to the missing value.
The computer device according to claim 15, wherein the one or more processors further perform the following steps when executing the computer-readable instructions:

when the correlation between the incomplete field name and the other field names in the data table is less than a first preset value, determining that the missing type of the incomplete field name is completely random missing;

when the correlation between the incomplete field name and at least one complete field name in the data table is greater than a second preset value, determining that the missing type of the incomplete field name is random missing; and

when the correlation between the incomplete field name and at least one incomplete field name in the data table is greater than a third preset value, determining that the missing type of the incomplete field name is non-random missing.
The computer device according to claim 15, wherein the missing type is completely random missing, and the one or more processors further perform the following steps when executing the computer-readable instructions:

when the data value type corresponding to the incomplete field name is character type, computing the median of the existing data values of the incomplete field name and using the computed median as the missing value corresponding to the incomplete field name, or computing the mode of the existing data values of the incomplete field name and using the computed mode as the missing value corresponding to the incomplete field name; and

when the data value type corresponding to the incomplete field name is numeric, computing the average of the existing data values of the incomplete field name and using the computed average as the missing value corresponding to the incomplete field name.
One or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:

obtaining a data table uploaded by a user;

determining an incomplete field name in the data table, the incomplete field name missing a data value;

determining the missing type of the incomplete field name according to the correlation between the incomplete field name and other field names in the data table;

calculating a missing value from the existing data values in the data table, using the filling method corresponding to the missing type; and

filling the missing data value of the incomplete field name according to the missing value.
The storage medium according to claim 18, wherein the computer-readable instructions, when executed by the one or more processors, further cause the following steps to be performed:

when the correlation between the incomplete field name and the other field names in the data table is less than a first preset value, determining that the missing type of the incomplete field name is completely random missing;

when the correlation between the incomplete field name and at least one complete field name in the data table is greater than a second preset value, determining that the missing type of the incomplete field name is random missing; and

when the correlation between the incomplete field name and at least one incomplete field name in the data table is greater than a third preset value, determining that the missing type of the incomplete field name is non-random missing.
The storage medium according to claim 18, wherein the missing type is completely random missing, and the computer-readable instructions, when executed by the one or more processors, further cause the following steps to be performed:

when the data value type corresponding to the incomplete field name is character type, computing the median of the existing data values of the incomplete field name and using the computed median as the missing value corresponding to the incomplete field name, or computing the mode of the existing data values of the incomplete field name and using the computed mode as the missing value corresponding to the incomplete field name; and

when the data value type corresponding to the incomplete field name is numeric, computing the average of the existing data values of the incomplete field name and using the computed average as the missing value corresponding to the incomplete field name.