WO2020140662A1 - Data table filling method, apparatus, computer device, and storage medium - Google Patents

Data table filling method, apparatus, computer device, and storage medium

Info

Publication number
WO2020140662A1
Authority
WO
WIPO (PCT)
Prior art keywords
field name
missing
value
incomplete
type
Prior art date
Application number
PCT/CN2019/122323
Other languages
French (fr)
Chinese (zh)
Inventor
蔡健
杨镭
黄北辰
郭凌峰
付晓
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2020140662A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions

Definitions

  • the present application relates to a data table filling method, device, computer equipment and storage medium.
  • Report data is the data in a data table and is one of the most common forms of data in practical applications. It can be used for data analysis or for generating reports for users, for example loan business data, human resource data, and insurance business data. However, report data inevitably ends up with missing data values because of improper operation, system failure, human factors, and so on.
  • Missing data values interfere with the distribution of the report data in the table and affect the accuracy of data analysis.
  • a data table filling method, device, computer device, and storage medium are provided.
  • a data table filling method includes:
  • the incomplete field name is missing a data value
  • the missing data value of the incomplete field name is filled according to the missing value.
  • a data table filling device includes:
  • the data table acquisition module is used to obtain the data table uploaded by the user
  • An incomplete field name determination module configured to determine an incomplete field name in the data table, the incomplete field name is missing a data value
  • a missing type determining module configured to determine the missing type of the incomplete field name according to the correlation between the incomplete field name and other field names in the data table;
  • a missing value calculation module configured to calculate the missing value according to the existing data value in the data table and according to the filling method corresponding to the missing type
  • a filling module configured to fill in the missing data value of the incomplete field name according to the missing value.
  • a computer device includes a memory and one or more processors.
  • the memory stores computer-readable instructions.
  • when the computer-readable instructions are executed by the one or more processors, the one or more processors perform the following steps:
  • the incomplete field name is missing a data value
  • the missing data value of the incomplete field name is filled according to the missing value.
  • One or more non-volatile computer-readable storage media storing computer-readable instructions.
  • when the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:
  • the incomplete field name is missing a data value
  • the missing data value of the incomplete field name is filled according to the missing value.
  • FIG. 1 is an application scenario diagram of a data table filling method according to one or more embodiments.
  • FIG. 2 is a schematic flowchart of a data table filling method according to one or more embodiments.
  • FIG. 3 is a schematic flowchart of the step of calculating the missing value from the existing data values in the data table using the filling method corresponding to the missing type, according to one or more embodiments.
  • FIG. 4 is a schematic flowchart of the step of calculating the missing value from the existing data values in the data table using the filling method corresponding to the missing type, according to another one or more embodiments.
  • FIG. 5 is a schematic flowchart of the step of calculating the missing value from the existing data values in the data table using the filling method corresponding to the missing type, according to yet another embodiment.
  • FIG. 6 is a block diagram of a data table filling device according to one or more embodiments.
  • FIG. 7 is a block diagram of a computer device according to one or more embodiments.
  • the data table filling method provided by this application can be applied in the application environment shown in FIG. 1.
  • the terminal 102 communicates with the server 104 through the network.
  • the terminal 102 can obtain the data table uploaded by the user and send it to the server 104; the server 104 calculates the correlation between the field names included in the data table and feeds back the correlation between any two field names to the terminal 102.
  • the terminal 102 determines the missing type of the incomplete field name according to the correlation between the incomplete field name and other field names in the data table.
  • the terminal 102 may further calculate the missing value according to the filling method corresponding to the missing type of the incomplete field name according to the existing data value in the data table, and fill in the missing data value of the incomplete field name according to the missing value.
  • Data table is sent to the server 104.
  • the terminal 102 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
  • the server 104 may be implemented by an independent server or a server cluster composed of multiple servers.
  • a data table filling method is provided.
  • taking the application of the method to the terminal in FIG. 1 as an example, the method includes the following steps:
  • Step 202 Obtain the data table uploaded by the user.
  • a data table is a structured data table, such as a CSV (Comma-Separated Values) format.
  • the CSV data table stores the table data in plain text.
  • the stored table data includes numeric and character types.
  • a web interface may be provided, and the user uploads the data table through the web interface, and the terminal may obtain the data table uploaded by the user.
  • each user needs to generate a data table containing report data according to a preset file format or table template, so that the terminal can parse out the table structure information of the uploaded data table.
  • Table 1 is a schematic example of a CSV-format data table uploaded in an embodiment.
  • the elements in each row of the data table are separated by commas, and each element in the first row is the column name of its column, also called a header or field name of the data table.
  • the remaining elements in a column are the data values corresponding to that field name, so one field name corresponds to multiple data values.
  • the data in each row represents a sample in the data table, and four samples are shown in Table 1 above.
  • Step 204 Determine the incomplete field name in the data table.
  • the incomplete field name is missing a data value.
  • the incomplete field name is the field name where the data value is missing in the data table
  • the complete field name is the field name where the data value is not missing in the data table.
  • field names that belong to incomplete field names include: education, loan amount, and field names that belong to complete field names include: name, gender, age, region, loan time, and ID number.
  • the terminal may determine each field name in the data table that is missing a data value, that is, each incomplete field name.
  • determining the incomplete field names in the data table includes: counting the number of data values corresponding to each field name in the data table; determining the total number of samples corresponding to the data table; and, when the number is less than the total number of samples, determining that the field name is an incomplete field name.
  • the terminal may count the number of data values corresponding to each field name and the total number of samples included in the data table. When the number of data values corresponding to a field name is less than the total number of samples, the field name is missing a data value, and it is determined to be an incomplete field name.
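The counting rule above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the CSV content is invented sample data, the function name `find_incomplete_fields` is hypothetical, and treating an empty cell as a missing data value is an assumption.

```python
import csv
import io

# Hypothetical sample data; the values are invented for illustration only.
CSV_TEXT = """name,age,education,loan_amount
A,30,Master,500000
B,41,,
C,25,Undergraduate,200000
D,37,Junior College,
"""


def find_incomplete_fields(csv_text):
    """Return the field names whose number of data values is below the sample total."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    total_samples = len(rows)
    incomplete = []
    for field in reader.fieldnames or []:
        # Assumption: an empty or absent cell counts as a missing data value.
        present = sum(1 for row in rows if (row.get(field) or "").strip())
        if present < total_samples:
            incomplete.append(field)
    return incomplete


print(find_incomplete_fields(CSV_TEXT))
```

Fields are reported in header order, so the result can be iterated to fill one incomplete field name at a time.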
  • Step 206 Determine the missing type of the incomplete field name according to the correlation between the incomplete field name and other field names in the data table.
  • the degree of relevance can represent an implicit connection between two field names.
  • for example, in Table 1 above, for the "region" where the borrower is located, Beijing, Shenzhen, and Shanghai generally have higher house prices, so the "loan amount" there is also generally higher than in other regions. This indicates an implicit connection between the field name "region" and the field name "loan amount".
  • the missing type describes the possible connection between the field name whose data value is missing and the other field names. Determining the missing type of an incomplete field name makes it possible to fill in the missing data values with the corresponding filling method. Missing types include completely random missing, random missing, and non-random missing. It should be noted that the missing type corresponding to an incomplete field name may be both random missing and non-random missing; in that case the terminal may calculate the missing value corresponding to the incomplete field name using either corresponding filling method as needed.
  • the terminal may calculate the correlation between the incomplete field name to be filled and the other field names in the data table, and determine the missing type of the incomplete field name to be filled according to the correlation.
  • in step 206, determining the missing type of the incomplete field name according to the correlation between the incomplete field name and other field names in the data table includes: when the correlations between the incomplete field name and all other field names in the data table are less than a first preset value, determining that the missing type of the incomplete field name is completely random missing; when the correlation between the incomplete field name and at least one complete field name in the data table is greater than a second preset value, determining that the missing type of the incomplete field name is random missing; and when the correlation between the incomplete field name and at least one other incomplete field name in the data table is greater than a third preset value, determining that the missing type of the incomplete field name is non-random missing.
  • the terminal can set corresponding thresholds, calculate the correlation between the incomplete field name to be filled and each of the other field names in the data table, and determine the missing type of the incomplete field name to be filled by comparing the correlation with the thresholds.
  • when the correlations between the incomplete field name to be filled and all other field names are less than the first preset value, there is no implicit connection between the incomplete field name and the remaining field names, so the missing data values of the incomplete field name are unrelated to the data values corresponding to the other field names. The missing value corresponding to the incomplete field name can then be calculated without referring to the data values of the other field names, and the missing type of the incomplete field name can be determined to be completely random missing.
  • when the correlation between the incomplete field name to be filled and at least one complete field name in the data table is greater than the second preset value, the missing data values of the incomplete field name have a certain association with the data values corresponding to that complete field name, and the missing type of the incomplete field name can be determined to be random missing.
  • when the correlation between the incomplete field name to be filled and at least one other incomplete field name in the data table is greater than the third preset value, there is a certain implicit connection between the two incomplete field names, so the missing data values of the incomplete field name have a certain association with the data values corresponding to the other incomplete field name. The data values of that other incomplete field name need to be referred to when calculating the missing value, and the missing type of the incomplete field name can be determined to be non-random missing.
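The threshold rules above can be sketched as a small classifier. The function name, the threshold defaults, and the use of the absolute value of the correlation are all assumptions made for illustration; the text only speaks of first, second, and third preset values. Because the text allows a field to be both random missing and non-random missing, the sketch returns a set of types.

```python
def classify_missing_type(correlations, incomplete_fields,
                          first=0.1, second=0.5, third=0.5):
    """Classify one incomplete field from its correlations with the other fields.

    correlations: dict mapping every *other* field name to its correlation
                  with the field being classified.
    incomplete_fields: set of field names that are themselves incomplete.
    first/second/third: hypothetical preset values (not given in the source).
    """
    # Assumption: the strength of the link is the absolute correlation.
    strength = {name: abs(value) for name, value in correlations.items()}
    types = set()
    if all(s < first for s in strength.values()):
        types.add("completely random missing")
    if any(s > second for name, s in strength.items()
           if name not in incomplete_fields):
        types.add("random missing")
    if any(s > third for name, s in strength.items()
           if name in incomplete_fields):
        types.add("non-random missing")
    return types


print(classify_missing_type({"region": 0.7, "education": 0.6, "name": 0.01},
                            incomplete_fields={"education"}))
```

An empty result means none of the three rules fired, for example when every correlation lies between the first and second preset values; how to handle that case is not specified in the text.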
  • the data table filling method further includes a step of calculating the correlation: computing the mean and standard deviation corresponding to each field name in the data table; and, according to the mean and standard deviation, calculating the correlation between any two field names based on the following formula:
  • the terminal may obtain the data values of the incomplete field name X and compute their average μ_X and, correspondingly, obtain the data values of another field name Y and compute their average μ_Y. It then calculates the standard deviations corresponding to field name X and field name Y from the relationship between standard deviation and mean, which can be calculated by the following formula:
  • X_i represents the i-th data value corresponding to the field name X.
  • the terminal can calculate each data value of Z from the calculated averages and each data value of the field names, that is, the i-th data value of Z is (X_i - μ_X)(Y_i - μ_Y), and then take the average of the data values of Z as the expected value.
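The formulas this passage refers to are not present in the text. A reconstruction consistent with the definitions given here (the means, the standard deviations, and the expected value of Z) is the Pearson correlation coefficient. Dividing by n rather than n - 1 in the standard deviation is an assumption, chosen to match the plain average used for the expected value of Z.

```latex
\mu_X = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad
\mu_Y = \frac{1}{n}\sum_{i=1}^{n} Y_i

\sigma_X = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (X_i - \mu_X)^2}, \qquad
\sigma_Y = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (Y_i - \mu_Y)^2}

Z_i = (X_i - \mu_X)(Y_i - \mu_Y), \qquad
E[Z] = \frac{1}{n}\sum_{i=1}^{n} Z_i

\rho_{X,Y} = \frac{E[Z]}{\sigma_X \, \sigma_Y}
```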
  • if the data value types of both field names are numeric, the data values of the two field names can be used directly to calculate the correlation. If at least one of the two field names has character-type data values, the terminal can first collect the enumeration values of that field name and assign a numeric data value to each enumeration value, so that the character-type data values are converted to numeric data values, and then calculate the correlation based on the converted data values.
  • for example, the enumeration values collected for the field name "education" include: Ph.D., Master, Undergraduate, Junior College, Technical Secondary School, Junior High School, and unknown. These can be converted in turn into corresponding data values such as 6, 5, 4, 3, 2, 1, and 0, or 100, 80, 70, 60, 50, 20, and 0, and the correlation is then calculated based on the converted data values.
  • the relationship between the converted data values should be consistent with the relationship between the character data values before conversion.
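As a sketch of the conversion and the correlation calculation described above: the mapping below uses the 6 to 0 scale given in the text, while the function names, the sample values, and the choice to return 0.0 when a field has zero variance are assumptions for illustration.

```python
import math

# Mapping from the text's example: enumeration values -> 6, 5, 4, 3, 2, 1, 0.
EDUCATION_CODES = {
    "Ph.D.": 6, "Master": 5, "Undergraduate": 4, "Junior College": 3,
    "Technical Secondary School": 2, "Junior High School": 1, "unknown": 0,
}


def pearson(xs, ys):
    """Correlation from means, standard deviations, and the mean of Z_i."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    if std_x == 0 or std_y == 0:
        return 0.0  # Assumption: a constant field is treated as uncorrelated.
    expected_z = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    return expected_z / (std_x * std_y)


def encode(values, codes):
    """Convert character-type data values to numeric ones via an enumeration map."""
    return [codes[v] for v in values]


# Invented sample values, for illustration only.
education = ["Master", "Undergraduate", "Ph.D."]
loan_amount = [300, 200, 400]
print(pearson(encode(education, EDUCATION_CODES), loan_amount))
```

Only samples that have a data value under both field names should be passed in, since a missing value cannot take part in the products Z_i.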
  • Step 208 Calculate the missing value according to the filling method corresponding to the missing type according to the existing data value in the data table.
  • after determining the missing type corresponding to the incomplete field name currently to be filled, the terminal can calculate the missing value corresponding to the incomplete field name from the existing data values in the data table, using the filling method corresponding to that missing type.
  • the existing data in the data table can be roughly divided into two categories, one is the data value corresponding to the incomplete field name, and the other is the data value corresponding to the field name related to the incomplete field name.
  • the missing type is completely random missing; in step 208, calculating the missing value from the existing data values in the data table using the filling method corresponding to the missing type includes: when the data value type corresponding to the incomplete field name is character type, computing the median of the existing data values of the incomplete field name and using it as the missing value corresponding to the incomplete field name, or computing the mode of the existing data values of the incomplete field name and using it as the missing value corresponding to the incomplete field name; when the data value type corresponding to the incomplete field name is numeric, computing the average of the existing data values of the incomplete field name and using it as the missing value corresponding to the incomplete field name.
  • because the missing type is completely random missing, the terminal can calculate the missing value based on the existing data values of the incomplete field name itself.
  • the data value type corresponding to the field name is character type, which means that the type of the data value corresponding to the field name is character type, and the data value type is numeric type, which means that the type of the data value corresponding to the field name is pure numeric type.
  • for example, the data value type corresponding to the complete field name "age" is numeric, the data value type corresponding to the incomplete field name "Education" is character type, and the data value type corresponding to the incomplete field name "Loan Amount" is numeric.
  • when the data value type corresponding to the incomplete field name is character type, the terminal can compute the median of the existing data values of the incomplete field name and use it as the missing value corresponding to the incomplete field name; alternatively, the terminal can compute the mode of the existing data values of the incomplete field name and use the mode as the missing value corresponding to the incomplete field name.
  • when the data value type corresponding to the incomplete field name is numeric, the terminal may compute the average of the existing data values of the incomplete field name and use it as the missing value corresponding to the incomplete field name.
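A minimal sketch of this filling rule, under stated assumptions: missing data values are represented as `None`, the function names are hypothetical, and only the mode variant is shown for character-type values (a median would additionally require an ordering of the enumeration values).

```python
from collections import Counter


def missing_value_for(values):
    """Mean for numeric fields, mode for character-type fields."""
    present = [v for v in values if v is not None]
    # Assumption: at least one data value exists under the field name.
    if all(isinstance(v, (int, float)) for v in present):
        return sum(present) / len(present)
    return Counter(present).most_common(1)[0][0]


def fill_completely_random(values):
    """Fill every missing data value with the single computed missing value."""
    fill = missing_value_for(values)
    return [fill if v is None else v for v in values]


# Invented sample values, for illustration only.
print(fill_completely_random([100, None, 300]))
print(fill_completely_random(["Master", None, "Master", "Undergraduate"]))
```

Every missing cell under the field receives the same value here, which is the behavior the later cluster-based and model-based methods are meant to improve on.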
  • Step 210 Fill in the missing data value of the incomplete field name according to the missing value.
  • the terminal can use the respective missing value to fill in the missing data value of the incomplete field name. There is no longer missing data value in the filled data table, so that it is convenient for data analysis or statistics based on the filled data table.
  • the missing type is a completely random missing; step 208, according to the existing data value in the data table, calculating the missing value according to the filling method corresponding to the missing type includes:
  • Step 302 Determine the first type of samples in the data table, for which the data value corresponding to the incomplete field name is missing.
  • Step 304 Determine the second type of samples in the data table, for which the data value corresponding to the incomplete field name exists.
  • Samples are data entries recorded in the data table, and each sample has its own data value under each field name.
  • the first type of sample is a sample with missing data value corresponding to the incomplete field name to be filled in the data table
  • the second type of sample is a sample with data value corresponding to the incomplete field name to be filled in the data table.
  • for example, for the incomplete field name "Loan Amount" to be filled, the second sample belongs to the first type of samples, and the first, third, and fourth samples belong to the second type of samples.
  • for the incomplete field name "region" to be filled, the fourth sample belongs to the first type of samples, and the first, second, and third samples belong to the second type of samples.
  • Step 306 Count the number of samples of the first type of samples
  • Step 308 calculating the ratio of the number of samples to the total number of samples
  • the missing type of the incomplete field name currently to be filled is completely random missing, which means that there is little connection between the incomplete field name to be filled and the other field names in the data table.
  • the terminal can count the number of samples of the first type of sample and calculate the ratio of the number of samples of the first type of sample to the total number of samples in the data table.
  • Step 310 when the ratio is greater than the threshold, the data value of the first type of sample under the incomplete field name is replaced with the first value; the data value of the second type of sample under the incomplete field name is replaced with the second value.
  • the threshold can be set to 50%. If more than half of the samples are missing data values under the incomplete field name to be filled, data analysis and data statistics will inevitably be affected. Since the incomplete field name has little connection with the other field names, the terminal can replace the data values under the incomplete field name as follows: the data value of each first-type sample under the incomplete field name is replaced with the first value, and the data value of each second-type sample under the incomplete field name is replaced with the second value.
  • for example, the terminal determines that the incomplete field name "ID number" in the data table is of the completely random missing type and that more than half of the samples belong to the first type, that is, more than half of the samples are missing a data value under "ID number". The terminal can then replace the data value under "ID number" with "1" for the samples that have a data value there, and with "0" for the samples whose data value is missing.
  • replacing the original data values in this way retains some information compared with directly deleting all data values under the incomplete field name.
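Steps 302 to 310 can be sketched as below. Missing data values are represented as `None`, the function name is hypothetical, and the defaults follow the text's example (threshold 50%, first value 0 for missing, second value 1 for present).

```python
def replace_with_indicator(values, threshold=0.5, first_value=0, second_value=1):
    """Replace a mostly-missing field with presence indicators.

    First-type samples (data value missing) get first_value; second-type
    samples (data value present) get second_value. If the missing ratio does
    not exceed the threshold, the values are returned unchanged.
    """
    missing_count = sum(1 for v in values if v is None)
    ratio = missing_count / len(values)
    if ratio > threshold:
        return [first_value if v is None else second_value for v in values]
    return list(values)


# Placeholder identifiers, invented for illustration only.
print(replace_with_indicator([None, "id-a", None, None]))
```

The comparison is strict, so a field with exactly half of its values missing is left as it is and would be handled by one of the other filling methods.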
  • the missing type is random missing; in step 208, calculating the missing value according to the filling method corresponding to the missing type according to the existing data value in the data table includes:
  • Step 402 Determine the complete field name related to the incomplete field name
  • according to step 206, the terminal may determine the complete field name related to the incomplete field name to be filled.
  • Step 404 Cluster the samples in the data table according to the data values of the complete field name to obtain clusters.
  • the terminal may cluster all samples in the data table according to the data values corresponding to at least one complete field name to obtain the clusters.
  • the terminal may cluster all samples according to the similarity between their data values under the at least one complete field name, or the terminal may map the data values corresponding to the complete field name to multiple categories and then cluster according to the category of each data value.
  • for example, the terminal can cluster all the samples in the data table according to the complete field name "Working Year": samples with 1 or 2 working years form one category, samples with 3 to 5 working years form one category, samples with 6 to 8 working years form one category, and samples with more than 8 working years form one category.
  • when multiple complete field names are related to the incomplete field name, the data values corresponding to these complete field names can be combined to cluster the samples in the data table and obtain each cluster.
  • Step 406 Determine the third type of samples in the data table, for which the data value corresponding to the incomplete field name is missing.
  • the terminal identifies the third-type samples, which are missing a data value under the incomplete field name currently to be filled, and determines which of the clusters obtained in step 404 each third-type sample belongs to.
  • Step 408 For the cluster to which a third-type sample belongs, calculate the average of the data values under the incomplete field name of the samples in that cluster, and use the calculated average as the missing value to be filled.
  • the terminal may calculate the average of the data values of all samples in the cluster under the incomplete field name to be filled, and use the calculated average as the missing value for the third-type samples that fall within that cluster.
  • because the samples are clustered first, a separate missing value can be calculated for the samples with missing data values in each cluster. Compared with filling the data values of all samples under the incomplete field name with the same missing value, the filled data values are more accurate.
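Steps 402 to 408 can be sketched with the category-based clustering the text gives for "Working Year". The bucket boundaries follow that example; the field keys, function names, and sample values are hypothetical, and leaving a value as `None` when its cluster has no existing data value is an assumption.

```python
def working_year_bucket(years):
    """Map working years to the categories from the text's example."""
    if years <= 2:
        return "1-2"
    if years <= 5:
        return "3-5"
    if years <= 8:
        return "6-8"
    return ">8"


def fill_by_cluster(samples, complete_field, incomplete_field):
    """Fill each missing value with the mean of its cluster's existing values."""
    clusters = {}
    for sample in samples:
        key = working_year_bucket(sample[complete_field])
        clusters.setdefault(key, []).append(sample)

    filled = []
    for sample in samples:
        if sample[incomplete_field] is not None:
            filled.append(dict(sample))
            continue
        peers = [p[incomplete_field]
                 for p in clusters[working_year_bucket(sample[complete_field])]
                 if p[incomplete_field] is not None]
        # Assumption: with no existing value in the cluster, leave it missing.
        value = sum(peers) / len(peers) if peers else None
        filled.append({**sample, incomplete_field: value})
    return filled


# Invented sample values, for illustration only.
SAMPLES = [
    {"working_years": 1, "loan_amount": 100},
    {"working_years": 2, "loan_amount": None},
    {"working_years": 2, "loan_amount": 300},
    {"working_years": 7, "loan_amount": 900},
]
print(fill_by_cluster(SAMPLES, "working_years", "loan_amount"))
```

A similarity-based clustering algorithm could replace `working_year_bucket` without changing the fill step, since only the cluster key of each sample is used.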
  • the missing type is random missing; in step 208, calculating the missing value according to the filling method corresponding to the missing type according to the existing data value in the data table includes:
  • Step 502 Determine a first sample set where the data value corresponding to the incomplete field name exists in the data table and a second sample set where the data value corresponding to the incomplete field name is missing;
  • the terminal may also construct a prediction model according to the data value corresponding to the complete field name related to the incomplete field name in the data table, and use the prediction The model predicts data values with missing incomplete field names. Specifically, the terminal may first divide all the samples in the data table into two types. One type is the sample where the data value corresponding to the incomplete field name to be filled currently exists. The set formed by these samples is called the first sample set. The other type is the samples with missing data values corresponding to the incomplete field names to be filled at present. The set formed by these samples is called the second sample set.
  • Step 504 Construct a prediction model according to the data values corresponding to the complete field names in the first sample set related to the incomplete field names;
  • the terminal may determine the complete field name related to the incomplete field name currently to be filled, obtain the data values of all samples in the first sample set under the determined complete field name, and build a prediction model relating these data values to the existing data values under the incomplete field name.
  • Step 506 Input the data value corresponding to the complete field name of each sample in the second sample set into the prediction model, and output the predicted value of each sample in the second sample set under the incomplete field name through the prediction model;
  • Step 508 Use the predicted value as the missing value to be filled.
  • the corresponding data value under N1 is missing. Determine the complete field names related to the incomplete field name m, including n, p, and q.
  • m = n·w1 + p·w2 + q·w3 + b, where w1, w2, w3 and b are trainable model parameters.
  • the model here is just an example, which is only used to indicate that the input of the prediction model is n, p, and q, and the output is m.
  • the parameters of the model can be adjusted by gradient descent, so that the constructed prediction model fits each sample in the first sample set.
  • the data values of each sample in the second sample set under the complete field names n, p, and q can then be input into the prediction model, which outputs the corresponding data value of each sample under the incomplete field name m. The output predicted value can be filled in as the missing data value.
  • in this way, the missing values of the samples under the incomplete field name m are not all the same but are closely tied to the related complete field names, which can improve the accuracy of the missing values to be filled.
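Steps 502 to 508 can be sketched with the example linear model m = n·w1 + p·w2 + q·w3 + b trained by batch gradient descent. This is an illustration only: the function names, learning rate, epoch count, mean squared error loss, and sample values are all assumptions, and the text itself notes that the linear form is just one example of a prediction model.

```python
def train_linear_model(features, targets, learning_rate=0.1, epochs=5000):
    """Fit weights and bias by batch gradient descent on mean squared error."""
    n_samples = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, targets):
            error = sum(w * xi for w, xi in zip(weights, x)) + bias - y
            for j in range(n_features):
                grad_w[j] += 2 * error * x[j] / n_samples
            grad_b += 2 * error / n_samples
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


def predict(weights, bias, x):
    """Predicted value under the incomplete field name for one sample."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


# First sample set: (n, p, q) values with an existing m. Invented data that
# follows m = 2n + 3p - q + 1, for illustration only.
FIRST_SET_FEATURES = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0), (0, 1, 1), (1, 1, 1)]
FIRST_SET_TARGETS = [3, 4, 0, 6, 3, 5]

weights, bias = train_linear_model(FIRST_SET_FEATURES, FIRST_SET_TARGETS)

# Second sample set: samples whose m is missing; the prediction is the fill value.
SECOND_SET_FEATURES = [(1, 0, 1)]
print([predict(weights, bias, x) for x in SECOND_SET_FEATURES])
```

With real data the features would usually need scaling before gradient descent, because a fixed learning rate can diverge when field values differ by orders of magnitude.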
  • the data table filling method specifically includes the following steps:
  • when the missing type is completely random missing and the data value type corresponding to the incomplete field name is character type, compute the median of the existing data values of the incomplete field name and use it as the missing value corresponding to the incomplete field name, or compute the mode of the existing data values of the incomplete field name and use it as the missing value corresponding to the incomplete field name; or,
  • when the missing type is completely random missing and the data value type corresponding to the incomplete field name is numeric, compute the average of the existing data values of the incomplete field name and use it as the missing value corresponding to the incomplete field name; or,
  • when the missing type is completely random missing, determine the first type of samples in the data table, for which the data value corresponding to the incomplete field name is missing; determine the second type of samples, for which the data value corresponding to the incomplete field name exists; count the number of first-type samples; calculate the ratio of that number to the total number of samples; when the ratio is greater than the threshold, replace the data value of each first-type sample under the incomplete field name with the first value and the data value of each second-type sample under the incomplete field name with the second value; or,
  • when the missing type is random missing, determine the complete field name related to the incomplete field name; cluster the samples in the data table according to the data values of the complete field name to obtain clusters; determine the third type of samples in the data table, for which the data value corresponding to the incomplete field name is missing; calculate the average, under the incomplete field name, of the samples in the cluster to which a third-type sample belongs, and use the calculated average as the missing value to be filled; or,
  • when the missing type is random missing, determine a first sample set in which the data value corresponding to the incomplete field name exists and a second sample set in which it is missing; construct a prediction model according to the data values, in the first sample set, corresponding to the complete field names related to the incomplete field name; input the data values corresponding to the complete field names of each sample in the second sample set into the prediction model, and output through the prediction model the predicted value of each sample in the second sample set under the incomplete field name; use the predicted value as the missing value to be filled.
  • it should be understood that although the steps in the flowcharts of FIGS. 2 to 5 are displayed in sequence as indicated by the arrows, they are not necessarily executed in that sequence. Unless expressly stated herein, the execution of these steps is not strictly limited in order, and they can be executed in other orders. Moreover, at least some of the steps in FIGS. 2 to 5 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily executed at the same time and may be executed at different times, and they are not necessarily executed sequentially: they may be executed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
  • a data table filling device 600 is provided, including: a data table acquisition module 602, an incomplete field name determination module 604, a missing type determination module 606, a missing value calculation module 608, and a filling module 610, where:
  • the data table acquisition module 602 is used to obtain the data table uploaded by the user;
  • the incomplete field name determination module 604 is used to determine the incomplete field name in the data table, the incomplete field name lacking a data value;
  • the missing type determination module 606 is used to determine the missing type of the incomplete field name according to the correlation between the incomplete field name and other field names in the data table;
  • the missing value calculation module 608 is used to calculate the missing value from the existing data values in the data table, using the filling method corresponding to the missing type;
  • the filling module 610 is used to fill in the missing data value of the incomplete field name according to the missing value.
  • the missing type determination module 606 is also used to count the number of data values corresponding to each field name in the data table; determine the total number of samples corresponding to the data table; and, when the number is less than the total number of samples, determine the field name to be an incomplete field name.
  • the missing type determination module 606 is further used to: determine that the missing type of the incomplete field name is completely random missing when the correlations between the incomplete field name and the other field names in the data table are all less than the first preset value; determine that the missing type is random missing when the correlation between the incomplete field name and at least one complete field name in the data table is greater than the second preset value; and determine that the missing type is non-random missing when the correlation between the incomplete field name and at least one incomplete field name in the data table is greater than the third preset value.
  • the missing type is completely random missing; the missing value calculation module 608 is also used to: when the data value type corresponding to the incomplete field name is character type, calculate the median of the existing data values of the incomplete field name and use it as the missing value corresponding to the incomplete field name, or count the mode of the existing data values and use it as the missing value; and, when the data value type corresponding to the incomplete field name is numeric, calculate the average of the existing data values of the incomplete field name and use it as the missing value corresponding to the incomplete field name.
  • the missing type is completely random missing; the missing value calculation module 608 is also used to determine the first-type samples, which lack the data value corresponding to the incomplete field name in the data table; determine the second-type samples, which have a data value corresponding to the incomplete field name in the data table; count the number of first-type samples; calculate the ratio of that number to the total number of samples; and, when the ratio is greater than the threshold, replace the data value of the first-type samples under the incomplete field name with the first value and replace the data value of the second-type samples under the incomplete field name with the second value.
  • the missing type is random missing; the missing value calculation module 608 is also used to determine the complete field name related to the incomplete field name; cluster the samples in the data table according to the data values of that complete field name to obtain clusters; determine the third-type samples, which lack the data value corresponding to the incomplete field name in the data table; calculate the mean, under the incomplete field name, of the samples in the cluster to which each third-type sample belongs; and use the calculated mean as the missing value to be filled.
  • the missing type is random missing; the missing value calculation module 608 is further used to determine a first sample set, in which the data value corresponding to the incomplete field name exists, and a second sample set, in which that data value is missing; build a prediction model from the data values in the first sample set that correspond to the complete field names related to the incomplete field name; input the data values corresponding to the complete field names of each sample in the second sample set into the prediction model, which outputs the predicted value of each sample in the second sample set under the incomplete field name; and use the predicted value as the missing value to be filled.
  • the data table filling device 600 further includes a correlation calculation module; the correlation calculation module is used to compute the mean and standard deviation corresponding to each field name in the data table and, from the mean and standard deviation, calculate the correlation between any two field names according to the following formula:
  • the above data table filling device 600, on acquiring the data table uploaded by the user, determines the incomplete field names, for which data values are missing in the data table, and determines the missing type of each incomplete field name according to the correlation between that incomplete field name and the other field names in the data table. It then calculates the missing value corresponding to the incomplete field name from the existing data values in the data table, using the filling method corresponding to the missing type, and uses the missing value to fill in the missing data values of the incomplete field name. Following these steps, the missing data values of every incomplete field name in the data table can be filled, so the data table is effectively filled and the accuracy of data analysis is significantly improved.
  • Each module in the above data table filling device 600 may be implemented in whole or in part by software, hardware, or a combination thereof.
  • the above modules may be embedded in, or be independent of, the processor in the computer device in hardware form, or may be stored in the memory of the computer device in software form, so that the processor can call them and perform the operations corresponding to each module.
  • a computer device is provided.
  • the computer device may be a terminal, and an internal structure diagram thereof may be as shown in FIG. 7.
  • the computer equipment includes a processor, a memory, a network interface, and an input device connected through a system bus.
  • the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and computer-readable instructions.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the network interface of the computer device is used to communicate with external terminals through a network connection.
  • the computer-readable instructions are executed by the processor to implement a data table filling method.
  • the input device of the computer device may be a touch layer covered on the display screen, or may be a button, a trackball, or a touch pad provided on the computer device shell, or an external keyboard, touch pad, or mouse.
  • FIG. 7 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or arrange the components differently.
  • the data table filling apparatus may be implemented in a form of computer-readable instructions, and the computer-readable instructions may run on a computer device as shown in FIG. 7.
  • the memory of the computer device may store the program modules constituting the data table filling device 600, for example, the data table acquisition module 602, the incomplete field name determination module 604, the missing type determination module 606, the missing value calculation module 608, and the filling module 610 shown in FIG. 6.
  • the computer-readable instructions formed by these program modules cause the processor to execute the steps of the data table filling method of each embodiment of the present application described in this specification.
  • the computer device shown in FIG. 7 may execute step S202 through the data table acquisition module in the data table filling apparatus 600 shown in FIG. 6.
  • the computer device may execute step S204 through the incomplete field name determination module.
  • the computer device may execute step S206 through the missing type determination module.
  • the computer device may execute step S208 through the missing value calculation module.
  • the computer device may execute step S210 through the filling module.
  • a computer device which includes a memory and one or more processors.
  • the memory stores computer-readable instructions.
  • when the computer-readable instructions are executed by the one or more processors, the one or more processors perform the steps of the above data table filling method.
  • the steps of the data table padding method may be the steps in the data table padding methods of the foregoing embodiments.
  • one or more non-volatile computer-readable storage media storing computer-readable instructions are provided.
  • when the computer-readable instructions are executed by one or more processors, the one or more processors perform the steps of the above data table filling method.
  • the steps of the data table padding method may be the steps in the data table padding methods of the foregoing embodiments.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous chain (Synchlink) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), etc.
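The filling methods attributed above to the missing value calculation module 608 can be illustrated with a minimal Python sketch. It is not the implementation of the application: the function names, the sample layout (a list of dicts with `None` for a missing value), and the sample data are all assumptions made for illustration. For the random-missing case, samples are simply grouped on equal values of one related complete field, which stands in for the clustering step; for character data only the mode variant is shown.

```python
from collections import Counter, defaultdict
from statistics import mean

def fill_completely_random(samples, field):
    """Fill None under `field` with the mean (numeric) or the mode (character)."""
    present = [s[field] for s in samples if s[field] is not None]
    if not present:
        return None  # nothing to learn a fill value from
    if all(isinstance(v, (int, float)) for v in present):
        fill = mean(present)
    else:
        fill = Counter(present).most_common(1)[0][0]
    for s in samples:
        if s[field] is None:
            s[field] = fill
    return fill

def fill_by_group_mean(samples, field, related_field):
    """Fill None under `field` with the mean of `field` in the sample's group.

    Grouping on equal values of `related_field` is a simplification that
    replaces the clustering step described in the text.
    """
    groups = defaultdict(list)
    for s in samples:
        if s[field] is not None:
            groups[s[related_field]].append(s[field])
    for s in samples:
        if s[field] is None and groups[s[related_field]]:
            s[field] = mean(groups[s[related_field]])

# Hypothetical samples, invented for illustration only.
loans = [
    {"region": "Beijing", "education": "bachelor", "loan amount": 500000},
    {"region": "Beijing", "education": None, "loan amount": None},
    {"region": "Chengdu", "education": "master", "loan amount": 200000},
    {"region": "Beijing", "education": "bachelor", "loan amount": 700000},
]
education_fill = fill_completely_random(loans, "education")
fill_by_group_mean(loans, "loan amount", "region")
```

In this sketch the missing "education" value receives the most frequent existing value, and the missing "loan amount" receives the mean loan amount of the other samples from the same region.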


Abstract

Provided is a data table filling method, comprising: obtaining a data table uploaded by a user; determining an incomplete field name in the data table, the incomplete field name lacking a data value; determining a missing type of the incomplete field name according to the degree of correlation between the incomplete field name and other field names in the data table; calculating a missing value from the data values already in the data table, using a filling method corresponding to the missing type; and filling in the missing data value of the incomplete field name according to the missing value.

Description

Data table filling method, apparatus, computer device, and storage medium
This application claims priority to Chinese patent application No. 201910001784.2, titled "Data table filling method, apparatus, computer device, and storage medium", filed with the China Patent Office on January 2, 2019, the entire content of which is incorporated herein by reference.
Technical field
The present application relates to a data table filling method, apparatus, computer device, and storage medium.
Background
Report data is the data in a data table and is one of the most common forms of data in practical applications. It can be used for data analysis or to generate reports for presentation to users; examples include loan business data, human resources data, and insurance business data. However, data values in such report data are inevitably lost due to improper operation, system failure, human factors, and the like.
However, the inventors realized that existing commercial data reporting platforms usually either do not process the missing data values in a data table or directly delete the samples with missing data values. This often distorts the distribution of the report data in the data table as a whole and reduces the accuracy of data analysis.
Summary of the invention
According to various embodiments disclosed in the present application, a data table filling method, apparatus, computer device, and storage medium are provided.
A data table filling method includes:
obtaining a data table uploaded by a user;
determining an incomplete field name in the data table, the incomplete field name lacking a data value;
determining the missing type of the incomplete field name according to the correlation between the incomplete field name and other field names in the data table;
calculating a missing value from the existing data values in the data table, using the filling method corresponding to the missing type; and
filling in the missing data value of the incomplete field name according to the missing value.
A data table filling device includes:
a data table acquisition module, used to obtain a data table uploaded by a user;
an incomplete field name determination module, used to determine an incomplete field name in the data table, the incomplete field name lacking a data value;
a missing type determination module, used to determine the missing type of the incomplete field name according to the correlation between the incomplete field name and other field names in the data table;
a missing value calculation module, used to calculate a missing value from the existing data values in the data table, using the filling method corresponding to the missing type; and
a filling module, used to fill in the missing data value of the incomplete field name according to the missing value.
A computer device includes a memory and one or more processors. The memory stores computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform the following steps:
obtaining a data table uploaded by a user;
determining an incomplete field name in the data table, the incomplete field name lacking a data value;
determining the missing type of the incomplete field name according to the correlation between the incomplete field name and other field names in the data table;
calculating a missing value from the existing data values in the data table, using the filling method corresponding to the missing type; and
filling in the missing data value of the incomplete field name according to the missing value.
One or more non-volatile computer-readable storage media store computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
obtaining a data table uploaded by a user;
determining an incomplete field name in the data table, the incomplete field name lacking a data value;
determining the missing type of the incomplete field name according to the correlation between the incomplete field name and other field names in the data table;
calculating a missing value from the existing data values in the data table, using the filling method corresponding to the missing type; and
filling in the missing data value of the incomplete field name according to the missing value.
The details of one or more embodiments of the present application are set forth in the drawings and description below. Other features and advantages of the present application will become apparent from the description, the drawings, and the claims.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present application more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is an application scenario diagram of a data table filling method according to one or more embodiments.
FIG. 2 is a schematic flowchart of a data table filling method according to one or more embodiments.
FIG. 3 is a schematic flowchart of the step of calculating the missing value from the existing data values in the data table, using the filling method corresponding to the missing type, according to one or more embodiments.
FIG. 4 is a schematic flowchart of the step of calculating the missing value from the existing data values in the data table, using the filling method corresponding to the missing type, according to another one or more embodiments.
FIG. 5 is a schematic flowchart of the step of calculating the missing value from the existing data values in the data table, using the filling method corresponding to the missing type, according to yet another one or more embodiments.
FIG. 6 is a block diagram of a data table filling device according to one or more embodiments.
FIG. 7 is a block diagram of a computer device according to one or more embodiments.
Detailed description
To make the technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the present application and do not limit it.
The data table filling method provided by this application can be applied in the application environment shown in FIG. 1. The terminal 102 communicates with the server 104 through a network. The terminal 102 can obtain the data table uploaded by the user and send it to the server 104. The server 104 calculates the correlation between the field names included in the data table and feeds back the correlation between any two field names to the terminal 102. After determining the incomplete field names, for which data values are missing in the data table, the terminal 102 determines the missing type of each incomplete field name according to the correlation between that incomplete field name and the other field names in the data table. The terminal 102 may further calculate the missing value from the existing data values in the data table, using the filling method corresponding to the missing type of the incomplete field name, fill in the missing data values of the incomplete field name according to the missing value, and send the filled data table to the server 104. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, or portable wearable device. The server 104 may be implemented as an independent server or as a server cluster composed of multiple servers.
In one embodiment, as shown in FIG. 2, a data table filling method is provided. The method is described using its application to the terminal in FIG. 1 as an example, and includes the following steps:
Step 202: obtain the data table uploaded by the user.
A data table is a structured table of data, for example a table in CSV (Comma-Separated Values) format. A CSV data table stores tabular data as plain text, and the stored data includes numeric and character types. Specifically, a web interface may be provided through which the user uploads the data table, so that the terminal obtains the uploaded data table. In one embodiment, each user must generate the data table containing the report data according to a preset file format or table template, so that the terminal can parse the table structure information of the uploaded data table.
Table 1 below is a schematic diagram of a CSV-format data table uploaded in one embodiment.
Figure PCTCN2019122323-appb-000001
Table 1
As can be seen from Table 1 above, the elements in each row of the data table are separated by commas. The elements in the first row give the column names, also called the header or the field names of the data table; the elements in each column are the data values corresponding to that field name, so one field name corresponds to multiple data values. From the second row onward, each row represents one sample in the data table; Table 1 shows four samples.
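The layout described above (a header row of field names followed by one row per sample) could be parsed along the following lines. This is only a sketch using Python's standard `csv` module: the CSV text is invented for illustration and is not the content of Table 1, only a subset of the field names is used, and treating an empty cell as a missing value is an assumption.

```python
import csv
import io

# Hypothetical CSV text; the rows are invented for illustration and are
# NOT the contents of Table 1. An empty cell stands for a missing value.
CSV_TEXT = """name,gender,age,region,education,loan amount
A,F,30,Beijing,bachelor,500000
B,M,41,Shenzhen,,300000
C,F,28,Shanghai,master,
D,M,35,Chengdu,bachelor,200000
"""

def parse_table(text):
    """Return (field_names, samples); each sample maps field name -> str or None."""
    reader = csv.reader(io.StringIO(text))
    field_names = next(reader)  # first row: header / field names
    samples = []
    for row in reader:          # every later row is one sample
        samples.append({name: (cell if cell != "" else None)
                        for name, cell in zip(field_names, row)})
    return field_names, samples

field_names, samples = parse_table(CSV_TEXT)
```

All values are kept as strings here; converting numeric columns such as "age" would be a separate step.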
Step 204: determine the incomplete field names in the data table, an incomplete field name being one that lacks a data value.
An incomplete field name is a field name for which data values are missing in the data table; correspondingly, a complete field name is a field name for which no data values are missing. For example, in Table 1 above, the incomplete field names are education and loan amount, and the complete field names are name, gender, age, region, loan time, and ID number.
Specifically, after obtaining the data table uploaded by the user, the terminal may determine each field name for which data values are missing in the data table, that is, each incomplete field name.
In one embodiment, determining the incomplete field names in the data table includes: counting the number of data values corresponding to each field name in the data table; determining the total number of samples in the data table; and, when the number is less than the total number of samples, determining the field name to be an incomplete field name.
Specifically, for the field names included in the data table, the terminal may count the number of data values corresponding to each field name and count the total number of samples in the data table. When the number of data values counted for a field name is less than the total number of samples, the field name is missing data values and is determined to be an incomplete field name.
For example, in Table 1 above, the terminal traverses the data values corresponding to the field name "education": each time it finds a non-NULL data value, it increments the count by 1, until all samples in the data table have been traversed. The resulting count of data values for "education" is 3, while the total number of samples is 4, so "education" is determined to be an incomplete field name. In the same way, "loan amount" can be determined to be an incomplete field name.
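The counting procedure described above can be sketched in Python as follows. The function name, the representation of a sample as a dict with `None` for a missing value, and the sample data are assumptions made for illustration; they mirror the "education" count of 3 against a total of 4 samples but are not the contents of Table 1.

```python
def find_incomplete_fields(field_names, samples):
    """Return the field names whose count of non-missing values is below the sample total."""
    total = len(samples)
    incomplete = []
    for name in field_names:
        # Count one for every sample that has a non-missing value under this field name.
        count = sum(1 for s in samples if s.get(name) is not None)
        if count < total:
            incomplete.append(name)
    return incomplete

# Hypothetical samples, invented for illustration only.
samples = [
    {"name": "A", "education": "bachelor", "loan amount": 500000},
    {"name": "B", "education": None, "loan amount": 300000},
    {"name": "C", "education": "master", "loan amount": None},
    {"name": "D", "education": "bachelor", "loan amount": 200000},
]
incomplete = find_incomplete_fields(["name", "education", "loan amount"], samples)
```

Here "name" has four values for four samples and is complete, while "education" and "loan amount" each have three and are reported as incomplete.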
Step 206: determine the missing type of the incomplete field name according to the correlation between the incomplete field name and other field names in the data table.
The correlation expresses an implicit relationship between two field names. The greater the correlation between two field names, the stronger the relationship between them; conversely, the smaller the correlation, the weaker the relationship. For example, in Table 1 above, among the "regions" where lenders are located, house prices in Beijing, Shenzhen, and Shanghai are generally higher, so the "loan amount" is also generally higher than in other regions, which indicates an implicit relationship between the field names "region" and "loan amount".
The missing type describes the possible relationship between a field name with missing data values and the other field names. Determining the missing type of an incomplete field name makes it possible to fill the missing data values with the corresponding filling method. The missing types are completely random missing, random missing, and non-random missing. Note that the missing type of an incomplete field name may be both random missing and non-random missing, in which case the terminal may use the corresponding filling method as needed to calculate the missing value for that incomplete field name.
Specifically, after determining the incomplete field names in the data table, the terminal may calculate the correlation between the incomplete field name currently to be filled and the other field names in the data table, and determine the missing type of that incomplete field name according to the correlation.
在一个实施例中,步骤206,根据非完全字段名与数据表中其它字段名之间的相关度确定非完全字段名的缺失类型包括:当非完全字段名与数据表中其它字段名之间的相关度均小于第一预设值时,则确定非完全字段名的缺失类型为完全随机缺失;当非完全字段名 与数据表中至少一个完全字段名之间的相关度大于第二预设值时,则确定非完全字段名的缺失类型为随机缺失;当非完全字段名与数据表中至少一个非完全字段名之间的相关度大于第三预设值时,则确定非完全字段名的缺失类型为非随机缺失。In one embodiment, step 206, determining the missing type of the incomplete field name according to the correlation between the incomplete field name and other field names in the data table includes: when the incomplete field name and other field names in the data table When the correlations of all are less than the first preset value, the missing type of the incomplete field name is determined to be completely random missing; when the correlation between the incomplete field name and at least one complete field name in the data table is greater than the second preset Value, the type of missing incomplete field name is determined to be random missing; when the correlation between the incomplete field name and at least one incomplete field name in the data table is greater than the third preset value, the incomplete field name is determined The type of deletion is non-random deletion.
Specifically, the terminal may set corresponding thresholds, calculate the correlation between the incomplete field name currently to be filled and each of the other field names in the data table, and determine the missing type of that incomplete field name by comparing the correlations with the thresholds.
If the correlations between the incomplete field name currently to be filled and the other field names are all less than the first preset value, there is no implicit connection between the incomplete field name and the remaining field names, and hence no association between the data values missing under the incomplete field name and the data values of the other field names. The missing value for the incomplete field name can then be calculated without reference to the data values of the other field names, and the missing type of the incomplete field name is determined to be missing completely at random.
If the correlation between the incomplete field name currently to be filled and at least one complete field name in the data table is greater than the second preset value, there is a certain implicit connection between the incomplete field name and that complete field name, and hence a certain association between the data values missing under the incomplete field name and the data values of that complete field name. The missing value for the incomplete field name then needs to be calculated with reference to the data values of the at least one complete field name, and the missing type of the incomplete field name is determined to be missing at random.
If the correlation between the incomplete field name currently to be filled and at least one other incomplete field name in the data table is greater than the third preset value, there is a certain implicit connection between the incomplete field name and that other incomplete field name, and hence a certain association between the data values missing under the incomplete field name and the data values of that other incomplete field name. The missing value for the incomplete field name then needs to be calculated with reference to the data values of the at least one other incomplete field name, and the missing type of the incomplete field name is determined to be missing not at random.
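As a non-limiting illustration, the three threshold rules above may be sketched in Python as follows. The function name, the default preset values, the use of the magnitude of the correlation, and the order in which the second and third rules are checked are all assumptions made for this sketch; the embodiment does not specify them.

```python
from typing import Dict, Set


def classify_missing_type(
    correlations: Dict[str, float],
    complete_fields: Set[str],
    first_preset: float = 0.2,   # hypothetical default
    second_preset: float = 0.5,  # hypothetical default
    third_preset: float = 0.5,   # hypothetical default
) -> str:
    """Classify one incomplete field name as MCAR, MAR or MNAR.

    `correlations` maps every other field name to its correlation with the
    incomplete field name; `complete_fields` lists the field names that have
    no missing data values.
    """
    # Assumption: compare magnitudes, so strong negative correlations count.
    strength = {name: abs(value) for name, value in correlations.items()}

    # Rule 1: every correlation is below the first preset value.
    if all(s < first_preset for s in strength.values()):
        return "MCAR"
    # Rule 2: some complete field name exceeds the second preset value.
    if any(s > second_preset for name, s in strength.items()
           if name in complete_fields):
        return "MAR"
    # Rule 3: some other incomplete field name exceeds the third preset value.
    if any(s > third_preset for name, s in strength.items()
           if name not in complete_fields):
        return "MNAR"
    # None of the three rules fired; the embodiment does not cover this case.
    return "undetermined"
```

The "undetermined" return value is included only so that the sketch is total; how such a field name would be handled is left open here.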
In one embodiment, the data table filling method further includes a step of calculating the correlation: computing the mean and the standard deviation corresponding to each field name in the data table; and, according to the means and standard deviations, calculating the correlation between any two field names by the following formula:
$$\rho(X,Y) = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X\,\sigma_Y}$$
where $\rho(X,Y)$ denotes the correlation between field name X and field name Y; $\mu_X$ denotes the mean corresponding to field name X; $\mu_Y$ denotes the mean corresponding to field name Y; $\sigma_X$ denotes the standard deviation corresponding to field name X; $\sigma_Y$ denotes the standard deviation corresponding to field name Y; and $E[(X-\mu_X)(Y-\mu_Y)]$ is the expected value of Z, with $Z_i=(X_i-\mu_X)(Y_i-\mu_Y)$.
Specifically, the terminal may obtain the data values of the incomplete field name X and take the mean $\mu_X$ of all of them; correspondingly, it obtains the data values of another field name Y and takes the mean $\mu_Y$ of all of them. The standard deviations corresponding to field name X and field name Y are then calculated from the relationship between standard deviation and mean, by the following formula:
[Figure PCTCN2019122323-appb-000003: formula for the standard deviation of a field name in terms of its mean]
Field name X has N data values in total, and $X_i$ denotes the i-th data value of field name X. From the calculated means and the individual data values of the two field names, the terminal can then calculate each data value of Z, the i-th of which is $(X_i-\mu_X)(Y_i-\mu_Y)$, and take the mean of the data values of Z as the expected value.
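The calculation described above (means, standard deviations, the values of Z, and their mean as the expected value) may be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the standard deviation is taken in its population form (dividing by N), which the text here does not confirm, and the two input lists are assumed to be of equal length and to contain only samples for which both field names have a data value.

```python
import math
from typing import Sequence


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def population_std(values: Sequence[float], mu: float) -> float:
    # Assumption: divide by N rather than N - 1.
    return math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))


def correlation(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Correlation of two field names: E[(X - mu_X)(Y - mu_Y)] / (sigma_X * sigma_Y)."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    mu_x, mu_y = mean(xs), mean(ys)
    sigma_x = population_std(xs, mu_x)
    sigma_y = population_std(ys, mu_y)
    if sigma_x == 0 or sigma_y == 0:
        raise ValueError("correlation is undefined for a constant field")
    # Z_i = (X_i - mu_X)(Y_i - mu_Y); its mean is used as the expected value.
    z = [(x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)]
    return mean(z) / (sigma_x * sigma_y)
```

With the population form of the standard deviation, two perfectly linearly related fields yield a correlation of magnitude 1, which is consistent with taking the plain mean of Z as the expected value.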
In one embodiment, when calculating the correlation between the incomplete field name and another field name, if the data value types of both field names are numeric, the correlation may be calculated directly from their data values. If at least one of the two field names has a character data value type, the enumeration values of that field name may first be collected and each enumeration value matched to a corresponding numeric value, so that the character-type data values are converted into numeric data values; the correlation is then calculated from the matched numeric values.
For example, for the field name "education" in the data table, the enumeration values collected for the field name include doctorate, master's degree, bachelor's degree, junior college, technical secondary school, junior high school, and unknown. These may be converted in turn into corresponding numeric values such as 6, 5, 4, 3, 2, 1 and 0, or into 100, 80, 70, 60, 50, 20 and 0, and the correlation is then calculated from the converted values. The relationship between the converted values should be consistent with the relationship between the original character-type data values.
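The conversion of character-type enumeration values into numeric values may be illustrated by the following sketch, which uses the first mapping of the "education" example. The English labels, the dictionary name and the helper function are hypothetical; missing entries are represented by `None` and are left untouched.

```python
from typing import Dict, List, Optional

# Order-preserving mapping for the "education" example (6 down to 0).
EDUCATION_RANK: Dict[str, int] = {
    "doctorate": 6,
    "master": 5,
    "bachelor": 4,
    "junior college": 3,
    "technical secondary school": 2,
    "junior high school": 1,
    "unknown": 0,
}


def encode(values: List[Optional[str]], mapping: Dict[str, int]) -> List[Optional[int]]:
    """Replace each enumeration value by its numeric value; keep None as None."""
    return [None if v is None else mapping[v] for v in values]
```

Because the mapping is monotonic in the level of education, the ordering of the converted values agrees with the ordering of the original labels, as the example requires.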
Step 208: calculate the missing value from the existing data values in the data table, using the filling method corresponding to the missing type.
Specifically, after the terminal has determined the missing type of the incomplete field name currently to be filled, it can calculate the missing value for that incomplete field name from the existing data values in the data table, using the filling method corresponding to the missing type. The existing data in the data table fall broadly into two classes: the data values of the incomplete field name itself, and the data values of the field names related to the incomplete field name.
In one embodiment, the missing type is missing completely at random, and step 208 of calculating the missing value from the existing data values in the data table using the filling method corresponding to the missing type includes: when the data value type of the incomplete field name is character type, computing the median of the existing data values of the incomplete field name and using the median as the missing value for the incomplete field name, or computing the mode of the existing data values of the incomplete field name and using the mode as the missing value for the incomplete field name; and when the data value type of the incomplete field name is numeric, computing the average of the existing data values of the incomplete field name and using the average as the missing value for the incomplete field name.
Specifically, when the missing type of the incomplete field name is missing completely at random, the data values missing under the incomplete field name have little connection with the existing data values of the other field names in the data table, so the terminal may calculate the missing value from the existing data values of the incomplete field name itself.
A field name whose data value type is character type is one whose data values are characters, and a field name whose data value type is numeric is one whose data values are purely numeric. For example, in Table 1 mentioned above, the data value type of the complete field name "age" is numeric, that of the incomplete field name "education" is character type, and that of the incomplete field name "loan amount" is numeric.
When the terminal determines that the missing type of the incomplete field name currently to be filled is missing completely at random and the data value type of the incomplete field name is character type, the terminal may compute the median of the existing data values of the incomplete field name and use it as the missing value; alternatively, the terminal may compute the mode of the existing data values of the incomplete field name and use it as the missing value.
When the terminal determines that the missing type of the incomplete field name currently to be filled is missing completely at random and the data value type of the incomplete field name is numeric, the terminal may compute the average of the existing data values of the incomplete field name and use it as the missing value.
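The choice of fill value for a field name that is missing completely at random may be sketched as follows. The function names and the optional `rank` argument are hypothetical. Since a median of character-type values requires some ordering, this sketch assumes that the caller supplies one through `rank` and takes the lower middle value; without `rank` it falls back to the mode.

```python
import statistics
from typing import Dict, List, Optional, Union

Value = Union[int, float, str]


def mcar_fill_value(values: List[Optional[Value]],
                    rank: Optional[Dict[str, int]] = None) -> Value:
    """Pick one fill value from the existing (non-None) data values of a field."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("no existing data values to compute a fill value from")
    # Numeric type: use the average of the existing data values.
    if all(isinstance(v, (int, float)) for v in present):
        return statistics.fmean(present)
    # Character type with a known ordering: use the (low) median.
    if rank is not None:
        ordered = sorted(present, key=lambda v: rank[v])
        return ordered[(len(ordered) - 1) // 2]
    # Character type without an ordering: use the mode.
    return statistics.mode(present)


def fill(values: List[Optional[Value]], fill_value: Value) -> List[Value]:
    """Return a copy of the column with every None replaced by fill_value."""
    return [fill_value if v is None else v for v in values]
```

In this sketch every missing entry of the field receives the same fill value, which matches the embodiment: no other field name is consulted.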
Step 210: fill the data values missing under the incomplete field name with the missing value.
Specifically, after calculating the missing value for each incomplete field name in the data table according to steps 202 to 204 above, the terminal can fill the data values missing under each incomplete field name with the respective missing value. The filled data table no longer contains missing data values, which facilitates data analysis or data statistics based on the filled data table.
With the above data table filling method, when a data table uploaded by a user is obtained, the incomplete field names in the data table, under which data values are missing, are determined. The missing type of each incomplete field name is determined according to the correlation between that incomplete field name and the other field names in the data table, and the missing value for the incomplete field name is calculated from the existing data values in the data table using the filling method corresponding to that missing type. The missing value is then used to fill the data values missing under the incomplete field name. By following these steps, the missing data values of every incomplete field name in the data table can be filled, so that the data table is filled effectively and the accuracy of data analysis performed on the filled data table is significantly improved.
As shown in FIG. 3, in one embodiment, the missing type is missing completely at random, and step 208 of calculating the missing value from the existing data values in the data table using the filling method corresponding to the missing type includes:
Step 302: determine the first-type samples in the data table, for which the data value corresponding to the incomplete field name is missing;
Step 304: determine the second-type samples in the data table, for which the data value corresponding to the incomplete field name is present;
A sample is a data entry recorded in the data table, and each sample has its own data value under each field name. The first-type samples are the samples in the data table whose data value under the incomplete field name currently to be filled is missing, and the second-type samples are those whose data value under that incomplete field name is present. For example, in Table 1 mentioned above, for the incomplete field name "loan amount" currently to be filled, the second sample is a first-type sample, and the first, third and fourth samples are second-type samples; for the incomplete field name "region" currently to be filled, the fourth sample is a first-type sample, and the first, second and third samples are second-type samples.
Step 306: count the number of first-type samples;
Step 308: calculate the proportion of that number to the total number of samples;
Specifically, since the missing type of the incomplete field name currently to be filled is missing completely at random, the incomplete field name has little connection with the other field names in the data table. The terminal may count the number of first-type samples and calculate the proportion of that number to the total number of samples in the data table.
Step 310: when the proportion is greater than a threshold, replace the data values of the first-type samples under the incomplete field name with a first value, and replace the data values of the second-type samples under the incomplete field name with a second value.
When the proportion is greater than the threshold, a large number of samples in the data table are missing the data value under the incomplete field name currently to be filled. For example, the threshold may be set to 50%. If more than half of the samples are missing the data value under the incomplete field name, data analysis and data statistics are bound to be affected; and since the incomplete field name has little connection with the other field names, the terminal may binarize the data values of the incomplete field name, replacing the data values of the first-type samples under the incomplete field name with the first value and the data values of the second-type samples with the second value.
For example, after determining that the incomplete field name "ID card number" in the data table is of the missing-completely-at-random type and finding that more than half of the samples are first-type samples, that is, more than half of the samples are missing the data value under the field name "ID card number", the terminal may replace the data value under "ID card number" with "1" for the samples in which a data value is present and with "0" for the samples in which it is missing. In this way, although a large number of data values are missing, those data values have little association with the other existing data values in the data table, and replacing the original data values by binarization retains some information compared with directly deleting all data values under the incomplete field name.
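Steps 302 to 310 may be sketched as follows. The function name and parameter names are hypothetical; the defaults follow the example above (threshold 50%, "0" for a missing value and "1" for a present value). Missing entries are represented by `None`, and the column is returned unchanged when the proportion does not exceed the threshold, which is an assumption of this sketch.

```python
from typing import Any, List, Optional


def binarize_if_mostly_missing(
    values: List[Optional[Any]],
    threshold: float = 0.5,
    first_value: int = 0,   # written to first-type samples (value missing)
    second_value: int = 1,  # written to second-type samples (value present)
) -> List[Any]:
    """Binarize a column when its proportion of missing entries exceeds threshold."""
    if not values:
        return []
    # Steps 302-308: count first-type samples and compute their proportion.
    missing_count = sum(1 for v in values if v is None)
    proportion = missing_count / len(values)
    # Step 310: binarize only when the proportion is greater than the threshold.
    if proportion > threshold:
        return [first_value if v is None else second_value for v in values]
    return list(values)
```

For a column such as `["id-a", None, None, None]` three of four entries are missing, so the sketch returns `[1, 0, 0, 0]`; the placeholder string "id-a" is illustrative only.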
As shown in FIG. 4, in one embodiment, the missing type is missing at random, and step 208 of calculating the missing value from the existing data values in the data table using the filling method corresponding to the missing type includes:
Step 402: determine the complete field names related to the incomplete field name;
Specifically, when the missing type of the incomplete field name is missing at random, the incomplete field name is related to at least one complete field name in the data table, and the terminal may determine, as in step 206, the complete field names related to the incomplete field name currently to be filled.
Step 404: cluster the samples in the data table according to the data values of the complete field names to obtain clusters;
Specifically, after determining at least one complete field name in the data table that is related to the incomplete field name currently to be filled, the terminal may cluster all samples in the data table according to their data values under the at least one complete field name to obtain clusters.
In one embodiment, the terminal may cluster all samples according to the similarity between their data values under the determined at least one complete field name; alternatively, the terminal may map the data values of the complete field name to a plurality of categories and then cluster the samples by the category of their data values.
For example, for the complete field name "years of service", which is related to the incomplete field name "year-end bonus", the terminal may cluster all samples in the data table according to "years of service": samples with 1 or 2 years of service form one cluster, samples with 3 to 5 years form another, samples with 6 to 8 years form another, and samples with more than 8 years form another. When several complete field names are related to "year-end bonus", the samples in the data table may be clustered according to the data values of all of these complete field names together to obtain the clusters.
Step 406: determine the third-type samples in the data table, for which the data value corresponding to the incomplete field name is missing;
Further, the terminal identifies the third-type samples in the data table, whose data value under the incomplete field name currently to be filled is missing, and determines which of the clusters obtained in step 404 each of these third-type samples belongs to.
Step 408: calculate the mean, under the incomplete field name, of the samples included in the cluster to which a third-type sample belongs, and use the calculated mean as the missing value to be filled.
Specifically, after determining the cluster to which a third-type sample belongs, the terminal may calculate the mean of all samples in that cluster under the incomplete field name to be filled, and use the calculated mean as the missing value, under that incomplete field name, of the samples falling within the cluster.
In this embodiment, when the missing type is missing at random, the samples are clustered and the missing value is then calculated per cluster for the samples with missing data values. Compared with filling the data values of all samples under the incomplete field name with one and the same missing value, the filled data values are more accurate.
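Steps 402 to 408 may be sketched as follows, using the category-based clustering of the "years of service" example. All names are hypothetical, each sample is represented as a dictionary, and missing entries are `None`. The sketch averages only the samples of a cluster that actually have a value under the incomplete field name, and leaves a missing value unfilled when its cluster has no such sample; both choices are assumptions.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Optional

Row = Dict[str, Optional[float]]


def years_bucket(years: float) -> str:
    """Map years of service to the four categories of the example."""
    if years <= 2:
        return "1-2"
    if years <= 5:
        return "3-5"
    if years <= 8:
        return "6-8"
    return "over 8"


def fill_by_cluster(rows: List[Row], target: str,
                    cluster_of: Callable[[Row], Hashable]) -> List[Row]:
    """Fill missing `target` values with the mean of the row's cluster."""
    # Step 404: group the observed target values by cluster.
    observed: Dict[Hashable, List[float]] = defaultdict(list)
    for row in rows:
        if row[target] is not None:
            observed[cluster_of(row)].append(row[target])
    # Steps 406-408: each row with a missing target gets its cluster's mean.
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input table is not modified
        if new_row[target] is None:
            peers = observed.get(cluster_of(row))
            if peers:
                new_row[target] = sum(peers) / len(peers)
        filled.append(new_row)
    return filled
```

A call such as `fill_by_cluster(rows, "bonus", lambda r: years_bucket(r["years"]))` fills each missing "bonus" with the mean bonus of the samples in the same years-of-service category, so different samples may receive different fill values.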
As shown in FIG. 5, in one embodiment, the missing type is missing at random, and step 208 of calculating the missing value from the existing data values in the data table using the filling method corresponding to the missing type includes:
Step 502: determine a first sample set in the data table, in which the data value corresponding to the incomplete field name is present, and a second sample set, in which the data value corresponding to the incomplete field name is missing;
In this embodiment, when the missing type of the incomplete field name to be filled is missing at random, the terminal may also construct a prediction model from the data values of the complete field names in the data table that are related to the incomplete field name, and use the prediction model to predict the data values missing under the incomplete field name. Specifically, the terminal may first divide all samples in the data table into two classes. One class consists of the samples whose data value under the incomplete field name currently to be filled is present; the set of these samples is called the first sample set. The other class consists of the samples whose data value under the incomplete field name currently to be filled is missing; the set of these samples is called the second sample set.
Step 504: construct a prediction model according to the data values, in the first sample set, of the complete field names related to the incomplete field name;
Further, the terminal may determine the complete field names related to the incomplete field name currently to be filled, then obtain the data values of all samples in the first sample set under the determined complete field names, and establish a predictive relationship between these data values and the data values of the incomplete field name.
Step 506: input the data values of each sample in the second sample set under the complete field names into the prediction model, and output, through the prediction model, the predicted value of each sample in the second sample set under the incomplete field name;
Step 508: use the predicted value as the missing value to be filled.
For example, dividing all samples in the data table into two classes according to whether a data value exists under the incomplete field name m gives the first sample set X = (001, 002, 003, 005, ...), where 001 denotes the first sample, 002 the second sample, and so on, and the second sample set X' = (004, 006, ...). The set of data values of the samples in the first sample set X under the incomplete field name m is m = (m1, m2, m3, m5, ...); the data values of the samples in the second sample set X' under the incomplete field name m are missing. The complete field names related to the incomplete field name m are determined, including n, p and q. The data values of the samples in the first sample set X under the complete field names n, p and q are obtained, and a prediction model is constructed from the hidden relationship between n = (n1, n2, n3, n5, ...), p = (p1, p2, p3, p5, ...), q = (q1, q2, q3, q5, ...) and the set m = (m1, m2, m3, m5, ...):
$m = n w_1 + p w_2 + q w_3 + b$, where $w_1$, $w_2$, $w_3$ and $b$ are trainable model parameters.
This model is only an example, used to show that the inputs of the prediction model are n, p and q and that its output is m. When constructing the prediction model, the model parameters may be adjusted by gradient descent so that the constructed prediction model fits the samples in the first sample set.
Once the prediction model has been obtained, the data values of each sample in the second sample set under the complete field names n, p and q are input into the prediction model, which outputs the data value of each sample under the incomplete field name m; the output predicted values are then used to fill the missing data values. In this way, the missing values of the samples under the incomplete field name m are not all identical but depend closely on the related complete field names, which improves the accuracy of the missing values to be filled.
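Steps 502 to 508 with the example linear model may be sketched as follows. This is a minimal sketch, not a definitive implementation: the function names, the mean-squared-error objective, the learning rate and the number of iterations are assumptions, and the inputs are assumed to be scaled to a similar small range so that plain batch gradient descent converges with the default settings.

```python
from typing import List, Sequence, Tuple


def fit_linear(features: Sequence[Sequence[float]], targets: Sequence[float],
               lr: float = 0.1, epochs: int = 1000) -> Tuple[List[float], float]:
    """Fit targets ~ sum_j w_j * x_j + b by batch gradient descent (step 504).

    `features` holds the (n, p, q) values of the first sample set and
    `targets` the corresponding existing values of m.
    """
    n_samples, n_features = len(features), len(features[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(features, targets):
            error = sum(w * v for w, v in zip(weights, x)) + bias - y
            for j, v in enumerate(x):
                grad_w[j] += error * v
            grad_b += error
        # Move against the average gradient of the squared error.
        weights = [w - lr * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_samples
    return weights, bias


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> float:
    """Predicted value of m for one sample of the second sample set (step 506)."""
    return sum(w * v for w, v in zip(weights, x)) + bias
```

Each sample of the second sample set is passed to `predict` with its own (n, p, q) values, so the filled values differ from sample to sample, in line with the embodiment.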
In a specific embodiment, the data table filling method includes the following steps:
Obtain the data table uploaded by the user.
Determine the incomplete field names in the data table, under which data values are missing.
Compute the mean and the standard deviation corresponding to each field name in the data table.
According to the means and standard deviations, calculate the correlation between any two field names by the following formula:
$$\rho(X,Y) = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X\,\sigma_Y}$$
where $\rho(X,Y)$ denotes the correlation between field name X and field name Y; $\mu_X$ denotes the mean corresponding to field name X; $\mu_Y$ denotes the mean corresponding to field name Y; $\sigma_X$ denotes the standard deviation corresponding to field name X; $\sigma_Y$ denotes the standard deviation corresponding to field name Y; and $E[(X-\mu_X)(Y-\mu_Y)]$ is the expected value of Z, with $Z=(X-\mu_X)(Y-\mu_Y)$.
When the correlations between the incomplete field name and the other field names in the data table are all less than the first preset value, determine that the missing type of the incomplete field name is missing completely at random.
When the correlation between the incomplete field name and at least one complete field name in the data table is greater than the second preset value, determine that the missing type of the incomplete field name is missing at random.
When the correlation between the incomplete field name and at least one incomplete field name in the data table is greater than the third preset value, determine that the missing type of the incomplete field name is missing not at random.
When the missing type is missing completely at random and the data value type of the incomplete field name is character type, compute the median of the existing data values of the incomplete field name and use the median as the missing value for the incomplete field name; or compute the mode of the existing data values of the incomplete field name and use the mode as the missing value for the incomplete field name.
When the missing type is missing completely at random and the data value type of the incomplete field name is numeric, compute the average of the existing data values of the incomplete field name and use the average as the missing value for the incomplete field name; or,
When the missing type is missing completely at random, determine the first-type samples in the data table, for which the data value corresponding to the incomplete field name is missing; determine the second-type samples in the data table, for which the data value corresponding to the incomplete field name is present; count the number of first-type samples; calculate the proportion of that number to the total number of samples; and when the proportion is greater than the threshold, replace the data values of the first-type samples under the incomplete field name with the first value and replace the data values of the second-type samples under the incomplete field name with the second value.
When the missing type is missing at random, determine the complete field names related to the incomplete field name; cluster the samples in the data table according to the data values of the complete field names to obtain clusters; determine the third-type samples in the data table, for which the data value corresponding to the incomplete field name is missing; calculate the mean, under the incomplete field name, of the samples included in the cluster to which a third-type sample belongs, and use the calculated mean as the missing value to be filled; or,
When the missing type is missing at random, determine a first sample set in the data table, in which the data value corresponding to the incomplete field name is present, and a second sample set, in which the data value corresponding to the incomplete field name is missing; construct a prediction model according to the data values, in the first sample set, of the complete field names related to the incomplete field name; input the data values of each sample in the second sample set under the complete field names into the prediction model, and output, through the prediction model, the predicted value of each sample in the second sample set under the incomplete field name; and use the predicted value as the missing value to be filled.
Fill the data values missing under the incomplete field name with the missing value.
With the above data table filling method, once the data table uploaded by the user is obtained, the incomplete field names in the data table, namely the field names for which data values are missing, are determined. The missing type of each incomplete field name is determined according to the correlation between that incomplete field name and the other field names in the data table. The missing value corresponding to the incomplete field name is then calculated from the existing data values in the data table, using the filling method that corresponds to the missing type, and that missing value is used to fill the data values missing under the incomplete field name. Following these steps for each incomplete field name fills the data table effectively, so the accuracy of data analysis performed on the filled data table is also significantly improved.
It should be understood that although the steps in the flowcharts of FIGS. 2 to 5 are shown in the sequence indicated by the arrows, they are not necessarily executed in that sequence. Unless explicitly stated herein, the execution of these steps is not strictly limited to a particular order, and the steps may be executed in other orders. Moreover, at least some of the steps in FIGS. 2 to 5 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment and may be executed at different times; nor are they necessarily executed one after another, and they may be executed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in FIG. 6, a data table filling apparatus 600 is provided, including a data table acquisition module 602, an incomplete field name determination module 604, a missing type determination module 606, a missing value calculation module 608, and a filling module 610, wherein:
The data table acquisition module 602 is configured to obtain a data table uploaded by a user;
The incomplete field name determination module 604 is configured to determine an incomplete field name in the data table, the incomplete field name lacking a data value;
The missing type determination module 606 is configured to determine the missing type of the incomplete field name according to the correlation between the incomplete field name and the other field names in the data table;
The missing value calculation module 608 is configured to calculate the missing value from the existing data values in the data table, using the filling method corresponding to the missing type;
The filling module 610 is configured to fill the data values missing under the incomplete field name according to the missing value.
In one of the embodiments, the missing type determination module 606 is further configured to count the number of data values corresponding to each field name in the data table; determine the total number of samples in the data table; and, when the number of data values for a field name is less than the total number of samples, determine that field name to be an incomplete field name.
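The counting rule above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the data table is a list of dictionaries (one per sample) and that a missing data value is represented by `None`; the function name `incomplete_field_names` is hypothetical and does not come from the application.

```python
def incomplete_field_names(table):
    """Return the field names whose count of present values is below the sample total."""
    total = len(table)  # total number of samples in the data table
    fields = list(table[0].keys())
    # Count the data values actually present under each field name.
    counts = {f: sum(1 for row in table if row.get(f) is not None) for f in fields}
    return [f for f in fields if counts[f] < total]


rows = [
    {"age": 30, "income": 10.0},
    {"age": 40, "income": None},
    {"age": 50, "income": 20.0},
]
incomplete = incomplete_field_names(rows)
```

Here `age` has three values for three samples and is therefore a complete field name, while `income` has only two and is reported as incomplete.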
In one of the embodiments, the missing type determination module 606 is further configured to: determine that the missing type of the incomplete field name is completely random missing when the correlations between the incomplete field name and the other field names in the data table are all less than a first preset value; determine that the missing type is random missing when the correlation between the incomplete field name and at least one complete field name in the data table is greater than a second preset value; and determine that the missing type is non-random missing when the correlation between the incomplete field name and at least one other incomplete field name in the data table is greater than a third preset value.
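One possible reading of these three rules is sketched below in Python. The text does not specify the preset values, the order in which the rules are tested, or what happens when no rule applies, so the defaults (0.3, 0.5, 0.5), the test order, and the `None` fallback are all assumptions made for this sketch. The function name and argument layout are likewise hypothetical.

```python
from typing import Dict, Optional, Set


def classify_missing_type(
    correlations: Dict[str, float],
    complete_fields: Set[str],
    first: float = 0.3,   # assumed first preset value
    second: float = 0.5,  # assumed second preset value
    third: float = 0.5,   # assumed third preset value
) -> Optional[str]:
    """Classify one incomplete field from its correlations with the other fields.

    `correlations` maps every other field name to its correlation with the
    incomplete field; `complete_fields` names the fields with no missing values.
    """
    if all(r < first for r in correlations.values()):
        return "completely random missing"
    if any(r > second for f, r in correlations.items() if f in complete_fields):
        return "random missing"
    if any(r > third for f, r in correlations.items() if f not in complete_fields):
        return "non-random missing"
    return None  # the text does not cover this case; left undecided here


example = classify_missing_type({"age": 0.8, "bonus": 0.1}, complete_fields={"age"})
```

The sketch compares raw correlation values, as the text does. Whether absolute values should be compared instead is not stated in the source.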
In one of the embodiments, the missing type is completely random missing, and the missing value calculation module 608 is further configured to: when the data values under the incomplete field name are of character type, compute the median of the existing data values under the incomplete field name and use that median as the missing value corresponding to the incomplete field name, or compute the mode of the existing data values under the incomplete field name and use that mode as the missing value; and, when the data values under the incomplete field name are of numeric type, compute the average of the existing data values under the incomplete field name and use that average as the missing value corresponding to the incomplete field name.
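A minimal Python sketch of this type-dependent choice follows, using only the standard `statistics` module. It assumes a column is given as a list in which `None` marks a missing entry and that at least one value is present. For character-type data the sketch uses the mode only, because a median additionally requires an ordering of the values that the text does not define; the function name `fill_value_for_field` is hypothetical.

```python
from statistics import mean, mode


def fill_value_for_field(values):
    """Return the fill value for one incomplete field; None marks a missing entry."""
    present = [v for v in values if v is not None]
    if all(isinstance(v, str) for v in present):
        # Character type: use the most frequent existing value (the mode).
        return mode(present)
    # Numeric type: use the arithmetic mean of the existing values.
    return mean(present)


city_fill = fill_value_for_field(["Shenzhen", "Beijing", "Shenzhen", None])
income_fill = fill_value_for_field([1.0, None, 3.0])
```

Every missing entry under the field is then filled with the single returned value.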
In one of the embodiments, the missing type is completely random missing, and the missing value calculation module 608 is further configured to: determine the first-type samples in the data table, namely the samples that lack a data value under the incomplete field name; determine the second-type samples, namely the samples that have a data value under the incomplete field name; count the number of first-type samples; calculate the ratio of that number to the total number of samples; and, when the ratio is greater than a threshold, replace the data values of the first-type samples under the incomplete field name with a first value and replace the data values of the second-type samples under the incomplete field name with a second value.
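The effect of this embodiment is to turn a heavily incomplete field into a two-valued indicator of whether a value was present. A minimal Python sketch is given below; the threshold of 0.5 and the first and second values 0 and 1 are assumptions chosen for illustration, since the text leaves all three unspecified, and `to_presence_indicator` is a hypothetical name.

```python
def to_presence_indicator(values, threshold=0.5, first_value=0, second_value=1):
    """Replace a mostly-missing column by a missing/present indicator.

    `None` marks a missing entry. The column is returned unchanged when the
    share of missing entries does not exceed the threshold.
    """
    missing_count = sum(1 for v in values if v is None)  # first-type samples
    ratio = missing_count / len(values)
    if ratio <= threshold:
        return list(values)
    # First-type samples get the first value, second-type samples the second.
    return [first_value if v is None else second_value for v in values]


mostly_missing = to_presence_indicator([None, None, 5, None])
mostly_present = to_presence_indicator([1, None, 3, 4])
```

In the first call three of four entries are missing (ratio 0.75), so the column becomes an indicator; in the second call the ratio is 0.25 and the column is left as it is for some other filling method.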
In one of the embodiments, the missing type is random missing, and the missing value calculation module 608 is further configured to: determine a complete field name related to the incomplete field name; cluster the samples in the data table according to the data values of the complete field name to obtain clusters; determine the third-type samples in the data table, namely the samples that lack a data value under the incomplete field name; and calculate the mean, under the incomplete field name, of the samples included in the cluster to which a third-type sample belongs, using the calculated mean as the missing value to be filled.
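The cluster-then-average idea can be sketched as follows. The text does not name a clustering algorithm, so this sketch substitutes a very small one-dimensional k-means over a single related complete field, which is an assumption made only to keep the example self-contained. The names `kmeans_1d` and `cluster_mean_fill` are hypothetical, `None` marks a missing value, and `k` must be at least 2.

```python
def kmeans_1d(xs, k=2, iterations=20):
    """Tiny 1-D k-means (k >= 2); returns one cluster label per value."""
    lo, hi = min(xs), max(xs)
    centers = [lo + (hi - lo) * i / (k - 1) for i in range(k)]  # evenly spaced start
    labels = [0] * len(xs)
    for _ in range(iterations):
        # Assign each value to its nearest center, then recompute the centers.
        labels = [min(range(k), key=lambda c: abs(x - centers[c])) for x in xs]
        for c in range(k):
            members = [x for x, lab in zip(xs, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels


def cluster_mean_fill(rows, complete_field, incomplete_field, k=2):
    """Fill each missing value with the mean of its cluster's observed values."""
    labels = kmeans_1d([row[complete_field] for row in rows], k)
    filled = [dict(row) for row in rows]  # do not modify the input table
    for c in set(labels):
        present = [r[incomplete_field] for r, lab in zip(rows, labels)
                   if lab == c and r[incomplete_field] is not None]
        if not present:
            continue  # no observed value in this cluster; its gaps stay unfilled
        cluster_mean = sum(present) / len(present)
        for r, lab in zip(filled, labels):
            if lab == c and r[incomplete_field] is None:
                r[incomplete_field] = cluster_mean
    return filled


rows = [
    {"age": 20, "income": 10.0}, {"age": 22, "income": None}, {"age": 21, "income": 14.0},
    {"age": 60, "income": 50.0}, {"age": 62, "income": None}, {"age": 61, "income": 54.0},
]
filled = cluster_mean_fill(rows, "age", "income")
```

With these rows the two age groups form two clusters, so the younger sample with a missing income receives the mean of its own cluster rather than the mean of the whole table.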
In one of the embodiments, the missing type is random missing, and the missing value calculation module 608 is further configured to: determine a first sample set, in which the data value corresponding to the incomplete field name exists, and a second sample set, in which the data value corresponding to the incomplete field name is missing; build a prediction model from the data values, in the first sample set, of the complete field names related to the incomplete field name; input the data values of each sample in the second sample set under those complete field names into the prediction model, which outputs a predicted value under the incomplete field name for each sample in the second sample set; and use the predicted value as the missing value to be filled.
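The text does not say which kind of prediction model is built. As one simple stand-in, the Python sketch below fits an ordinary least-squares line on a single related complete field, using the first sample set for fitting and the second sample set for prediction. The model choice and the names `fit_line` and `predict_fill` are assumptions for illustration, and `None` marks a missing value.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the complete field is not constant
    return slope, mean_y - slope * mean_x


def predict_fill(rows, complete_field, incomplete_field):
    """Fit on samples with a value present, predict for samples without one."""
    first_set = [r for r in rows if r[incomplete_field] is not None]
    slope, intercept = fit_line(
        [r[complete_field] for r in first_set],
        [r[incomplete_field] for r in first_set],
    )
    filled = [dict(r) for r in rows]  # do not modify the input table
    for r in filled:
        if r[incomplete_field] is None:  # a member of the second sample set
            r[incomplete_field] = slope * r[complete_field] + intercept
    return filled


rows = [
    {"x": 1, "y": 3.0},
    {"x": 2, "y": 5.0},
    {"x": 3, "y": None},
    {"x": 4, "y": 9.0},
]
filled = predict_fill(rows, "x", "y")
```

Any other regressor or classifier could take the place of `fit_line`; the split into a fitting set and a prediction set is the part that follows the text.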
In one of the embodiments, the data table filling apparatus 600 further includes a correlation calculation module. The correlation calculation module is configured to compute the mean and standard deviation corresponding to each field name in the data table and, from the means and standard deviations, calculate the correlation between any two field names according to the following formula:
$$\rho_{(x,y)} = \frac{E\left[(X-\mu_X)(Y-\mu_Y)\right]}{\sigma_X\,\sigma_Y}$$
where $\rho_{(x,y)}$ denotes the correlation between field name X and field name Y; $\mu_X$ denotes the mean corresponding to field name X; $\mu_Y$ denotes the mean corresponding to field name Y; $\sigma_X$ denotes the standard deviation corresponding to field name X; $\sigma_Y$ denotes the standard deviation corresponding to field name Y; and $E[(X-\mu_X)(Y-\mu_Y)]$ is the expected value of $Z$, with $Z=(X-\mu_X)(Y-\mu_Y)$.
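The quantity defined by these symbols is the Pearson correlation coefficient, that is, the expected value of the product of the two deviations divided by the product of the two standard deviations. A minimal Python sketch follows. How samples with a missing value are treated is not stated in the text, so dropping any pair in which either value is `None` is an assumption, and the function name `correlation` is hypothetical.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation of two columns; pairs containing None are skipped."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mu_x = sum(x for x, _ in pairs) / n
    mu_y = sum(y for _, y in pairs) / n
    # E[(X - mu_X)(Y - mu_Y)], estimated as the average over the paired samples.
    expected_z = sum((x - mu_x) * (y - mu_y) for x, y in pairs) / n
    sigma_x = sqrt(sum((x - mu_x) ** 2 for x, _ in pairs) / n)
    sigma_y = sqrt(sum((y - mu_y) ** 2 for _, y in pairs) / n)
    return expected_z / (sigma_x * sigma_y)  # undefined if either column is constant


rho_up = correlation([1, 2, 3, None], [2, 4, 6, 8])
rho_down = correlation([1, 2, 3], [6, 4, 2])
```

A value near 1 or -1 indicates a strong linear relationship between the two fields, and a value near 0 indicates little or none.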
With the above data table filling apparatus 600, once the data table uploaded by the user is obtained, the incomplete field names in the data table, namely the field names for which data values are missing, are determined. The missing type of each incomplete field name is determined according to the correlation between that incomplete field name and the other field names in the data table. The missing value corresponding to the incomplete field name is then calculated from the existing data values in the data table, using the filling method that corresponds to the missing type, and that missing value is used to fill the data values missing under the incomplete field name. Following these steps for each incomplete field name fills the data table effectively, so the accuracy of data analysis performed on the filled data table is also significantly improved.
For specific limitations of the data table filling apparatus 600, refer to the limitations of the data table filling method above; they are not repeated here. Each module of the data table filling apparatus 600 may be implemented wholly or partly in software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke them and perform the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a terminal, and its internal structure may be as shown in FIG. 7. The computer device includes a processor, a memory, a network interface, and an input device connected through a system bus. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and computer-readable instructions, and the internal memory provides an environment for running the operating system and the computer-readable instructions stored in the non-volatile storage medium. The network interface is used to communicate with external terminals over a network connection. The computer-readable instructions, when executed by the processor, implement a data table filling method. The input device may be a touch layer covering a display screen, a button, trackball, or touchpad provided on the housing of the computer device, or an external keyboard, touchpad, or mouse.
Those skilled in the art will understand that the structure shown in FIG. 7 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, the data table filling apparatus provided by the present application may be implemented in the form of computer-readable instructions that can run on a computer device as shown in FIG. 7. The memory of the computer device may store the program modules that make up the data table filling apparatus 600, for example the data table acquisition module 602, the incomplete field name determination module 604, the missing type determination module 606, the missing value calculation module 608, and the filling module 610 shown in FIG. 6. The computer-readable instructions formed by these program modules cause the processor to execute the steps of the data table filling method of the embodiments of the present application described in this specification.
For example, the computer device shown in FIG. 7 may execute step S202 through the data table acquisition module of the data table filling apparatus 600 shown in FIG. 6, step S204 through the incomplete field name determination module, step S206 through the missing type determination module, step S208 through the missing value calculation module, and step S210 through the filling module.
In one embodiment, a computer device is provided, including a memory and one or more processors. The memory stores computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to execute the steps of the above data table filling method. These steps may be the steps of the data table filling method of any of the above embodiments.
In one embodiment, one or more non-volatile computer-readable storage media storing computer-readable instructions are provided. When executed by one or more processors, the computer-readable instructions cause the one or more processors to execute the steps of the above data table filling method. These steps may be the steps of the data table filling method of any of the above embodiments.
A person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments may be completed by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a database, or another medium used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not every possible combination of the technical features in the above embodiments has been described; however, any combination of these technical features that involves no contradiction should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention patent. It should be noted that a person of ordinary skill in the art may make a number of variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (20)

  1. A data table filling method, comprising:
    obtaining a data table uploaded by a user;
    determining an incomplete field name in the data table, the incomplete field name lacking a data value;
    determining a missing type of the incomplete field name according to the correlation between the incomplete field name and other field names in the data table;
    calculating a missing value from existing data values in the data table, using a filling method corresponding to the missing type; and
    filling the data value missing under the incomplete field name according to the missing value.
  2. The method according to claim 1, wherein determining the missing type of the incomplete field name according to the correlation between the incomplete field name and other field names in the data table comprises:
    when the correlations between the incomplete field name and the other field names in the data table are all less than a first preset value, determining that the missing type of the incomplete field name is completely random missing;
    when the correlation between the incomplete field name and at least one complete field name in the data table is greater than a second preset value, determining that the missing type of the incomplete field name is random missing; and
    when the correlation between the incomplete field name and at least one incomplete field name in the data table is greater than a third preset value, determining that the missing type of the incomplete field name is non-random missing.
  3. The method according to claim 1, wherein the missing type is completely random missing, and calculating the missing value from existing data values in the data table, using the filling method corresponding to the missing type, comprises:
    when the data values corresponding to the incomplete field name are of character type, computing the median of the existing data values of the incomplete field name and using the median as the missing value corresponding to the incomplete field name; or computing the mode of the existing data values of the incomplete field name and using the mode as the missing value corresponding to the incomplete field name; and
    when the data values corresponding to the incomplete field name are of numeric type, computing the average of the existing data values of the incomplete field name and using the average as the missing value corresponding to the incomplete field name.
  4. The method according to claim 1, wherein the missing type is completely random missing, and calculating the missing value from existing data values in the data table, using the filling method corresponding to the missing type, comprises:
    determining first-type samples in the data table, the first-type samples lacking the data value corresponding to the incomplete field name;
    determining second-type samples in the data table, the second-type samples having the data value corresponding to the incomplete field name;
    counting the number of the first-type samples;
    calculating the ratio of the number of first-type samples to the total number of samples; and
    when the ratio is greater than a threshold, replacing the data values of the first-type samples under the incomplete field name with a first value, and replacing the data values of the second-type samples under the incomplete field name with a second value.
  5. The method according to claim 1, wherein the missing type is random missing, and calculating the missing value from existing data values in the data table, using the filling method corresponding to the missing type, comprises:
    determining a complete field name related to the incomplete field name;
    clustering the samples in the data table according to the data values of the complete field name to obtain clusters;
    determining third-type samples in the data table, the third-type samples lacking the data value corresponding to the incomplete field name; and
    calculating the mean, under the incomplete field name, of the samples included in the cluster to which a third-type sample belongs, and using the calculated mean as the missing value to be filled.
  6. The method according to claim 1, wherein the missing type is random missing, and calculating the missing value from existing data values in the data table, using the filling method corresponding to the missing type, comprises:
    determining a first sample set in the data table, in which the data value corresponding to the incomplete field name exists, and a second sample set, in which the data value corresponding to the incomplete field name is missing;
    building a prediction model from the data values, in the first sample set, corresponding to the complete field names related to the incomplete field name;
    inputting the data values of each sample in the second sample set under the complete field names into the prediction model, and outputting, through the prediction model, a predicted value under the incomplete field name for each sample in the second sample set; and
    using the predicted value as the missing value to be filled.
  7. The method according to any one of claims 1 to 6, further comprising:
    computing the mean and standard deviation corresponding to each field name in the data table; and
    calculating, from the means and standard deviations, the correlation between any two field names according to the following formula:
    $$\rho_{(x,y)} = \frac{E\left[(X-\mu_X)(Y-\mu_Y)\right]}{\sigma_X\,\sigma_Y}$$
    where $\rho_{(x,y)}$ denotes the correlation between field name X and field name Y; $\mu_X$ denotes the mean corresponding to field name X; $\mu_Y$ denotes the mean corresponding to field name Y; $\sigma_X$ denotes the standard deviation corresponding to field name X; $\sigma_Y$ denotes the standard deviation corresponding to field name Y; and $E[(X-\mu_X)(Y-\mu_Y)]$ is the expected value of $Z$, with $Z=(X-\mu_X)(Y-\mu_Y)$.
  8. A data table filling apparatus, comprising:
    a data table acquisition module, configured to obtain a data table uploaded by a user;
    an incomplete field name determination module, configured to determine an incomplete field name in the data table, the incomplete field name lacking a data value;
    a missing type determination module, configured to determine a missing type of the incomplete field name according to the correlation between the incomplete field name and other field names in the data table;
    a missing value calculation module, configured to calculate a missing value from existing data values in the data table, using a filling method corresponding to the missing type; and
    a filling module, configured to fill the data value missing under the incomplete field name according to the missing value.
  9. The apparatus according to claim 8, wherein the missing type determination module is further configured to: determine that the missing type of the incomplete field name is completely random missing when the correlations between the incomplete field name and the other field names in the data table are all less than a first preset value; determine that the missing type of the incomplete field name is random missing when the correlation between the incomplete field name and at least one complete field name in the data table is greater than a second preset value; and determine that the missing type of the incomplete field name is non-random missing when the correlation between the incomplete field name and at least one incomplete field name in the data table is greater than a third preset value.
  10. The apparatus according to claim 8, wherein the missing value calculation module is further configured to: when the data values corresponding to the incomplete field name are of character type, compute the median of the existing data values of the incomplete field name and use the median as the missing value corresponding to the incomplete field name, or compute the mode of the existing data values of the incomplete field name and use the mode as the missing value corresponding to the incomplete field name; and, when the data values corresponding to the incomplete field name are of numeric type, compute the average of the existing data values of the incomplete field name and use the average as the missing value corresponding to the incomplete field name.
  11. The apparatus according to claim 8, wherein the missing type is completely random missing, and the missing value calculation module is further configured to: determine first-type samples in the data table, the first-type samples lacking the data value corresponding to the incomplete field name; determine second-type samples in the data table, the second-type samples having the data value corresponding to the incomplete field name; count the number of the first-type samples; calculate the ratio of the number of first-type samples to the total number of samples; and, when the ratio is greater than a threshold, replace the data values of the first-type samples under the incomplete field name with a first value and replace the data values of the second-type samples under the incomplete field name with a second value.
  12. The apparatus according to claim 8, wherein the missing type is random missing, and the missing value calculation module is further configured to: determine a complete field name related to the incomplete field name; cluster the samples in the data table according to the data values of the complete field name to obtain clusters; determine third-type samples in the data table, the third-type samples lacking the data value corresponding to the incomplete field name; and calculate the mean, under the incomplete field name, of the samples included in the cluster to which a third-type sample belongs, using the calculated mean as the missing value to be filled.
  13. The apparatus according to claim 8, wherein the missing type is random missing, and the missing value calculation module is further configured to: determine a first sample set in the data table, in which the data value corresponding to the incomplete field name is present, and a second sample set, in which the data value corresponding to the incomplete field name is missing; build a prediction model from the data values, in the first sample set, of the complete field names related to the incomplete field name; input the data values of each sample in the second sample set under the complete field names into the prediction model, which outputs a predicted value under the incomplete field name for each sample in the second sample set; and use the predicted values as the missing values to be filled.
  14. The apparatus according to any one of claims 8 to 13, further comprising a correlation calculation module configured to: compute the mean and standard deviation corresponding to each field name in the data table; and calculate, from the means and standard deviations, the correlation between any two field names according to the following formula:
    $$\rho(X,Y) = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$$
    where $\rho(X,Y)$ is the correlation between field name X and field name Y; $\mu_X$ is the mean corresponding to field name X; $\mu_Y$ is the mean corresponding to field name Y; $\sigma_X$ is the standard deviation corresponding to field name X; $\sigma_Y$ is the standard deviation corresponding to field name Y; and $E[(X-\mu_X)(Y-\mu_Y)]$ is the expected value of $Z$, with $Z=(X-\mu_X)(Y-\mu_Y)$.
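The correlation between two field names described in claim 14 can be illustrated with a short Python sketch. It assumes the formula is the standard Pearson correlation coefficient, which is what the symbol definitions describe, and it uses population (not sample) statistics. The function name `correlation`, the use of `None` to mark a missing value, and the choice to skip rows where either value is missing are assumptions made for illustration; the claims do not specify them.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation of two columns of equal length.

    Rows where either value is None are skipped (an assumption;
    the claims do not say how missing rows are treated here).
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n == 0:
        raise ValueError("no rows with both values present")

    mu_x = sum(x for x, _ in pairs) / n
    mu_y = sum(y for _, y in pairs) / n

    # E[(X - mu_X)(Y - mu_Y)], the expected value of Z
    expected_z = sum((x - mu_x) * (y - mu_y) for x, y in pairs) / n

    # Population standard deviations of X and Y
    sigma_x = sqrt(sum((x - mu_x) ** 2 for x, _ in pairs) / n)
    sigma_y = sqrt(sum((y - mu_y) ** 2 for _, y in pairs) / n)
    if sigma_x == 0 or sigma_y == 0:
        raise ValueError("correlation is undefined for a constant column")

    return expected_z / (sigma_x * sigma_y)
```

For example, `correlation([1, 2, 3, 4], [2, 4, 6, 8])` is approximately 1 because the second column is an exact positive multiple of the first.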
  15. A computer device comprising a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps:
    obtaining a data table uploaded by a user;
    determining an incomplete field name in the data table, the incomplete field name being a field name that lacks data values;
    determining the missing type of the incomplete field name according to the correlation between the incomplete field name and the other field names in the data table;
    calculating the missing value from the existing data values in the data table, using the filling method corresponding to the missing type; and
    filling in the missing data values of the incomplete field name according to the missing value.
  16. The computer device according to claim 15, wherein the processor, when executing the computer-readable instructions, further performs the following steps:
    when the correlations between the incomplete field name and the other field names in the data table are all less than a first preset value, determining that the missing type of the incomplete field name is completely random missing;
    when the correlation between the incomplete field name and at least one complete field name in the data table is greater than a second preset value, determining that the missing type of the incomplete field name is random missing; and
    when the correlation between the incomplete field name and at least one incomplete field name in the data table is greater than a third preset value, determining that the missing type of the incomplete field name is non-random missing.
  17. The computer device according to claim 15, wherein the missing type is completely random missing, and the processor, when executing the computer-readable instructions, further performs the following steps:
    when the data value type corresponding to the incomplete field name is a character type, computing the median of the existing data values of the incomplete field name and using the median as the missing value corresponding to the incomplete field name, or computing the mode of the existing data values of the incomplete field name and using the mode as the missing value corresponding to the incomplete field name; and
    when the data value type corresponding to the incomplete field name is a numeric type, computing the mean of the existing data values of the incomplete field name and using the mean as the missing value corresponding to the incomplete field name.
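The three threshold rules of claim 16 can be sketched as a small Python function. The claims give no values for the preset thresholds and do not say which rule wins when more than one condition holds, so the default thresholds, the order of the checks, the use of the absolute value of the correlation, and the `"undetermined"` fallback are all assumptions made for illustration. The function name `classify_missing_type` and its parameters are hypothetical.

```python
def classify_missing_type(correlations, complete_fields, t1=0.3, t2=0.6, t3=0.6):
    """Classify the missing type of one incomplete field name.

    correlations: maps every other field name to its correlation with
        the incomplete field name.
    complete_fields: set of field names that have no missing values;
        any other key in `correlations` is treated as incomplete.
    t1, t2, t3: the first, second and third preset values (assumed).
    """
    # Assumption: a strong negative correlation counts as strongly related.
    strength = {name: abs(r) for name, r in correlations.items()}

    # Rule 1: all correlations are below the first preset value.
    if all(s < t1 for s in strength.values()):
        return "completely random missing"

    # Rule 2: at least one complete field exceeds the second preset value.
    if any(s > t2 for name, s in strength.items() if name in complete_fields):
        return "random missing"

    # Rule 3: at least one incomplete field exceeds the third preset value.
    if any(s > t3 for name, s in strength.items() if name not in complete_fields):
        return "non-random missing"

    # No rule in the claim applies (this fallback is an assumption).
    return "undetermined"
```

A caller would compute the correlations first, pass them in together with the set of complete field names, and then select the filling method that corresponds to the returned missing type.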
  18. One or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
    obtaining a data table uploaded by a user;
    determining an incomplete field name in the data table, the incomplete field name being a field name that lacks data values;
    determining the missing type of the incomplete field name according to the correlation between the incomplete field name and the other field names in the data table;
    calculating the missing value from the existing data values in the data table, using the filling method corresponding to the missing type; and
    filling in the missing data values of the incomplete field name according to the missing value.
  19. The storage medium according to claim 18, wherein the computer-readable instructions, when executed by the processor, further cause the following steps to be performed:
    when the correlations between the incomplete field name and the other field names in the data table are all less than a first preset value, determining that the missing type of the incomplete field name is completely random missing;
    when the correlation between the incomplete field name and at least one complete field name in the data table is greater than a second preset value, determining that the missing type of the incomplete field name is random missing; and
    when the correlation between the incomplete field name and at least one incomplete field name in the data table is greater than a third preset value, determining that the missing type of the incomplete field name is non-random missing.
  20. The storage medium according to claim 18, wherein the missing type is completely random missing, and the computer-readable instructions, when executed by the processor, further cause the following steps to be performed:
    when the data value type corresponding to the incomplete field name is a character type, computing the median of the existing data values of the incomplete field name and using the median as the missing value corresponding to the incomplete field name, or computing the mode of the existing data values of the incomplete field name and using the mode as the missing value corresponding to the incomplete field name; and
    when the data value type corresponding to the incomplete field name is a numeric type, computing the mean of the existing data values of the incomplete field name and using the mean as the missing value corresponding to the incomplete field name.
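The type-dependent filling recited for the completely random missing case can be illustrated in Python as follows. This is a minimal sketch, not the applicant's implementation: it represents a column as a list with `None` for missing values, takes the mode branch for character data (the claim also permits the median, which is not used here), and breaks ties between equally frequent values by first occurrence. The function name `fill_column` is hypothetical.

```python
from collections import Counter
from statistics import mean


def fill_column(values):
    """Return a copy of `values` with every None replaced by a fill value.

    Numeric columns are filled with the mean of the existing values;
    all other columns are filled with the most frequent existing value.
    """
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("column has no existing values to compute a fill from")

    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if is_numeric:
        fill = mean(present)  # numeric type: mean of existing values
    else:
        fill = Counter(present).most_common(1)[0][0]  # character type: mode

    return [fill if v is None else v for v in values]
```

For instance, a numeric column `[1, None, 3]` is filled with the mean of 1 and 3, and a character column `["a", None, "b", "a"]` is filled with `"a"`.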
PCT/CN2019/122323 2019-01-02 2019-12-02 Data table filling method, apparatus, computer device, and storage medium WO2020140662A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910001784.2 2019-01-02
CN201910001784.2A CN109783788A (en) 2019-01-02 2019-01-02 Tables of data complementing method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2020140662A1 true WO2020140662A1 (en) 2020-07-09

Family

ID=66499820

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/122323 WO2020140662A1 (en) 2019-01-02 2019-12-02 Data table filling method, apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN109783788A (en)
WO (1) WO2020140662A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109783788A (en) * 2019-01-02 2019-05-21 深圳壹账通智能科技有限公司 Tables of data complementing method, device, computer equipment and storage medium
CN110570229A (en) * 2019-07-30 2019-12-13 平安科技(深圳)有限公司 User information processing method and device, computer equipment and storage medium
CN112036492B (en) * 2020-09-01 2024-02-02 腾讯科技(深圳)有限公司 Sample set processing method, device, equipment and storage medium
CN112734566A (en) * 2021-01-19 2021-04-30 中国农业银行股份有限公司 Credit limit acquisition method and device and computer equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102025531A (en) * 2010-08-16 2011-04-20 北京亿阳信通软件研究院有限公司 Filling method and device thereof for performance data
CN102486790A (en) * 2010-12-02 2012-06-06 财团法人资讯工业策进会 System and method for filling data missing value
CN103246702A (en) * 2013-04-02 2013-08-14 大连理工大学 Industrial sequential data missing filling method based on sectional state displaying
CN107203774A (en) * 2016-03-17 2017-09-26 阿里巴巴集团控股有限公司 The method and device that the belonging kinds of data are predicted
CN109783788A (en) * 2019-01-02 2019-05-21 深圳壹账通智能科技有限公司 Tables of data complementing method, device, computer equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105893766B (en) * 2016-04-06 2022-03-18 成都数联易康科技有限公司 Grading diagnosis and treatment evaluation method based on data mining
CN107193876B (en) * 2017-04-21 2020-10-09 美林数据技术股份有限公司 Missing data filling method based on nearest neighbor KNN algorithm

Also Published As

Publication number Publication date
CN109783788A (en) 2019-05-21

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19906775

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 02.11.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19906775

Country of ref document: EP

Kind code of ref document: A1