CN109783788A - Data table filling method, device, computer equipment and storage medium

Info

Publication number: CN109783788A
Application number: CN201910001784.2A
Authority: CN
Inventors: 蔡健; 杨镭; 黄北辰; 郭凌峰; 付晓
Original assignee: OneConnect Smart Technology Co Ltd
Current assignee: OneConnect Smart Technology Co Ltd
Priority date: 2019-01-02
Filing date: 2019-01-02
Publication date: 2019-05-21
Also published as: WO2020140662A1

Abstract

This application relates to a data table filling method, device, computer equipment and storage medium. The data table filling method relates to the technical field of data processing, and the method comprises: obtaining a data table uploaded by a user; determining an incomplete field name in the data table, the incomplete field name lacking data values; determining the missing type of the incomplete field name according to the degree of correlation between the incomplete field name and other field names in the data table; calculating missing values from the data values existing in the data table, using the filling mode corresponding to the missing type; and filling the data values missing under the incomplete field name according to the missing values. This scheme can fill the data values missing under each incomplete field name in the data table, so the data table is effectively filled and the accuracy of data analysis based on the filled data table can be significantly improved.

Description

Data table filling method, device, computer equipment and storage medium

Technical field

This application relates to the field of computer technology, and in particular to a data table filling method, device, computer equipment and storage medium.

Background art

Report data is the data in a data table and is one of the most common forms of data in practical applications. It can be used for data analysis or to generate reports shown to users, for example loan business data, human resources data and insurance business data. However, data values in such report data are inevitably missing because of misoperation, system failure, human factors and the like.

In existing business data report platforms, data values missing from a data table are usually left unprocessed, or the samples with missing data values are simply deleted. This often distorts the distribution of the report data in the whole data table and affects the accuracy of data analysis.

Summary of the invention

Based on this, it is necessary, in view of the above technical problem, to provide a data table filling method, device, computer equipment and storage medium that can effectively fill the data values missing from a data table.

A data table filling method, the method comprising:

obtaining a data table uploaded by a user;

determining an incomplete field name in the data table, the incomplete field name lacking data values;

determining the missing type of the incomplete field name according to the degree of correlation between the incomplete field name and other field names in the data table;

calculating missing values from the data values existing in the data table, using the filling mode corresponding to the missing type;

filling the data values missing under the incomplete field name according to the missing values.

In one of the embodiments, determining the incomplete field name lacking data values in the data table includes:

counting the number of data values corresponding to each field name in the data table;

determining the total number of samples in the data table;

when the number is less than the total number of samples, determining the field name to be an incomplete field name.

In one of the embodiments, determining the missing type of the incomplete field name according to the degree of correlation between the incomplete field name and other field names in the data table includes:

when the degrees of correlation between the incomplete field name and the other field names in the data table are all less than a first preset value, determining that the missing type of the incomplete field name is missing completely at random;

when the degree of correlation between the incomplete field name and at least one complete field name in the data table is greater than a second preset value, determining that the missing type of the incomplete field name is missing at random;

when the degree of correlation between the incomplete field name and at least one other incomplete field name in the data table is greater than a third preset value, determining that the missing type of the incomplete field name is missing not at random.

In one of the embodiments, the missing type is missing completely at random, and calculating the missing values from the data values existing in the data table, using the filling mode corresponding to the missing type, includes:

when the data value type of the incomplete field name is the character type, computing the median of the existing data values of the incomplete field name and using that median as the missing value of the incomplete field name; or computing the mode of the existing data values of the incomplete field name and using that mode as the missing value of the incomplete field name;

when the data value type of the incomplete field name is the numeric type, computing the average of the existing data values of the incomplete field name and using that average as the missing value of the incomplete field name.

In one of the embodiments, the missing type is missing completely at random, and calculating the missing values from the data values existing in the data table, using the filling mode corresponding to the missing type, includes:

determining the first-class samples in the data table, whose data values under the incomplete field name are missing;

determining the second-class samples in the data table, whose data values under the incomplete field name exist;

counting the number of first-class samples;

calculating the ratio of that number to the total number of samples;

when the ratio is greater than a threshold, replacing the data values of the first-class samples under the incomplete field name with a first value, and replacing the data values of the second-class samples under the incomplete field name with a second value.

In one of the embodiments, the missing type is missing at random, and calculating the missing values from the data values existing in the data table, using the filling mode corresponding to the missing type, includes:

determining the complete field names related to the incomplete field name;

clustering the samples in the data table according to the data values of the complete field names to obtain clusters;

determining the third-class samples in the data table, whose data values under the incomplete field name are missing;

calculating the mean, under the incomplete field name, of the samples included in the cluster to which a third-class sample belongs, and using the calculated mean as the missing value to be filled.
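The cluster-based filling steps above can be sketched as follows. This is only a minimal illustration under stated assumptions: a single numeric complete field, a tiny hand-written one-dimensional k-means (the text does not name a clustering algorithm), and hypothetical names `kmeans_1d` and `fill_by_cluster`.

```python
def kmeans_1d(points, k, iterations=20):
    """Tiny 1-D k-means (an assumed choice of clustering algorithm).
    Returns a cluster index for every point."""
    lo, hi = min(points), max(points)
    # Spread the initial centers evenly over the value range.
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest center.
        labels = [min(range(k), key=lambda c: abs(p - centers[c])) for p in points]
        # Move each center to the mean of its members (keep it if empty).
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels


def fill_by_cluster(complete_values, incomplete_values, k=2):
    """Fill each missing value (None) with the mean of the existing values
    of the samples in the same cluster."""
    labels = kmeans_1d(complete_values, k)
    filled = list(incomplete_values)
    for i, value in enumerate(incomplete_values):
        if value is None:
            peers = [incomplete_values[j] for j, label in enumerate(labels)
                     if label == labels[i] and incomplete_values[j] is not None]
            if peers:  # leave the gap if the cluster has no existing value
                filled[i] = sum(peers) / len(peers)
    return filled


# Hypothetical data: "age" is complete, "loan amount" has two gaps.
ages = [22, 24, 23, 50, 52, 51]
loans = [10, 12, None, 100, None, 110]
filled_loans = fill_by_cluster(ages, loans, k=2)
```

With more than one related complete field, the distance in the assignment step would need to be computed over several dimensions; that extension is not shown here.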

In one of the embodiments, calculating the missing values from the data values existing in the data table, using the filling mode corresponding to the missing type, includes:

determining a first sample set in the data table, whose data values under the incomplete field name exist, and a second sample set, whose data values under the incomplete field name are missing;

constructing a prediction model from the data values, in the first sample set, of the complete field names related to the incomplete field name;

inputting the data values of each sample in the second sample set under the complete field names into the prediction model, and outputting, through the prediction model, the predicted value of each sample in the second sample set under the incomplete field name;

using the predicted values as the missing values to be filled.
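The prediction-based steps above can be sketched as follows. The text does not say what kind of prediction model is built, so this sketch assumes a one-variable least-squares linear regression purely for illustration; `fit_linear` and `fill_by_prediction` are hypothetical names.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y = slope * x + intercept.
    Assumes xs contains at least two distinct values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def fill_by_prediction(complete_values, incomplete_values):
    """Train on samples whose value exists (first sample set) and predict
    the value for samples where it is missing (second sample set)."""
    first_set = [(x, y) for x, y in zip(complete_values, incomplete_values)
                 if y is not None]
    slope, intercept = fit_linear([x for x, _ in first_set],
                                  [y for _, y in first_set])
    return [y if y is not None else slope * x + intercept
            for x, y in zip(complete_values, incomplete_values)]


# Hypothetical data: the third sample lacks a value under the incomplete field.
predicted = fill_by_prediction([1, 2, 3, 4], [2, 4, None, 8])
```

Any other regression or classification model could take the place of `fit_linear`; the split into a training set with existing values and a prediction set with missing values is the part that follows the text.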

In one of the embodiments, the method further includes:

computing the mean and standard deviation corresponding to each field name in the data table;

calculating, from the mean and standard deviation, the degree of correlation between any two field names according to the following formula:

$$\rho_{X,Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$$

where $\rho_{X,Y}$ denotes the degree of correlation between field name X and field name Y; $\mu_X$ denotes the mean corresponding to field name X; $\mu_Y$ denotes the mean corresponding to field name Y; $\sigma_X$ denotes the standard deviation corresponding to field name X; $\sigma_Y$ denotes the standard deviation corresponding to field name Y; and $E[(X-\mu_X)(Y-\mu_Y)]$ is the expected value of Z, with $Z = (X-\mu_X)(Y-\mu_Y)$.

A data table filling device, the device comprising:

a data table obtaining module, configured to obtain a data table uploaded by a user;

an incomplete field name determining module, configured to determine an incomplete field name in the data table, the incomplete field name lacking data values;

a missing type determining module, configured to determine the missing type of the incomplete field name according to the degree of correlation between the incomplete field name and other field names in the data table;

a missing value calculating module, configured to calculate missing values from the data values existing in the data table, using the filling mode corresponding to the missing type;

a filling module, configured to fill the data values missing under the incomplete field name according to the missing values.

A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer program:

obtaining a data table uploaded by a user;

A computer-readable storage medium on which a computer program is stored, the computer program implementing the following steps when executed by a processor:

obtaining a data table uploaded by a user;

With the above data table filling method, device, computer equipment and storage medium, when a data table uploaded by a user is obtained, the incomplete field names lacking data values in the data table are determined. The missing type of each incomplete field name is determined according to the degree of correlation between that incomplete field name and the other field names in the data table. The missing values of the incomplete field name are then calculated from the data values existing in the data table, using the filling mode corresponding to its missing type, and the data values missing under the incomplete field name are filled with those missing values. Following these steps, the data values missing under every incomplete field name in the data table can be filled, so the data table is effectively filled and the accuracy of data analysis based on the filled data table can be significantly improved.

Brief description of the drawings

Fig. 1 is an application scenario diagram of the data table filling method in one embodiment;

Fig. 2 is a schematic flowchart of the data table filling method in one embodiment;

Fig. 3 is a schematic flowchart of the step of calculating missing values from the data values existing in the data table, using the filling mode corresponding to the missing type, in one embodiment;

Fig. 4 is a schematic flowchart of the step of calculating missing values from the data values existing in the data table, using the filling mode corresponding to the missing type, in another embodiment;

Fig. 5 is a schematic flowchart of the step of calculating missing values from the data values existing in the data table, using the filling mode corresponding to the missing type, in yet another embodiment;

Fig. 6 is a structural block diagram of the data table filling device in one embodiment;

Fig. 7 is an internal structure diagram of the computer equipment in one embodiment.

Detailed description of the embodiments

To make the objects, technical solutions and advantages of this application clearer, the application is further described below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the application and do not limit it.

The data table filling method provided by this application can be applied in the application environment shown in Fig. 1, in which a terminal 102 communicates with a server 104 through a network. The terminal 102 can obtain the data table uploaded by a user and send it to the server 104. The server 104 calculates the degree of correlation between the field names included in the data table and feeds the degree of correlation between any two field names back to the terminal 102. After determining the incomplete field names in the data table, that is, the field names lacking data values, the terminal 102 determines the missing type of each incomplete field name according to the degree of correlation between that incomplete field name and the other field names in the data table. The terminal 102 can further calculate missing values from the data values existing in the data table, using the filling mode corresponding to the missing type of the incomplete field name, fill the data values missing under the incomplete field name according to the missing values, and send the filled data table to the server 104. The terminal 102 can be, but is not limited to, a personal computer, laptop, smartphone, tablet computer or portable wearable device, and the server 104 can be implemented as an independent server or as a server cluster composed of multiple servers.

In one embodiment, as shown in Fig. 2, a data table filling method is provided. Taking the method as applied to the terminal in Fig. 1 as an example, it comprises the following steps:

Step 202: obtain the data table uploaded by a user.

The data table is a structured data form, for example a table in CSV (Comma-Separated Values) format. A CSV data table stores table data as plain text, and the stored table data includes numeric and character types. Specifically, a web interface can be provided through which the user uploads the data table, so that the terminal obtains the uploaded data table. In one embodiment, each user must generate the data table containing the report data from a preset file format or form template, so that the terminal can parse the table structure information of the uploaded data table.

Table 1 below is a schematic diagram of a data table in CSV format uploaded in one embodiment.

Table 1

As can be seen from Table 1 above, the elements of each row in the data table are separated by commas. The elements of the first row are the column names, that is, the table header or field names of the data table. The elements in a column are the data values of the corresponding field name, so one field name corresponds to multiple data values. From the second row onward, each row of data represents one sample in the data table; Table 1 shows 4 samples.
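As a rough illustration of the table structure just described, the following sketch parses CSV text with Python's standard `csv` module. The column names and values in `CSV_TEXT` are hypothetical and are not the contents of Table 1; treating an empty cell as a missing data value is also an assumption.

```python
import csv
import io

# Hypothetical sample data; column names and values are illustrative only.
CSV_TEXT = """name,age,education,loan_amount
A,30,bachelor,100000
B,41,,250000
C,25,master,
"""


def read_table(text):
    """Parse CSV text into (field_names, samples).
    Empty cells are stored as None to mark a missing data value."""
    reader = csv.reader(io.StringIO(text))
    field_names = next(reader)   # first row: header / field names
    samples = []
    for row in reader:           # every later row: one sample
        values = [cell if cell != "" else None for cell in row]
        samples.append(dict(zip(field_names, values)))
    return field_names, samples


field_names, samples = read_table(CSV_TEXT)
```

Each sample is kept as a dictionary keyed by field name, so the data values of one field name can be collected by reading the same key across all samples.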

Step 204: determine the incomplete field names in the data table, an incomplete field name being one that lacks data values.

An incomplete field name is a field name in the data table that lacks data values; correspondingly, a complete field name is a field name in the data table that lacks no data values. For example, in Table 1 above, the incomplete field names include education and loan amount, and the complete field names include name, gender, age, area, loan time and ID card number.

Specifically, after obtaining the data table uploaded by the user, the terminal can determine each field name in the data table that lacks data values, that is, each incomplete field name.

In one embodiment, determining the incomplete field names in the data table includes: counting the number of data values corresponding to each field name in the data table; determining the total number of samples in the data table; and, when the number is less than the total number of samples, determining the field name to be an incomplete field name.

Specifically, for the field names included in the data table, the terminal can count the number of data values corresponding to each field name and the total number of samples included in the data table. When the counted number of data values of a field name is less than the total number of samples, the field name lacks data values and is determined to be an incomplete field name.

For example, in Table 1 mentioned above, while traversing the data values of the field name "education", the terminal increases the corresponding count by 1 each time it finds a data value that is not empty (NULL), until all samples in the data table have been traversed. The counted number of data values of the field name "education" is 3 and the total number of samples is 4, so the field name "education" can be determined to be an incomplete field name. Similarly, the field name "loan amount" can also be determined to be an incomplete field name.
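The counting procedure above can be sketched as follows. The sample rows are hypothetical stand-ins (the actual contents of Table 1 are not reproduced here), `None` is assumed to represent an empty value, and `find_incomplete_fields` is an illustrative name.

```python
def find_incomplete_fields(field_names, samples):
    """Return the field names whose count of non-empty data values is
    smaller than the total number of samples."""
    total = len(samples)
    incomplete = []
    for name in field_names:
        # Count only the data values that are not empty (None).
        count = sum(1 for sample in samples if sample.get(name) is not None)
        if count < total:
            incomplete.append(name)
    return incomplete


# Hypothetical 4-sample table: "education" has 3 values, "loan amount" has 3.
samples = [
    {"name": "A", "education": "bachelor", "loan amount": 100000},
    {"name": "B", "education": "master", "loan amount": None},
    {"name": "C", "education": "junior college", "loan amount": 80000},
    {"name": "D", "education": None, "loan amount": 300000},
]
incomplete_fields = find_incomplete_fields(
    ["name", "education", "loan amount"], samples)
```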

Step 206: determine the missing type of the incomplete field name according to the degree of correlation between the incomplete field name and other field names in the data table.

The degree of correlation can indicate the implicit connection between two field names. The greater the degree of correlation between two field names, the stronger the connection between them; conversely, the smaller the degree of correlation, the weaker the connection. For example, in Table 1 above, for the "area" where the borrower is located, housing prices in Beijing, Shenzhen and Shanghai are generally higher, so the "loan amount" there is also generally higher than in other areas, which shows that an implicit connection exists between the field names "area" and "loan amount".

The missing type describes the possible connection between a field name lacking data values and other field names. Determining the missing type of an incomplete field name makes it possible to fill the missing data values with the corresponding filling mode. The missing types include missing completely at random, missing at random and missing not at random. It should be noted that the missing type of an incomplete field name can be both missing at random and missing not at random, in which case the terminal can use the corresponding filling mode as required to calculate the missing values of that incomplete field name.

Specifically, after determining the incomplete field names in the data table, the terminal can calculate the degree of correlation between the incomplete field name currently to be filled and the other field names in the data table, and determine the missing type of that incomplete field name according to the degree of correlation.

In one embodiment, step 206, determining the missing type of the incomplete field name according to the degree of correlation between the incomplete field name and other field names in the data table, includes: when the degrees of correlation between the incomplete field name and the other field names in the data table are all less than a first preset value, determining that the missing type of the incomplete field name is missing completely at random; when the degree of correlation between the incomplete field name and at least one complete field name in the data table is greater than a second preset value, determining that the missing type of the incomplete field name is missing at random; and when the degree of correlation between the incomplete field name and at least one other incomplete field name in the data table is greater than a third preset value, determining that the missing type of the incomplete field name is missing not at random.

Specifically, the terminal can set corresponding thresholds, calculate the degree of correlation between the incomplete field name currently to be filled and the other field names in the data table, and determine the missing type of that incomplete field name by comparing the degree of correlation with the thresholds.

If the degrees of correlation between the incomplete field name currently to be filled and the other field names are all less than the first preset value, no implicit connection exists between this incomplete field name and the remaining field names, so the data values missing under it are not associated with the data values of the other field names. The data values of the other field names need not be consulted when calculating its missing values, and its missing type can be determined to be missing completely at random.

If the degree of correlation between the incomplete field name currently to be filled and at least one complete field name in the data table is greater than the second preset value, a certain implicit connection exists between this incomplete field name and that complete field name, so the data values missing under it are associated with the data values of that complete field name. The data values of that complete field name need to be consulted when calculating its missing values, and its missing type can be determined to be missing at random.

If the degree of correlation between the incomplete field name currently to be filled and at least one other incomplete field name in the data table is greater than the third preset value, a certain implicit connection exists between this incomplete field name and that other incomplete field name, so the data values missing under it are associated with the data values of that other incomplete field name. The data values of that other incomplete field name need to be consulted when calculating its missing values, and its missing type can be determined to be missing not at random.
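The three threshold comparisons can be sketched as below. The preset values (0.2, 0.5, 0.5) are arbitrary placeholders, comparing the absolute value of the correlation is an assumption (the text only speaks of "less than" and "greater than"), and the third type is assumed here to correspond to missing not at random. The function returns a set because, as noted earlier in the text, a field name may be both missing at random and of the third type.

```python
def classify_missing_type(correlations, incomplete_fields,
                          first_preset=0.2, second_preset=0.5, third_preset=0.5):
    """correlations: {other field name: correlation with the field to fill}.
    incomplete_fields: set of the other field names that lack data values.
    Returns the set of missing types that apply (possibly empty, since the
    text does not say what happens when no rule matches)."""
    # Assumption: the strength of the link is the absolute correlation.
    strengths = {name: abs(value) for name, value in correlations.items()}
    types = set()
    if all(s < first_preset for s in strengths.values()):
        types.add("missing completely at random")
    if any(s > second_preset for name, s in strengths.items()
           if name not in incomplete_fields):
        types.add("missing at random")
    if any(s > third_preset for name, s in strengths.items()
           if name in incomplete_fields):
        types.add("missing not at random")
    return types


# Hypothetical correlations for the field "loan amount".
weak = classify_missing_type({"age": 0.05, "area": -0.1}, {"education"})
strong = classify_missing_type({"area": 0.8, "education": -0.7}, {"education"})
```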

In one of the embodiments, the data table filling method further includes a step of calculating the degree of correlation: computing the mean and standard deviation corresponding to each field name in the data table; and calculating, from the mean and standard deviation, the degree of correlation between any two field names according to the following formula:

$$\rho_{X,Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$$

where $\rho_{X,Y}$ denotes the degree of correlation between field name X and field name Y; $\mu_X$ denotes the mean corresponding to field name X; $\mu_Y$ denotes the mean corresponding to field name Y; $\sigma_X$ denotes the standard deviation corresponding to field name X; $\sigma_Y$ denotes the standard deviation corresponding to field name Y; and $E[(X-\mu_X)(Y-\mu_Y)]$ is the expected value of Z, with $Z_i = (X_i-\mu_X)(Y_i-\mu_Y)$.

Specifically, the terminal can obtain the data values of the incomplete field name X and average all of them to obtain $\mu_X$. Correspondingly, it obtains the data values of another field name Y and averages all of them to obtain $\mu_Y$. It then calculates the standard deviations of field name X and field name Y from the relationship between standard deviation and mean, which can be done with the following formula:

$$\sigma_X = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(X_i-\mu_X)^2}$$

where field name X has N data values in total and $X_i$ denotes the i-th data value of field name X. The terminal can then calculate each data value of Z from the calculated means and the data values of the field names, the i-th data value of Z being $(X_i-\mu_X)(Y_i-\mu_Y)$, and take the mean of the data values of Z as the expected value.
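The calculation described above (means, standard deviations, then the mean of Z as the expected value) can be sketched as follows, assuming the formula is the standard Pearson correlation coefficient with the standard deviation taken over all N values. Only samples where both field names have a value are assumed to be passed in, and returning 0.0 for a constant column is a choice made for this sketch, not something the text specifies.

```python
import math


def mean(values):
    return sum(values) / len(values)


def std(values):
    """Standard deviation over all N values (divides by N, an assumption)."""
    mu = mean(values)
    return math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))


def correlation(xs, ys):
    """Degree of correlation between two equally long numeric columns."""
    mu_x, mu_y = mean(xs), mean(ys)
    # Z_i = (X_i - mu_X)(Y_i - mu_Y); its mean serves as the expected value.
    z = [(x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)]
    denominator = std(xs) * std(ys)
    if denominator == 0:
        return 0.0  # assumption: a constant column is treated as uncorrelated
    return mean(z) / denominator


rho_up = correlation([1, 2, 3, 4], [2, 4, 6, 8])
rho_down = correlation([1, 2, 3], [3, 2, 1])
```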

In one embodiment, when calculating the degree of correlation between an incomplete field name and another field name, if the data value types of both field names are numeric, the degree of correlation can be calculated directly from their data values. If at least one of the two field names has character-type data values, the enumerated values of that field name can be collected first and a numeric data value matched to each enumerated value. The character-type data values are thereby converted to numeric data values, and the degree of correlation is then calculated from the matched data values.

For example, for the field name "education" in the data table, the enumerated values of the field name are collected: doctorate, master's, bachelor's, junior college, technical secondary school, junior high school and unknown. They can be converted in turn to the data values 6, 5, 4, 3, 2, 1 and 0, or to 100, 80, 70, 60, 50, 20 and 0, and the degree of correlation is then calculated from the converted data values. The relationship among the converted data values should stay consistent with the relationship among the character-type data values before conversion.
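A minimal sketch of this conversion, using the 0 to 6 mapping from the example. The English labels in `EDUCATION_LEVELS` are translations chosen for illustration, and `encode_ordinal` is a hypothetical helper name.

```python
# Enumerated values listed from lowest to highest, so that the list index
# preserves the original ordering (unknown = 0 ... doctorate = 6).
EDUCATION_LEVELS = [
    "unknown", "junior high school", "technical secondary school",
    "junior college", "bachelor's", "master's", "doctorate",
]


def encode_ordinal(values, ordered_levels):
    """Map each character-type value to its rank; keep None for missing."""
    mapping = {level: rank for rank, level in enumerate(ordered_levels)}
    return [mapping[v] if v is not None else None for v in values]


encoded = encode_ordinal(["doctorate", "master's", None, "unknown"],
                         EDUCATION_LEVELS)
```

Any strictly increasing mapping (such as the 0 to 100 scale in the example) keeps the ordering, although different spacings can change the resulting correlation value.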

Step 208: calculate missing values from the data values existing in the data table, using the filling mode corresponding to the missing type.

Specifically, after determining the missing type of the incomplete field name currently to be filled, the terminal can calculate the missing values of that incomplete field name from the data values existing in the data table, using the filling mode corresponding to the missing type. The existing data in the data table can be roughly divided into two classes: the data values of the incomplete field name itself, and the data values of the field names related to the incomplete field name.

In one of the embodiments, the missing type is missing completely at random, and step 208, calculating missing values from the data values existing in the data table using the filling mode corresponding to the missing type, includes: when the data value type of the incomplete field name is the character type, computing the median of the existing data values of the incomplete field name and using that median as the missing value of the incomplete field name, or computing the mode of the existing data values of the incomplete field name and using that mode as the missing value of the incomplete field name; and when the data value type of the incomplete field name is the numeric type, computing the average of the existing data values of the incomplete field name and using that average as the missing value of the incomplete field name.

Specifically, when the missing type of the incomplete field name is missing completely at random, the data values missing under the incomplete field name have little connection with the existing data values of the other field names in the data table, so the terminal can calculate the missing values from the existing data values of the incomplete field name itself.

A field name has the character data value type when its data values are character strings, and the numeric data value type when its data values are purely numeric. For example, in Table 1 mentioned above, the data value type of the complete field name "age" is numeric, the data value type of the incomplete field name "education" is the character type, and the data value type of the incomplete field name "loan amount" is numeric.

When the terminal determines that the missing type of the incomplete field name currently to be filled is missing completely at random and that its data value type is the character type, the terminal can compute the median of the existing data values of the incomplete field name and use that median as its missing value; alternatively, the terminal can compute the mode of the existing data values of the incomplete field name and use that mode as its missing value.

When the terminal determines that the missing type of the incomplete field name currently to be filled is missing completely at random and that its data value type is numeric, the terminal can compute the average of the existing data values of the incomplete field name and use that average as its missing value.
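The two cases above can be sketched with Python's standard `statistics` module. This sketch uses the mode for character-type values (one of the two options the text allows) and the arithmetic mean for numeric values; the function names and the `"numeric"` / `"character"` labels are illustrative.

```python
import statistics


def mcar_fill_value(values, value_type):
    """Compute the fill value from the existing (non-None) data values."""
    present = [v for v in values if v is not None]
    if value_type == "numeric":
        return statistics.mean(present)
    if value_type == "character":
        # Mode chosen here; the text also allows the median.
        return statistics.mode(present)
    raise ValueError(f"unknown value type: {value_type}")


def fill_missing(values, fill_value):
    """Replace every missing entry (None) with the computed fill value."""
    return [fill_value if v is None else v for v in values]


# Hypothetical columns.
loan_fill = mcar_fill_value([1000, None, 3000], "numeric")
edu_fill = mcar_fill_value(["bachelor", "master", "bachelor", None], "character")
filled_loans = fill_missing([1000, None, 3000], loan_fill)
```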

Step 210: fill the data values missing under the incomplete field name according to the missing values.

Specifically, after calculating the missing values of each incomplete field name in the data table according to the above steps 202 to 204, the terminal can fill the data values missing under each incomplete field name with the respective missing values. The filled data table contains no missing data values, which makes it convenient to carry out data analysis or data statistics on the filled data table.

With the above data table filling method, when a data table uploaded by a user is obtained, the incomplete field names lacking data values in the data table are determined. The missing type of each incomplete field name is determined according to the degree of correlation between that incomplete field name and the other field names in the data table. The missing values of the incomplete field name are then calculated from the data values existing in the data table, using the filling mode corresponding to its missing type, and the data values missing under the incomplete field name are filled with those missing values. Following these steps, the data values missing under every incomplete field name in the data table can be filled, so the data table is effectively filled and the accuracy of data analysis based on the filled data table can be significantly improved.

As shown in figure 3, deletion type is completely random missing in one of the embodiments,；Step 208, according to data Existing data value in table, according to deletion type it is corresponding fill up mode and calculate missing values include:

Step 302, the first kind sample that the non-fully corresponding data value of field name has been lacked in tables of data is determined；

Step 304, the second class sample existing for the non-fully corresponding data value of field name is determined in tables of data；

Wherein, sample is the data entry recorded in tables of data, and each sample has respective number under each field name According to value.First kind sample is the sample of the non-fully field name currently to be filled up in tables of data corresponding data value missing, second Class sample is sample existing for the corresponding data value of non-fully field name currently to be filled up in tables of data.For example, being mentioned above And table 1 in, for the non-fully field name " amount of the loan " currently to be filled up, second sample belongs to first kind sample This, first sample, third sample and the 4th sample belong to the second class sample；And it is directed to currently to be filled up non-fully For field name " area ", the 4th sample belongs to first kind sheet, first sample, second sample and third sample category In the second class sample.

Step 306, the sample size of first kind sample is counted；

Step 308, the ratio that sample size accounts for total sample number is calculated；

Specifically, when the deletion type of the non-fully field name currently to be filled up is completely random missing, that non-fully field name has little relation to the other field names in the data table. The terminal can count the first-class samples and calculate the ratio of that sample size to the total number of samples in the data table.

Step 310, when ratio is greater than threshold value, then data value of the first kind sample under non-fully field name is replaced with First value；Data value of the second class sample under non-fully field name is replaced with into second value.

When the ratio is greater than the threshold value, many samples in the data table lack a data value under the non-fully field name currently to be filled up. For example, the threshold value can be set to 50%. If more than half of the samples lack a data value under that non-fully field name, data analysis and data statistics will inevitably be affected. Since this non-fully field name also has little relation to the other field names, the terminal can binarize the data values under it: the data value of each first-class sample under the non-fully field name is replaced with the first value, and the data value of each second-class sample under the non-fully field name is replaced with the second value.

For example, suppose the terminal determines that the non-fully field name "ID card No." in the data table belongs to the completely random missing type, and counts that more than half of the samples are first-class samples, that is, more than half of the samples lack a data value under "ID card No.". The terminal can then replace the data value of every sample that has one under "ID card No." with "1", and the data value of every sample that lacks one with "0". Although a large number of data values are missing, they have little relation to the other data values in the data table, so replacing the original data values by binarization retains some information compared with directly deleting all data values under the non-fully field name.
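Steps 302 to 310 can be sketched as follows. This is a minimal illustration only: the function name `binarize_if_mostly_missing`, the use of `None` to mark a missing data value, and the placeholder values are assumptions and do not come from the original text.

```python
def binarize_if_mostly_missing(values, threshold=0.5, first_value=0, second_value=1):
    """Binarize one column when the ratio of missing values exceeds the threshold.

    `None` marks a missing data value (an assumption of this sketch).
    Returns an unchanged copy when the ratio does not exceed the threshold.
    """
    total = len(values)
    if total == 0:
        return list(values)
    # First-class samples: the data value under the field is missing.
    first_class_count = sum(1 for v in values if v is None)
    ratio = first_class_count / total
    if ratio <= threshold:
        return list(values)
    # First-class samples get the first value, second-class samples the second value.
    return [first_value if v is None else second_value for v in values]


# Hypothetical "ID card No." column: 3 of 5 samples are missing (ratio 0.6 > 0.5).
id_card_column = ["id-a", None, None, "id-b", None]
print(binarize_if_mostly_missing(id_card_column))
```

The defaults follow the example in the text, where existing values become "1" and missing values become "0".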

In one of the embodiments, as shown in Fig. 4, the deletion type is missing at random. Step 208, calculating missing values from the data values existing in the data table according to the fill-up mode corresponding to the deletion type, includes:

Step 402, complete field name relevant to non-fully field name is determined；

Specifically, when the deletion type of the non-fully field name is missing at random, the non-fully field name is related to at least one complete field name in the data table. The terminal can determine, according to step 206, the complete field name related to the non-fully field name currently to be filled up.

Step 404, the sample in tables of data is clustered according to the data value of complete field name, obtains clustering cluster；

Specifically, after determining at least one complete field name in the data table that is related to the non-fully field name currently to be filled up, the terminal can cluster all samples in the data table according to the data values under the at least one complete field name, obtaining clustering clusters.

In one embodiment, the terminal can cluster all samples according to the similarity between the data values under the determined at least one complete field name. Alternatively, the terminal can map the data values under the complete field name to multiple categories and then cluster the samples by the category of each data value.

For example, for the complete field name "length of service" related to the non-fully field name "year-end bonus", the terminal can cluster all samples in the data table by "length of service": samples with 1 or 2 years of service form one class, samples with 3 to 5 years form one class, samples with 6 to 8 years form one class, and samples with more than 8 years form one class. When multiple complete field names are related to "year-end bonus", the samples in the data table can be clustered using the data values under these complete field names in combination, obtaining the clustering clusters.

Step 406, the third class sample that the non-fully corresponding data value of field name has been lacked in tables of data is determined；

Further, the terminal identifies the third-class samples, whose data values under the non-fully field name currently to be filled up are missing, and determines which clustering cluster obtained in step 404 each third-class sample belongs to.

Step 408, mean value of the sample under non-fully field name included by clustering cluster belonging to third class sample is calculated, it will The mean value being calculated is as missing values to be filled up.

Specifically, after determining the clustering cluster to which a third-class sample belongs, the terminal can calculate the mean of the data values of all samples in that clustering cluster under the non-fully field name to be filled up, and use the calculated mean as the missing value, under that non-fully field name, of the samples falling in that clustering cluster.

In the present embodiment, when the deletion type is missing at random, a corresponding missing value is calculated for each sample lacking a data value after the samples are clustered. Compared with filling up the data values of all samples under the non-fully field name with the same missing value, the filled-up data values are more accurate.
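Steps 402 to 408 can be sketched as follows, using the category-mapping variant of clustering from the "length of service" example. The function names, the dictionary keys, the bonus figures, and the choice to leave a gap when a cluster has no observed value are all assumptions of this sketch, not part of the original text.

```python
from collections import defaultdict
from statistics import mean


def service_bucket(years):
    """Map a length of service to one of the four categories from the example."""
    if years <= 2:
        return "1-2"
    if years <= 5:
        return "3-5"
    if years <= 8:
        return "6-8"
    return "over 8"


def fill_by_cluster_mean(samples, complete_field, target_field, cluster_of):
    """Fill missing target values (None) with the mean of the sample's cluster."""
    observed = defaultdict(list)
    for sample in samples:
        if sample[target_field] is not None:
            observed[cluster_of(sample[complete_field])].append(sample[target_field])

    filled = []
    for sample in samples:
        row = dict(sample)  # copy so the input table is not modified
        if row[target_field] is None:  # a third-class sample
            known = observed.get(cluster_of(row[complete_field]))
            if known:  # leave the gap when the cluster has no observed value
                row[target_field] = mean(known)
        filled.append(row)
    return filled


# Hypothetical table: two samples lack a "year_end_bonus" value.
table = [
    {"length_of_service": 1, "year_end_bonus": 1000},
    {"length_of_service": 2, "year_end_bonus": 2000},
    {"length_of_service": 2, "year_end_bonus": None},
    {"length_of_service": 4, "year_end_bonus": 5000},
    {"length_of_service": 5, "year_end_bonus": None},
]
filled_table = fill_by_cluster_mean(
    table, "length_of_service", "year_end_bonus", service_bucket
)
print([row["year_end_bonus"] for row in filled_table])
```

Each missing bonus is filled from its own cluster, so the two gaps receive different values rather than one table-wide mean.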

In one of the embodiments, as shown in Fig. 5, the deletion type is missing at random. Step 208, calculating missing values from the data values existing in the data table according to the fill-up mode corresponding to the deletion type, includes:

Step 502, first sample set existing for the non-fully corresponding data value of field name and non-is determined in tables of data Second sample set of the corresponding data value missing of field name completely；

In the present embodiment, when the deletion type of the non-fully field name to be filled up is missing at random, the terminal can also construct a prediction model from the data values under the complete field names related to that non-fully field name, and use the prediction model to predict the data values missing under the non-fully field name. Specifically, the terminal first divides all samples in the data table into two classes. One class consists of the samples whose data values under the non-fully field name currently to be filled up exist; the set of these samples is called the first sample set. The other class consists of the samples whose data values under that non-fully field name are missing; the set of these samples is called the second sample set.

Step 504, according to the corresponding data value of complete field name relevant to non-fully field name in first sample set Construct prediction model；

Further, the terminal can determine the complete field names related to the non-fully field name currently to be filled up, obtain the data values of all samples in the first sample set under the determined complete field names, and establish a predictive relationship between these data values and the corresponding data values under the non-fully field name.

Step 506, sample each in the second sample set is input to prediction mould in the corresponding data value of complete field name In type, predicted value of each sample under non-fully field name in the second sample set is exported by prediction model；

Step 508, using predicted value as missing values to be filled up.

For example, all samples in the data table are divided into two classes according to whether a data value exists under the non-fully field name m. This yields the first sample set X=(001, 002, 003, 005, ...), where 001 represents the first sample, 002 represents the second sample, and so on, and the second sample set X'=(004, 006, ...). The set of data values of the samples in the first sample set X under the non-fully field name m is m=(m1, m2, m3, m5, ...); the data values of the samples in the second sample set X' under the non-fully field name m are missing. The complete field names related to the non-fully field name m are determined to be n, p, and q. The data values of each sample in the first sample set X under the complete field names n, p, and q are obtained, and a prediction model is built from the implicit relation between n=(n1, n2, n3, n5, ...), p=(p1, p2, p3, p5, ...), q=(q1, q2, q3, q5, ...) and the set m=(m1, m2, m3, m5, ...):

m = n·w1 + p·w2 + q·w3 + b, where w1, w2, w3, and b are trainable model parameters.

The model here is only an example, used to show that the inputs of the prediction model are n, p, and q and the output is m. When constructing the prediction model, the model parameters can be adjusted by gradient descent so that the constructed prediction model fits each sample in the first sample set.

After the prediction model is obtained, the data values of each sample in the second sample set under the complete field names n, p, and q are input into the prediction model, which outputs the corresponding data value of each sample under the non-fully field name m. The predicted values output by the model can then be filled in as the missing data values. In this way, the missing values of the samples under the non-fully field name m are not all identical but are closely tied to the related complete field names, which improves the accuracy of the missing values to be filled up.
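Steps 502 to 508 can be sketched with the example linear model, fitted by gradient descent on a mean squared error. The function names, the learning rate, the number of epochs, the squared-error loss, and the toy data are assumptions chosen for illustration; the text only states that the inputs are n, p, and q, the output is m, and the parameters are trainable.

```python
def fit_linear_model(features, targets, learning_rate=0.1, epochs=20000):
    """Fit m = n*w1 + p*w2 + q*w3 + b by batch gradient descent on mean squared error."""
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    count = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, targets):
            error = sum(w * xi for w, xi in zip(weights, x)) + bias - y
            for j, xi in enumerate(x):
                grad_w[j] += 2.0 * error * xi / count
            grad_b += 2.0 * error / count
        # Move each parameter against its gradient.
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


def predict(weights, bias, x):
    """Return the model output for one sample's (n, p, q) values."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


# First sample set: (n, p, q) values with an existing m (toy data, m = 2n + p - q + 0.5).
first_set_inputs = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0), (0, 1, 1)]
first_set_m = [2.5, 1.5, -0.5, 3.5, 0.5]

# Second sample set: (n, p, q) values of samples whose m is missing.
second_set_inputs = [(1, 0, 1)]

weights, bias = fit_linear_model(first_set_inputs, first_set_m)
predicted_m = [predict(weights, bias, x) for x in second_set_inputs]
print(predicted_m)
```

Any other regression model could stand in for the linear one, since the text presents the formula only as an example.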

In a specific embodiment, tables of data complementing method specifically includes the following steps:

Obtain the tables of data that user uploads.

Determine the non-fully field name that data value has been lacked in tables of data.

Compute the mean and standard deviation corresponding to each field name in the data table.

According to the mean and standard deviation, the degree of correlation between any two field names is calculated by the following formula:

ρ_(X,Y) = E[(X-μ_X)(Y-μ_Y)] / (σ_X·σ_Y)

When the degrees of correlation between the non-fully field name and the other field names in the data table are all less than the first preset value, the deletion type of the non-fully field name is determined to be completely random missing.

When the degree of correlation between the non-fully field name and at least one complete field name in the data table is greater than the second preset value, the deletion type of the non-fully field name is determined to be missing at random.

When the degree of correlation between the non-fully field name and at least one other non-fully field name in the data table is greater than the third preset value, the deletion type of the non-fully field name is determined to be non-random missing.

When the deletion type is completely random missing and the data value type corresponding to the non-fully field name is character type, the median of the existing data values under the non-fully field name is computed and used as the missing value corresponding to the non-fully field name; or the mode of the existing data values under the non-fully field name is computed and used as the missing value corresponding to the non-fully field name.

When the deletion type is completely random missing and the data value type corresponding to the non-fully field name is numeric type, the average of the existing data values under the non-fully field name is computed and used as the missing value corresponding to the non-fully field name;

Alternatively,

When deletion type is that completely random lacks, determines and lacked the corresponding data value of non-fully field name in tables of data First kind sample；Determine in tables of data the second class sample existing for the non-fully corresponding data value of field name；Count the first kind The sample size of sample；Calculate the ratio that sample size accounts for total sample number；When ratio is greater than threshold value, then first kind sample is existed Non-fully the data value under field name replaces with the first value；Data value of the second class sample under non-fully field name is replaced with Second value.

When deletion type is missing at random, it is determined that complete field name relevant to non-fully field name；According to complete The data value of field name clusters the sample in tables of data, obtains clustering cluster；It determines and has lacked non-fully word in tables of data The third class sample of the corresponding data value of section name；Sample included by clustering cluster belonging to third class sample is calculated in non-fully field Mean value under one's name, using the mean value being calculated as missing values to be filled up；

Alternatively,

When deletion type is missing at random, it is determined that in tables of data non-fully the existing for the corresponding data value of field name Second sample set of the corresponding data value missing of one sample set and non-fully field name；According in first sample set with Non-fully the corresponding data value of the relevant complete field name of field name constructs prediction model；By sample each in the second sample set It is input in prediction model in the corresponding data value of complete field name, each sample in the second sample set is exported by prediction model Originally the predicted value under non-fully field name；Using predicted value as missing values to be filled up.

The data value of non-fully field name missing is filled up according to missing values.
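The degree of correlation used in the steps above can be sketched as follows, reading the symbol definitions in claim 7 as a Pearson correlation. Restricting the computation to samples where both fields have a value, using the population standard deviation, and returning 0.0 for a constant field are assumptions of this sketch.

```python
from statistics import fmean, pstdev


def degree_of_correlation(xs, ys):
    """Return E[(X - mu_X)(Y - mu_Y)] / (sigma_X * sigma_Y) for two columns.

    `None` marks a missing data value; only samples with both values are used.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    if len(pairs) < 2:
        return 0.0
    x_vals = [x for x, _ in pairs]
    y_vals = [y for _, y in pairs]
    mu_x, mu_y = fmean(x_vals), fmean(y_vals)
    sigma_x, sigma_y = pstdev(x_vals), pstdev(y_vals)
    if sigma_x == 0 or sigma_y == 0:
        return 0.0  # a constant column carries no correlation information
    expectation = fmean([(x - mu_x) * (y - mu_y) for x, y in pairs])
    return expectation / (sigma_x * sigma_y)


# Hypothetical columns: the third bonus value is missing and is skipped.
length_of_service = [1, 2, 3, 4]
year_end_bonus = [2000, 4000, None, 8000]
print(degree_of_correlation(length_of_service, year_end_bonus))
```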

It should be understood that although the steps in the flow charts of Fig. 2 to Fig. 5 are shown in sequence as indicated by the arrows, these steps are not necessarily executed in that sequence. Unless expressly stated otherwise herein, there is no strict limit on the order of execution, and the steps can be executed in other orders. Moreover, at least some of the steps in Fig. 2 to Fig. 5 may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same moment but can be executed at different times, and they are not necessarily executed one after another but can be executed in turn or alternately with other steps, or with at least part of the sub-steps or stages of other steps.

In one embodiment, as shown in fig. 6, providing a kind of tables of data fills up device 600, comprising: tables of data obtains Module 602, non-fully field name determining module 604, deletion type determining module 606, missing values computing module 608 and fill up mould Block 610, in which:

Tables of data obtains module 602, for obtaining the tables of data of user's upload；

Non-fully field name determining module 604, for determining the non-fully field name in tables of data, non-fully field name is lacked Few data value；

Deletion type determining module 606, for according to the non-fully phase between field name and field names other in tables of data Guan Du determines the deletion type of non-fully field name；

Missing values computing module 608, for being filled up according to deletion type is corresponding according to data value existing in tables of data Mode calculates missing values；

Module 610 is filled up, for filling up the data value of non-fully field name missing according to missing values.

In one of the embodiments, the deletion type determining module 606 is also used to count the quantity of data values corresponding to each field name in the data table; determine the total number of samples corresponding to the data table; and, when the quantity is less than the total number of samples, determine the field name to be a non-fully field name.
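The counting rule in this embodiment can be sketched as follows. The function name `find_non_fully_field_names`, the representation of the data table as a list of dictionaries, and the sample values are assumptions for illustration.

```python
def find_non_fully_field_names(rows, field_names):
    """Return the field names whose count of data values is below the total sample number."""
    total_samples = len(rows)
    non_fully = []
    for name in field_names:
        # Count the samples that actually carry a data value under this field name.
        value_count = sum(1 for row in rows if row.get(name) is not None)
        if value_count < total_samples:
            non_fully.append(name)
    return non_fully


# Hypothetical table: the second sample lacks a loan amount, the fourth lacks an area.
loan_rows = [
    {"name": "A", "loan_amount": 10000, "area": "east"},
    {"name": "B", "loan_amount": None, "area": "south"},
    {"name": "C", "loan_amount": 30000, "area": "east"},
    {"name": "D", "loan_amount": 20000, "area": None},
]
print(find_non_fully_field_names(loan_rows, ["name", "loan_amount", "area"]))
```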

In one of the embodiments, the deletion type determining module 606 is also used to: determine that the deletion type of the non-fully field name is completely random missing when the degrees of correlation between the non-fully field name and the other field names in the data table are all less than the first preset value; determine that the deletion type of the non-fully field name is missing at random when the degree of correlation between the non-fully field name and at least one complete field name in the data table is greater than the second preset value; and determine that the deletion type of the non-fully field name is non-random missing when the degree of correlation between the non-fully field name and at least one other non-fully field name in the data table is greater than the third preset value.
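The three threshold rules can be sketched as follows. The text does not say which rule wins when more than one condition holds, nor whether negative correlations are compared by magnitude; checking the rules in the order listed, comparing absolute values, returning "undetermined" when no rule applies, and the preset values used here are all assumptions of this sketch.

```python
def determine_deletion_type(
    corr_with_complete,
    corr_with_non_fully,
    first_preset=0.2,
    second_preset=0.5,
    third_preset=0.5,
):
    """Classify a non-fully field name from its correlations with the other field names.

    Both arguments map a field name to its degree of correlation with the
    non-fully field name being classified.
    """
    complete = [abs(c) for c in corr_with_complete.values()]
    non_fully = [abs(c) for c in corr_with_non_fully.values()]

    # Rule 1: weakly correlated with every other field name.
    if all(c < first_preset for c in complete + non_fully):
        return "completely random missing"
    # Rule 2: strongly correlated with at least one complete field name.
    if any(c > second_preset for c in complete):
        return "missing at random"
    # Rule 3: strongly correlated with at least one other non-fully field name.
    if any(c > third_preset for c in non_fully):
        return "non-random missing"
    return "undetermined"


print(determine_deletion_type({"length_of_service": 0.8}, {}))
```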

In one of the embodiments, the deletion type is completely random missing. The missing values computing module 608 is also used to: when the data value type corresponding to the non-fully field name is character type, compute the median of the existing data values under the non-fully field name and use it as the missing value corresponding to the non-fully field name, or compute the mode of the existing data values under the non-fully field name and use it as the missing value corresponding to the non-fully field name; and when the data value type corresponding to the non-fully field name is numeric type, compute the average of the existing data values under the non-fully field name and use it as the missing value corresponding to the non-fully field name.
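The type-dependent fill-up can be sketched as follows. This sketch uses the mode for character-type values and the average for numeric-type values; the median option for character-type values is not shown. The function names and the way the value type is detected are assumptions.

```python
from statistics import fmean, mode


def simple_missing_value(values):
    """Return one fill-up value for a column: average if numeric, mode otherwise."""
    present = [v for v in values if v is not None]
    if not present:
        return None  # nothing observed, so nothing can be computed
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    return fmean(present) if is_numeric else mode(present)


def fill_column(values):
    """Replace every missing value (None) in the column with the computed value."""
    fill = simple_missing_value(values)
    return [fill if v is None else v for v in values]


print(fill_column([10, None, 20]))
print(fill_column(["east", None, "east", "west"]))
```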

Deletion type is completely random missing in one of the embodiments,；Missing values computing module 608 is also used to determine The first kind sample of the non-fully corresponding data value of field name has been lacked in tables of data；Determine in tables of data non-fully field name pair Second class sample existing for the data value answered；Count the sample size of first kind sample；It calculates sample size and accounts for total sample number Ratio；When ratio is greater than threshold value, then data value of the first kind sample under non-fully field name is replaced with into the first value；By Data value of the two class samples under non-fully field name replaces with second value.

Deletion type is missing at random in one of the embodiments,；Missing values computing module 608 be also used to it is determining with it is non- The relevant complete field name of complete field name；The sample in tables of data is clustered according to the data value of complete field name, is obtained To clustering cluster；Determine the third class sample that the corresponding data value of non-fully field name has been lacked in tables of data；Calculate third class sample Mean value of the sample under non-fully field name included by clustering cluster belonging to this is lacked the mean value being calculated as to be filled up Mistake value.

Deletion type is missing at random in one of the embodiments,；Missing values computing module 608 is also used to determine data Non-fully first sample set existing for the corresponding data value of field name and non-fully the corresponding data value of field name lacks in table The second sample set lost；According to the corresponding data value of complete field name relevant to non-fully field name in first sample set Construct prediction model；Sample each in second sample set is input to prediction model in the corresponding data value of complete field name In, predicted value of each sample under non-fully field name in the second sample set is exported by prediction model；Predicted value is made For missing values to be filled up.

In one of the embodiments, the tables of data fill-up device 600 further includes a relatedness computation module. The relatedness computation module is used to compute the mean and standard deviation corresponding to each field name in the data table and, according to the mean and standard deviation, calculate the degree of correlation between any two field names by the following formula:

ρ_(X,Y) = E[(X-μ_X)(Y-μ_Y)] / (σ_X·σ_Y)

With the above tables of data fill-up device 600, when the data table uploaded by the user is obtained, the non-fully field name lacking data values in the data table is determined. The deletion type of the non-fully field name is determined according to the degree of correlation between the non-fully field name and the other field names in the data table. The missing values corresponding to the non-fully field name are then calculated from the data values existing in the data table, using the fill-up mode corresponding to that deletion type, and the data values missing under the non-fully field name are filled up with these missing values. Following these steps for each non-fully field name fills up the data table effectively, so the accuracy of data analysis based on the filled-up data table can also be significantly improved.

For the specific limitations of the tables of data fill-up device 600, refer to the limitations of the tables of data complementing method above; details are not repeated here. Each module of the tables of data fill-up device 600 can be realized in whole or in part by software, hardware, or a combination thereof. Each module can be embedded in, or be independent of, the processor of the computer equipment in hardware form, or can be stored in the memory of the computer equipment in software form, so that the processor can call and execute the operation corresponding to each module.

In one embodiment, a computer equipment is provided. The computer equipment can be a terminal, and its internal structure can be as shown in Fig. 7. The computer equipment includes a processor, a memory, a network interface, and an input device connected by a system bus. The processor of the computer equipment provides computing and control capability. The memory of the computer equipment includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The network interface of the computer equipment is used to communicate with an external terminal through a network connection. When the computer program is executed by the processor, a tables of data complementing method is realized. The input device of the computer equipment can be a touch layer covering a display screen, a key, trackball, or trackpad arranged on the housing of the computer equipment, or an external keyboard, trackpad, or mouse.

Those skilled in the art will understand that the structure shown in Fig. 7 is only a block diagram of the part of the structure relevant to the scheme of this application and does not limit the computer equipment to which the scheme is applied. A specific computer equipment may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.

In one embodiment, the tables of data fill-up device provided by this application can be implemented in the form of a computer program, and the computer program can run on the computer equipment shown in Fig. 7. The memory of the computer equipment can store the program modules that make up the tables of data fill-up device 600, for example the tables of data obtaining module 602, the non-fully field name determining module 604, the deletion type determining module 606, the missing values computing module 608, and the fill-up module 610 shown in Fig. 6. The computer program formed by these program modules causes the processor to execute the steps of the tables of data complementing method of each embodiment of this application described in this specification.

For example, computer equipment shown in Fig. 7 can fill up the data in device 600 by tables of data as shown in FIG. 6 Table obtains module and executes step S202.Computer equipment can execute step S204 by non-fully field name determining module.It calculates Machine equipment can execute step S206 by deletion type determining module.Computer equipment can execute step by missing values computing module Rapid S208.Computer equipment can execute step S210 by filling up module.

In one embodiment, a kind of computer equipment, including memory and processor are provided, memory is stored with meter Calculation machine program, when computer program is executed by processor, so that the step of processor executes above-mentioned tables of data complementing method.Herein The step of tables of data complementing method, can be the step in the tables of data complementing method of above-mentioned each embodiment.

In one embodiment, a kind of computer readable storage medium is provided, computer program, computer journey are stored with When sequence is executed by processor, so that the step of processor executes above-mentioned tables of data complementing method.Tables of data complementing method herein Step can be the step in the tables of data complementing method of above-mentioned each embodiment.

Those of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, the computer program may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used in the embodiments provided by this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).

The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.

The above embodiments express only several implementations of this application. Their description is relatively specific and detailed, but it should not therefore be construed as limiting the scope of the patent. It should be pointed out that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims

1. a kind of tables of data complementing method, which comprises

Obtain the tables of data that user uploads；

The non-fully word is determined according to the degree of correlation in the non-fully field name and the tables of data between other field names The deletion type of section name；

According to data value existing in the tables of data, missing values are calculated according to the corresponding mode of filling up of the deletion type；

2. The method according to claim 1, wherein determining the deletion type of the non-fully field name according to the degree of correlation between the non-fully field name and the other field names in the tables of data includes:

When the degree of correlation in the non-fully field name and the tables of data between other field names is respectively less than the first preset value, Then determine the deletion type of the non-fully field name for completely random missing；

When the degree of correlation in the non-fully field name and the tables of data between at least one complete field name is greater than second in advance If when value, it is determined that the deletion type of the non-fully field name is missing at random；

When the degree of correlation between the non-fully field name and at least one other non-fully field name in the tables of data is greater than the third preset value, the deletion type of the non-fully field name is determined to be non-random missing.

3. the method according to claim 1, wherein the deletion type is completely random missing；The basis Existing data value in the tables of data, according to the deletion type it is corresponding fill up mode and calculate missing values include:

When the data value type corresponding to the non-fully field name is character type, the median of the existing data values under the non-fully field name is computed and used as the missing value corresponding to the non-fully field name; or the mode of the existing data values under the non-fully field name is computed and used as the missing value corresponding to the non-fully field name；

When the data value type corresponding to the non-fully field name is numeric type, the average of the existing data values under the non-fully field name is computed and used as the missing value corresponding to the non-fully field name.

4. the method according to claim 1, wherein the deletion type is completely random missing；The basis Existing data value in the tables of data, according to the deletion type it is corresponding fill up mode and calculate missing values include:

Count the sample size of the first kind sample；

When the ratio is greater than threshold value, then data value of the first kind sample under the non-fully field name is replaced with First value；Data value of the second class sample under the non-fully field name is replaced with into second value.

5. the method according to claim 1, wherein the deletion type is missing at random；It is described according to Existing data value in tables of data, according to the deletion type it is corresponding fill up mode and calculate missing values include:

Determine complete field name relevant to the non-fully field name；

Mean value of the sample included by clustering cluster belonging to the third class sample under the non-fully field name is calculated, will be calculated Obtained mean value is as missing values to be filled up.

6. The method according to claim 1, wherein the missing type is missing at random; and calculating the missing values from the existing data values in the data table, according to the filling mode corresponding to the missing type, comprises:

determining, in the data table, a first sample set in which the data value corresponding to the incomplete field name exists, and a second sample set in which the data value corresponding to the incomplete field name is missing;

building a prediction model from the data values, in the first sample set, corresponding to the complete field names related to the incomplete field name;

inputting the data values of each sample in the second sample set under the complete field names into the prediction model, and outputting, through the prediction model, a predicted value of each sample in the second sample set under the incomplete field name;

using the predicted values as the missing values to be filled.
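The steps of this claim can be sketched with a very simple prediction model. The claim does not say which model is used; the sketch below assumes, purely for illustration, a one-variable least-squares line fitted on a single related complete field, with missing values represented as `None`. The names `fit_line` and `regression_fill` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("complete field is constant; cannot fit a line")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def regression_fill(complete, incomplete):
    """Fill None entries of `incomplete` using a model trained on `complete`.

    First sample set: rows where `incomplete` has a value (training data).
    Second sample set: rows where `incomplete` is None (to be predicted).
    ASSUMPTION: a one-variable linear model stands in for the prediction model.
    """
    train = [(x, y) for x, y in zip(complete, incomplete) if y is not None]
    slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])
    return [
        slope * x + intercept if y is None else y
        for x, y in zip(complete, incomplete)
    ]
```

Any regression or classification model could take the place of `fit_line`; the split into a training set with existing values and a prediction set with missing values is the part that follows the claim.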

7. The method according to any one of claims 1 to 6, wherein the method further comprises:

where ρ_(X,Y) denotes the degree of correlation between field name X and field name Y; μ_X denotes the mean corresponding to field name X; μ_Y denotes the mean corresponding to field name Y; σ_X denotes the standard deviation corresponding to field name X; σ_Y denotes the standard deviation corresponding to field name Y; and E[(X-μ_X)(Y-μ_Y)] is the expected value of Z, where Z = (X-μ_X)(Y-μ_Y).
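The symbol definitions above match those of the Pearson correlation coefficient. Assuming that is the formula this claim refers to (the formula itself does not appear in the text above), it would read:

```latex
\rho_{X,Y} = \frac{E\left[(X-\mu_X)(Y-\mu_Y)\right]}{\sigma_X\,\sigma_Y}
```

Under this reading, ρ_(X,Y) lies between -1 and 1, and values near 0 indicate little linear relationship between the two field names.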

8. A data table filling device, wherein the device comprises:

an incomplete field name determining module, configured to determine the incomplete field names in the data table, the incomplete field names lacking data values;

a missing type determining module, configured to determine the missing type of the incomplete field name according to the degree of correlation between the incomplete field name and the other field names in the data table;

a missing value calculating module, configured to calculate the missing values from the existing data values in the data table, according to the filling mode corresponding to the missing type;

9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.

10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.