WO2019075599A1

WO2019075599A1 - Data filling method and device

Info

Publication number: WO2019075599A1
Application number: PCT/CN2017/106280
Authority: WO
Inventors: 赵敏; 林磊
Original assignee: 深圳乐信软件技术有限公司
Priority date: 2017-10-16
Filing date: 2017-10-16
Publication date: 2019-04-25
Also published as: CN109564641B; CN109564641A

Abstract

A data filling method and device. The method may comprise: acquiring sample data and an objective function, wherein the sample data comprises data corresponding to at least one parameter of salary income, working time and repayment records, the objective function takes the at least one parameter as an independent variable, and the output target variable of the objective function is a user's probability of being overdue; traversing the sample data according to the independent variable included in the objective function to obtain a traversal result; calculating, according to the traversal result, a data missing rate corresponding to the independent variable; and filling in, according to a missing rate interval that the data missing rate belongs to, a missing value in the sample data corresponding to the independent variable by using the corresponding data filling method, wherein different missing rate intervals correspond to different data filling methods, and the data filling methods comprise at least two of tag grouping filling, BETA distribution filling, random drawing filling, logistic regression filling and mean filling.

Description

Data filling method and device

Technical field

The present disclosure relates to the field of data processing technologies, for example, to a data filling method and device.

Background art

In a big data environment, because data sources and data generation methods are diverse, data may be missing in many application scenarios, and the missing data may carry useful or even critical information. If missing values are not handled properly, they may affect the construction of subsequent models, such as logistic regression and neural network models, and degrade the training of machine learning models.

In the field of e-commerce, when a user's credit is evaluated, a machine learning model is usually used to calculate the user's overdue probability, from which the user's credit is then assessed. If data is missing from the user sample data during training, the trained machine learning model may be unable to calculate the user's overdue probability accurately, which makes it impossible to provide the user with a well-matched service, such as adjusting the user's credit limit. In the related art, missing values are usually filled in manually; the workload is heavy, the efficiency is low, and reliance on human experience cannot guarantee the validity of the filled data.

Summary of the invention

The present disclosure provides a data padding method and apparatus, which can improve the efficiency of data padding. This embodiment provides a data padding method, which may include:

Obtaining sample data and an objective function, wherein the sample data includes data corresponding to at least one parameter among salary income, working time, and repayment records, the objective function takes the at least one parameter as an independent variable, and the output target variable of the objective function is a user's overdue probability;

Traversing the sample data according to the independent variable included in the objective function to obtain a traversal result;

Calculating a data missing rate corresponding to the independent variable according to the traversal result;

Filling in, according to the missing rate interval to which the data missing rate belongs, the missing values in the sample data corresponding to the independent variable by using the corresponding data filling method, wherein different missing rate intervals correspond to different data filling methods, and the data filling methods include at least two of label grouping filling, BETA distribution filling, random extraction filling, logistic regression filling, and mean filling.

The embodiment further provides a data filling device, which may include:

An acquisition module, configured to acquire sample data and an objective function, wherein the sample data includes data corresponding to at least one parameter among salary income, working time, and repayment records, the objective function takes the at least one parameter as an independent variable, and the output target variable of the objective function is a user's overdue probability;

A missing rate calculation module, configured to traverse the sample data according to the independent variable included in the objective function to obtain a traversal result, and to calculate, according to the traversal result, a data missing rate corresponding to the independent variable;

A data filling module, configured to fill in, according to the missing rate interval to which the data missing rate belongs, the missing values in the sample data corresponding to the independent variable by using the corresponding data filling method, wherein different missing rate intervals correspond to different data filling methods, and the data filling methods include at least two of label grouping filling, BETA distribution filling, random extraction filling, logistic regression filling, and mean filling.

The embodiment further provides a computer readable storage medium storing computer executable instructions for performing any of the above methods.

This embodiment also provides a data processing device, including one or more processors, a memory, and one or more programs, the one or more programs being stored in the memory and, when executed by the one or more processors, performing any of the methods described above.

The embodiment further provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform any of the methods described above.

This embodiment can improve the efficiency of filling missing values and can ensure the validity of the filled data, so that when the filled data is used for modeling or machine learning, for example when a machine learning model calculates a user's credit overdue probability, the accuracy of the calculated overdue probability can be improved and users can be provided with better-matched services.

Description of the drawings

FIG. 1 is a flowchart of a data padding method according to an embodiment;

FIG. 2 is a flowchart of another data padding method according to an embodiment;

FIG. 3A is a flowchart of another data padding method according to an embodiment;

FIG. 3B is a graph showing BETA distributions corresponding to different values of the parameters α and β according to an embodiment;

FIG. 4 is a flowchart of another data padding method according to an embodiment;

FIG. 5 is a structural block diagram of a data missing value filling apparatus according to an embodiment;

FIG. 6 is a schematic structural diagram of hardware of a data processing device according to an embodiment.

Detailed description

FIG. 1 is a flowchart of a data filling method provided by this embodiment. This embodiment is applicable to cases where missing data needs to be filled. The method may be performed by a data filling device running on a computing device such as a computer, and the data filling device can be implemented in at least one of software and hardware. As shown in FIG. 1, the method provided in this embodiment may include the following steps:

In step 110, the sample data and the objective function are acquired, wherein the sample data includes data corresponding to at least one of a salary income, a working time, and a repayment record, and the objective function takes the at least one parameter as an independent variable. The output target variable of the objective function is the user's overdue probability.

The sample data may also be called original data, and the objective function may include a logistic regression model function or a neural network model function. The target variable output by the objective function may be a user's repayment overdue probability, referred to as the overdue probability, and the original data may be sample data used to predict the user's overdue probability. For example, the sample data may include information such as the user's salary income, working years, and repayment records, each of which may serve as an independent variable. Missing data can be referred to as missing values, which represent the portions of the acquired original data (such as big data) whose content is absent. If missing values exist in the original data, modeling or training with the corresponding objective function may produce a biased model and unsatisfactory training results.

Missing values may be caused by mechanical reasons (such as data loss during collection or storage) or human reasons (such as staff's subjective errors or historical limitations). According to their distribution, missing values can be divided into missing completely at random (the missingness is random and does not depend on any incomplete or complete variable), missing at random (the missingness is not completely random, that is, it depends on other complete variables), and missing not at random (the missingness depends on the incomplete variable itself). According to the attributes affected, missing values can be classified as univariate missing (all missing values belong to the same attribute) and arbitrary missing (the missing values belong to different attributes).

In step 120, the sample data is traversed according to the independent variable included in the objective function to obtain a traversal result; and according to the traversal result, a data missing rate corresponding to the independent variable is calculated.

The data missing rate in the original data can be determined by a program. For example, a logistic regression model function has seven independent variables, each of which contains multiple data items that the program reads in sequence. When a returned value is null, the data item is missing and the count of missing data is incremented by 1. After all data has been traversed, the data missing rate of the original data can be computed.

For example, suppose the sample data contains information on 100 users, of whom 70 have salary information and the remaining 30 do not. The data missing rate for the salary independent variable is then 30%, and the salary information of those 30 users needs to be filled in.
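
The counting procedure described above can be sketched as follows. This is a minimal illustration, not the patent's own program: it assumes each independent variable is held as a Python list in which a missing entry is represented by `None`, and the function name `missing_rate` is hypothetical.

```python
from typing import Optional, Sequence


def missing_rate(values: Sequence[Optional[float]]) -> float:
    """Traverse one independent variable and return the fraction of null entries."""
    if not values:
        raise ValueError("cannot compute a missing rate for an empty variable")
    # Assumption: a missing entry is represented by None.
    missing = sum(1 for v in values if v is None)
    return missing / len(values)


# The salary example from the text: 100 users, 30 without salary information.
salaries = [5000.0] * 70 + [None] * 30
rate = missing_rate(salaries)
```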

In step 130, according to the missing rate interval to which the data missing rate belongs, the corresponding data filling method is used to fill in the missing values in the sample data corresponding to the independent variable, wherein different missing rate intervals correspond to different data filling methods, and the data filling methods include at least two of label grouping filling, BETA distribution filling, random extraction filling, logistic regression filling, and mean filling.

The missing data can be filled in automatically according to the data missing rate determined in step 120. The objective function may be a function for calculating the user's overdue probability. The original data containing user information is traversed according to the variables involved in the objective function, such as the user's salary income, working years, and repayment records; the data missing rate of each variable is calculated from its traversal result; and a data filling method is selected according to that rate to fill in the missing sample data of the variable, ensuring the integrity of the sample data.

Optionally, when the data missing rate is very high, for example 99% or more, a data abnormality alarm may be issued, with alarm content such as “manual inspection recommended”, or the original data may be discarded directly. When the data missing rate is low, that is, most of the data is complete and only a small part is missing (for example, a missing rate below 5%), the data can be filled by logistic regression. When the data missing rate is in the (70%, 99%) interval, the missing values can be filled by label grouping; when the data missing rate is in the (5%, 70%) interval, the missing values can be filled by BETA distribution filling.
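
The interval-based selection can be sketched as a simple lookup. The sketch below is an assumption-laden illustration: the function name and the string labels are hypothetical, and the boundary handling (70% belongs to the middle interval, 5% to the lowest) follows the conditions stated for steps 230, 330, and 430 in the embodiments of FIG. 2 to FIG. 4. Within an interval, the final choice among the listed candidates depends on the significance and correlation checks described in those embodiments.

```python
from typing import List


def candidate_fill_methods(missing_rate: float) -> List[str]:
    """Map a missing rate in [0, 1] to the filling methods named for its interval."""
    if not 0.0 <= missing_rate <= 1.0:
        raise ValueError("missing_rate must be between 0 and 1")
    if missing_rate >= 0.99:
        # Very high missing rate: raise an alarm or discard the data.
        return ["alarm_or_discard"]
    if missing_rate > 0.70:
        return ["label_grouping"]
    if missing_rate > 0.05:
        return ["random_extraction", "beta_distribution", "label_grouping"]
    return ["mean", "logistic_regression"]
```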

In this embodiment, the original data is reasonably retained, which avoids the reduction in data volume that results from deleting an entire record because one or some of its variables are missing. Different data filling methods are adopted for different data missing rates. While the original information and attributes of the missing-value part are retained, the impact on the data distribution and on the attributes of the missing data is reduced. Data filling can be performed automatically, which improves filling efficiency and reduces manual labor.

In the related art, missing values can be handled by deleting data records, by mean filling, or by manual filling. Deleting data records seriously harms the overall training of the model when the sample size is small and the training data is insufficient. Mean filling, when the data missing rate is high, seriously distorts the distribution of the original non-missing values, concentrating it at a single point; for data that is not missing at random, the information carried by the missingness is hidden after filling. Manual filling, in a big data environment with large data volumes, involves a heavy workload and low efficiency, relies heavily on human experience, and is not suitable for machine learning environments.

The embodiment provides a data filling method that acquires original data containing missing values together with an objective function, determines the data missing rate of the sample data, and fills in the missing values using the data filling method that corresponds to the size of the data missing rate. The data filling methods include at least one of label grouping filling, BETA distribution filling, random extraction filling, logistic regression filling, and mean filling. This improves the efficiency of filling missing values and ensures the validity of the filled data, so that when the filled data is used for modeling or machine learning, for example when a machine learning model calculates a user's credit overdue probability, the accuracy of the calculated overdue probability can be improved and the user can be provided with a well-matched service.

FIG. 2 is a flowchart of another data padding method provided by this embodiment. As shown in FIG. 2, the method may include the following steps:

In step 210, sample data and an objective function are obtained.

The sample data includes data corresponding to at least one parameter among wage income, working time, and repayment history; the objective function takes the at least one parameter as an independent variable; and the output target variable of the objective function is the user's overdue probability.

In step 220, the sample data is traversed according to the independent variable included in the objective function to obtain a traversal result; and according to the traversal result, a data missing rate corresponding to the independent variable is calculated.

In step 230, when the data missing rate is greater than 70% and less than 99%, the missing values in the sample data corresponding to the independent variable are filled by label grouping filling.

A data missing rate greater than 70% and less than 99% indicates that data is seriously missing. In this case, label grouping can improve the data filling efficiency.

For example, the data can be filled using two group labels (1/0), as shown in Table 1:

Table 1

User ID	X1	X11
001	.	1
002	0.9	0
003	0.8	0
004	.	1

For the variable X1, the data of users 001 and 004 is missing, so a corresponding dummy variable (X11) is added, and users 001 and 004 are assigned the value 1 in X11. The X1 values of users 002 and 003 are not missing, so both are assigned 0 in X11. This completes the filling of the missing values. Alternatively, variables with a very high missing rate (e.g., greater than 99%) can be deleted directly.
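
The dummy-variable construction of Table 1 can be sketched in a few lines. This is only an illustration under the assumption that missing entries are represented by `None`; the function name `label_group_fill` is hypothetical.

```python
from typing import List, Optional, Sequence


def label_group_fill(values: Sequence[Optional[float]]) -> List[int]:
    """Build the dummy variable: 1 where the original value is missing, else 0."""
    return [1 if v is None else 0 for v in values]


# Variable X1 for users 001 to 004, as in Table 1 (None marks a missing value).
x1 = [None, 0.9, 0.8, None]
x11 = label_group_fill(x1)
```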

In the data filling method provided in this embodiment, if the data missing rate is greater than 70% and less than 99%, missing values are filled by label grouping filling; that is, label grouping is used when the data missing rate is high, which improves the efficiency of data filling.

FIG. 3A is a flowchart of another data padding method provided by this embodiment. As shown in FIG. 3A, the method provided in this embodiment may include the following steps:

In step 310, sample data and an objective function are obtained.

The sample data includes data corresponding to at least one parameter among salary income, working time, and repayment history; the objective function takes the at least one parameter as an independent variable; and the output target variable of the objective function is the user's overdue probability.

In step 320, the sample data is traversed according to the independent variable included in the objective function to obtain a traversal result; and according to the traversal result, a data missing rate corresponding to the independent variable is calculated.

In step 330, when the data missing rate is greater than 5% and less than or equal to 70%, it is determined whether there is a significant difference between the target variable corresponding to the missing values in the sample data and the target variable corresponding to the non-missing values; if not, step 340 is performed; if so, step 350 is performed.

Correlation refers to the monotonic relationship between the variable and the target variable. The Spearman correlation can be used to judge correlation. Spearman correlation is a non-parametric statistical method that does not depend on the distribution of the variables; that is, whether or not the non-missing values are normally distributed, it can measure the strength and direction of the relationship between the non-missing values and the target variable. Spearman's rank correlation coefficient is calculated from the degree of monotonic correlation between the variable and the target variable, and reflects the degree of correlation between the non-missing values (that is, the above variable) and the target variable: the closer the coefficient is to 1 or -1, the stronger the correlation, with a positive coefficient indicating positive correlation and a negative coefficient indicating negative correlation.

A threshold range can be set for the Spearman coefficient. If the Spearman coefficient between the variable and the target variable falls within the set threshold range, the two are significantly correlated; otherwise, they are not significantly correlated.
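
As a rough sketch of the correlation check, Spearman's coefficient can be computed as the Pearson correlation of the ranks, with tied values receiving their average rank. The code below uses only the standard library. The function names and the threshold value of 0.5 are assumptions for illustration, since the text only says that a threshold range can be set.

```python
from typing import List, Sequence


def average_ranks(xs: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank correlation: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / (vx * vy) ** 0.5


def significantly_correlated(x, y, threshold: float = 0.5) -> bool:
    """Hypothetical rule: |rho| at or above the threshold counts as significant."""
    return abs(spearman(x, y)) >= threshold
```

Because the coefficient depends only on ranks, a monotonic but non-linear relationship (for example y = x^3) still yields a coefficient of 1.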

In step 340, data randomly extracted from the non-missing values is used to fill in the missing values in the sample data corresponding to the independent variable.

When it is judged that the non-missing values in the original data are not significantly correlated with the target variable, data is randomly extracted from the non-missing values for filling.
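
Random extraction filling can be sketched as drawing each fill value, with replacement, from the observed values of the same variable. The function name is hypothetical, missing entries are assumed to be `None`, and drawing with replacement is an assumption, since the text does not specify the sampling scheme.

```python
import random
from typing import List, Optional, Sequence


def random_extraction_fill(
    values: Sequence[Optional[float]], rng: Optional[random.Random] = None
) -> List[float]:
    """Replace each missing entry with a value drawn at random from the observed ones."""
    rng = rng or random.Random()
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no non-missing values to draw from")
    return [rng.choice(observed) if v is None else v for v in values]


# A seeded generator makes the fill reproducible.
filled = random_extraction_fill([None, 0.9, 0.8, None], random.Random(0))
```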

In step 350, it is determined whether the non-missing values are significantly correlated with the target variable; if so, step 360 is performed; otherwise, step 370 is performed.

When a significant difference is found in step 330, it is further determined whether the non-missing values are significantly correlated with the target variable (the dependent variable). A univariate regression model can be established from the non-missing values and the target variable, for example Y=β0+β1X, where Y is the target variable and X is the non-missing value. The values of β0 and β1 can be calculated from this formula: if β1 is 0, the non-missing values are unrelated to the target variable; if β1 is not 0, the non-missing values are related to the target variable.
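
A minimal sketch of fitting Y=β0+β1X by ordinary least squares is given below. The function names are hypothetical. Because an estimated slope is rarely exactly zero on real data, the sketch compares it against a small tolerance `tol`, which is an assumption added here; a formal significance test on β1 would be the usual statistical alternative.

```python
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for Y = b0 + b1 * X; returns (b0, b1)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length of at least 2")
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x is constant, so the slope is undefined")
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    return b0, b1


def is_related(x: Sequence[float], y: Sequence[float], tol: float = 1e-9) -> bool:
    """Hypothetical check: treat the variables as related when |b1| exceeds tol."""
    _, b1 = fit_line(x, y)
    return abs(b1) > tol
```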

In step 360, a left-skewed or right-skewed BETA distribution is constructed according to the direction of correlation and the degree of difference, and the missing values in the sample data corresponding to the independent variable are filled using the BETA distribution.

The degree of difference refers to the degree of difference between the target variable corresponding to the missing values and the target variable corresponding to the non-missing values. It may be determined by analysis of variance: for example, the variance of the overdue probabilities of the users who have salary information and the variance of the overdue probabilities of the users who lack salary information are calculated separately, and the degree of difference is determined from the results.

For example, a left-skewed or right-skewed distribution can be formed by adjusting the parameters α and β of the BETA distribution; that is, a left-skewed or right-skewed BETA distribution is constructed within the value range of the non-missing values.

Alternatively, random scattering within the extreme part of the non-missing value distribution may be used to fill the missing values. The extreme part can be understood as the data range in which the maximum or minimum of the non-missing values lies.

The skew of the BETA distribution is related to the missing-value part and the non-missing part of the variable and to the target variable. The skewness of the BETA distribution is determined by the correlation between the missing-value part and the target variable: for example, the stronger the correlation, the greater the left or right skewness of the BETA distribution, and the higher the probability that a randomly generated fill value is an extreme value. An extreme value can be understood as the maximum or minimum value, or a value within the data range containing the maximum or minimum.

For example, the mean of the BETA distribution is AVG = α / (α + β) and its variance is VAR = α * β / ((α + β) ^ 2 * (α + β + 1)), from which the following is derived (where r is an intermediate variable):

r=(AVG*(1-AVG)/VAR)-1

α=AVG*r

β=(1-AVG)*r

That is, α and β jointly determine the shape of the BETA distribution. When β>α, the missing value is more likely to take a small value, that is, the distribution is right-skewed; when β<α, the missing value is more likely to take a large value, that is, the distribution is left-skewed. The shape of the BETA distribution thus depends on AVG. When AVG lies between the minimum value MIN and the median P50 of the non-missing values, the BETA distribution is more likely to take large values, that is, it is left-skewed; when AVG lies between the median P50 and the maximum value MAX of the non-missing values, the BETA distribution is more likely to take small values, that is, it is right-skewed. Illustratively, FIG. 3B shows the BETA distribution curves corresponding to different values of the parameters α and β.
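
The relations for r, α, and β can be checked numerically with a short sketch. The function names are hypothetical. The sketch assumes AVG and VAR have already been expressed on the unit interval [0, 1], where the BETA distribution is defined, and that a drawn value is mapped back to the range of the non-missing values by linear rescaling; neither step is spelled out in the text.

```python
import random
from typing import Tuple


def beta_params(avg: float, var: float) -> Tuple[float, float]:
    """Recover (alpha, beta) from the mean and variance of a BETA distribution."""
    if not 0.0 < avg < 1.0:
        raise ValueError("avg must lie strictly between 0 and 1")
    if not 0.0 < var < avg * (1.0 - avg):
        raise ValueError("var must satisfy 0 < var < avg * (1 - avg)")
    r = avg * (1.0 - avg) / var - 1.0  # r equals alpha + beta
    return avg * r, (1.0 - avg) * r


def draw_fill_value(
    alpha: float, beta: float, low: float, high: float, rng: random.Random
) -> float:
    """Draw from Beta(alpha, beta) and rescale linearly to [low, high] (assumption)."""
    return low + (high - low) * rng.betavariate(alpha, beta)


# Round trip: Beta(2, 5) has mean 2/7 and variance 10/392.
alpha, beta = beta_params(2 / 7, 10 / 392)
value = draw_fill_value(alpha, beta, 1000.0, 9000.0, random.Random(1))
```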

In this embodiment, the correlation ρ relating the missing values, the non-missing values, and the target variable, together with P50, MAX, and MIN of the non-missing values, jointly determines α and β of the estimated value distribution for the missing values, and thereby the shape of the BETA distribution. A new mean New_AVG is constructed from P50, MAX, and MIN of the non-missing values, and α and β are calculated from New_AVG and the VAR of the non-missing part, where New_AVG is calculated as follows:

When the missing value is more likely to take a small value (that is, when the distribution is right-skewed):

New_AVG=(MAX-P50)*|ρ|+P50;

When the missing value is more likely to take a large value (that is, when the distribution is left-skewed):

New_AVG=P50-(P50-MIN)*|ρ|.
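
The two New_AVG formulas can be transcribed directly, as in the sketch below. The function name and the boolean flag `small_values_likely` are hypothetical; the flag simply selects the first or the second formula as they are written above. With ρ = 0 both formulas return P50, and as |ρ| approaches 1 the result moves toward MAX or MIN respectively.

```python
def new_avg(
    p50: float, vmin: float, vmax: float, rho: float, small_values_likely: bool
) -> float:
    """Transcription of the two New_AVG formulas given in the text."""
    if not vmin <= p50 <= vmax:
        raise ValueError("expected MIN <= P50 <= MAX")
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")
    weight = abs(rho)
    if small_values_likely:
        # First formula: New_AVG = (MAX - P50) * |rho| + P50
        return (vmax - p50) * weight + p50
    # Second formula: New_AVG = P50 - (P50 - MIN) * |rho|
    return p50 - (p50 - vmin) * weight
```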

In step 370, the missing values in the sample data corresponding to the independent variable are filled by label grouping filling.

The embodiment provides a data filling method that improves the efficiency of filling missing values and ensures the validity of the filled data, so that when the filled data is used for modeling or machine learning, for example when a machine learning model calculates a user's credit overdue probability, the accuracy of the calculated overdue probability can be improved and the user can be provided with a well-matched service.

FIG. 4 is a flowchart of another data padding method provided by this embodiment. As shown in FIG. 4, the method provided in this embodiment may include the following steps:

In step 410, sample data and an objective function are obtained.

In step 420, the sample data is traversed according to the independent variable included in the objective function to obtain a traversal result; and according to the traversal result, a data missing rate corresponding to the independent variable is calculated.

In step 430, when the data missing rate is less than or equal to 5%, it is determined whether the non-missing values in the sample data are significantly correlated with the target variable; if not, step 440 is performed; if so, step 450 is performed.

In step 440, the missing values in the sample data corresponding to the independent variable are filled by mean filling.

Mean filling refers to calculating the mean of the non-missing part of the variable and filling that mean into the missing part. Alternatively, the mean can be replaced by the median or the mode.
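
Mean filling and its median and mode variants can be sketched with Python's standard `statistics` module. The function name `central_fill` and the `stat` parameter are hypothetical, and missing entries are assumed to be `None`.

```python
import statistics
from typing import List, Optional, Sequence


def central_fill(values: Sequence[Optional[float]], stat: str = "mean") -> List[float]:
    """Fill missing entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no non-missing values to summarise")
    funcs = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }
    if stat not in funcs:
        raise ValueError("stat must be 'mean', 'median', or 'mode'")
    fill = funcs[stat](observed)
    return [fill if v is None else v for v in values]
```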

In step 450, the missing values in the sample data corresponding to the independent variable are filled by logistic regression filling.

For example, a univariate logistic regression model is established from the non-missing values X and the target variable Y (log(P/(1-P)) in logistic regression), and β0 (Intercept) and β1 (Estimate) are calculated. Then, from the target variable Y of the missing part (the average of Y over the missing part) and the β0 and β1 obtained in the previous step, the estimate of the missing value is derived as X=(Y-β0)/β1, and this X value is filled in as the missing value.
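
The inversion step X=(Y-β0)/β1 can be sketched as follows. The sketch assumes β0 and β1 have already been obtained by fitting the univariate logistic regression on the non-missing part (the fitting itself is not shown), and it interprets Y for the missing part as the log-odds of that group's average overdue rate, which is one possible reading of the text. The function names and the coefficient values in the example are hypothetical.

```python
import math


def logit(p: float) -> float:
    """Log-odds log(P / (1 - P)) of a probability strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def estimate_missing_x(y_missing: float, b0: float, b1: float) -> float:
    """Invert Y = b0 + b1 * X to estimate the fill value X for the missing part."""
    if b1 == 0:
        raise ValueError("b1 is zero, so X cannot be recovered from Y")
    return (y_missing - b0) / b1


# Hypothetical coefficients and a hypothetical 50% overdue rate in the missing part.
x_fill = estimate_missing_x(logit(0.5), b0=-3.0, b1=0.5)
```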

This embodiment provides a data padding method, which improves the filling efficiency of data missing values, so that the filled data is more accurate when performing modeling or machine learning calculations.

On the basis of the foregoing, after the missing values are filled according to the size of the data missing rate, the method further includes: calculating the weight value of each variable in the original data, and determining, according to the weight values and the filled data, a trust index for results subsequently calculated from the filled data.

While the missing values in the original data are being filled by the data filling method provided by the present disclosure, the filled missing values are recorded accordingly, so that after a subsequent computing model produces a result from the filled data, a trust index can be given for that result.

For example, a logistic regression model has seven independent variables X1-X7, and the weight value (% of importance) of each independent variable can be estimated indirectly by Wald statistic (Wald ChiSq). Optionally, the trust index may be the sum of the weight values of the individual independent variables that are not missing, and the statistical process and statistical results are shown in Table 2:

Table 2

Optionally, before the filled data is fed into machine learning, whether to discard the data may be determined according to the level of the trust index.

Optionally, machine learning is performed only on filled data whose trust index is greater than 60%, to improve learning efficiency while achieving better learning outcomes.
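
The trust index described above, namely the sum of the weights of the independent variables that were not missing, can be sketched as follows. The weights in the example are hypothetical placeholder percentages, not values from Table 2, and the function names are illustrative only.

```python
from typing import Dict, Set


def trust_index(weights: Dict[str, float], missing: Set[str]) -> float:
    """Share of total variable weight carried by variables that were not missing."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    kept = sum(w for name, w in weights.items() if name not in missing)
    return kept / total


def keep_for_training(index: float, threshold: float = 0.60) -> bool:
    """Keep a filled record only if its trust index exceeds the 60% threshold."""
    return index > threshold


# Hypothetical importance percentages for X1 to X7; X2 and X6 were filled in.
weights = {"X1": 30, "X2": 20, "X3": 15, "X4": 15, "X5": 10, "X6": 5, "X7": 5}
index = trust_index(weights, missing={"X2", "X6"})
```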

FIG. 5 is a structural block diagram of a data missing value padding apparatus according to the embodiment. The device can perform the data padding method provided by the foregoing embodiment, and has the corresponding functional modules and beneficial effects of the execution method. As shown in FIG. 5, the apparatus may specifically include: an obtaining module 501, a missing rate calculating module 502, and a data filling module 503.

The obtaining module 501 is configured to acquire sample data and an objective function, wherein the sample data includes data corresponding to at least one parameter among salary income, working time, and repayment records, the objective function takes the at least one parameter as an independent variable, and the output target variable of the objective function is the user's overdue probability.

The missing rate calculation module 502 is configured to traverse the sample data according to the independent variable included in the objective function to obtain a traversal result; and calculate a data missing rate corresponding to the independent variable according to the traversal result.

The data filling module 503 is configured to fill in, according to the missing rate interval to which the data missing rate belongs, the missing values in the sample data corresponding to the independent variable by using the corresponding data filling method, wherein different missing rate intervals correspond to different data filling methods, and the data filling methods include at least two of label grouping filling, BETA distribution filling, random extraction filling, logistic regression filling, and mean filling.

In this embodiment, original data containing missing values and an objective function are acquired, the data missing rate of the original data is determined, and the missing values are filled using the data filling method that corresponds to the size of the data missing rate. The data filling methods include at least one of label grouping filling, BETA distribution filling, random extraction filling, logistic regression filling, and mean filling. This improves the efficiency of filling missing values and ensures the validity of the filled data, so that when the filled data is used for modeling or machine learning, for example when a machine learning model calculates a user's credit overdue probability, the accuracy of the calculated overdue probability can be improved and the user can be provided with a well-matched service.

Optionally, the data padding module 503 is configured to: if the data missing rate is greater than 70% and less than 99%, perform data missing value padding by using a label grouping padding manner.

Optionally, the data padding module 503 is configured to: if the data missing rate is greater than 5% and less than or equal to 70%, determine whether there is a significant difference between the target variable corresponding to the missing value in the sample data and the target variable corresponding to the non-missing value; and when there is no significant difference between the two, fill the missing values in the sample data corresponding to the independent variable with data randomly extracted from the non-missing values.
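The check and the random extraction described above can be sketched as follows. The text does not say which significance test is used, so the two-proportion z-test on 0/1 overdue flags, the 1.96 critical value, and both function names are assumptions chosen for illustration.

```python
import math
import random
from typing import List, Optional


def overdue_rates_differ(overdue_missing: List[int], overdue_present: List[int],
                         z_critical: float = 1.96) -> bool:
    """Two-proportion z-test on 0/1 overdue flags (an assumed test, about 5% level)."""
    n1, n2 = len(overdue_missing), len(overdue_present)
    if n1 == 0 or n2 == 0:
        return False
    p1, p2 = sum(overdue_missing) / n1, sum(overdue_present) / n2
    pooled = (sum(overdue_missing) + sum(overdue_present)) / (n1 + n2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if std_err == 0:
        return False  # every flag is identical, so the groups cannot differ
    return abs(p1 - p2) / std_err > z_critical


def fill_by_random_extraction(values: List[Optional[float]],
                              rng: random.Random) -> List[float]:
    """Replace each None with a value drawn at random from the non-missing values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no non-missing values to draw from")
    return [v if v is not None else rng.choice(observed) for v in values]


# Illustrative data only: working_time values and the matching overdue flags.
values = [3.0, None, 5.0, None, 2.0]
overdue = [0, 1, 0, 0, 1]
flags_missing = [o for v, o in zip(values, overdue) if v is None]
flags_present = [o for v, o in zip(values, overdue) if v is not None]
if not overdue_rates_differ(flags_missing, flags_present):
    filled = fill_by_random_extraction(values, random.Random(0))
```

Drawing from the observed values keeps the filled variable on the same empirical distribution as the non-missing data, which is reasonable only when missingness carries no information about the target, hence the significance check first.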

Optionally, the data padding module 503 is further configured to: when there is a significant difference between the target variable corresponding to the missing value in the sample data and the target variable corresponding to the non-missing value, determine whether the non-missing value is significantly correlated with the target variable; when the non-missing value is significantly correlated with the target variable, construct a left-skewed or right-skewed BETA distribution according to the direction of the correlation and the degree of difference, and fill the missing values in the sample data corresponding to the independent variable by using the BETA distribution; and when the non-missing value is not significantly correlated with the target variable, fill the missing values in the sample data corresponding to the independent variable by means of label grouping padding.
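One possible reading of the BETA distribution padding is sketched below. The text only states that the skew follows the correlation direction and the degree of difference; the specific mapping to shape parameters, the `strength` factor, the rescaling to the observed value range, and the function names are all assumptions for illustration.

```python
import random
from typing import List, Optional, Tuple


def beta_shape(correlation: float, rate_gap: float,
               strength: float = 10.0) -> Tuple[float, float]:
    """Pick (alpha, beta) for the fill distribution (assumed mapping).

    correlation: correlation between the non-missing values and the overdue target.
    rate_gap: overdue rate of the missing group minus that of the non-missing group.
    """
    skew = 1.0 + strength * abs(rate_gap)  # a larger gap gives a stronger skew
    # Same sign means the missing group behaves like users with high values.
    high_values_expected = (correlation > 0) == (rate_gap > 0)
    # alpha > beta concentrates mass near 1 (left-skewed); alpha < beta near 0.
    return (skew, 1.0) if high_values_expected else (1.0, skew)


def fill_by_beta(values: List[Optional[float]], alpha: float, beta: float,
                 rng: random.Random) -> List[float]:
    """Replace each None with a Beta(alpha, beta) draw rescaled to the observed range."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no non-missing values to define the range")
    low, high = min(observed), max(observed)
    return [v if v is not None else low + (high - low) * rng.betavariate(alpha, beta)
            for v in values]


# Illustrative numbers only.
alpha, beta = beta_shape(correlation=0.4, rate_gap=0.2)
filled = fill_by_beta([3000.0, None, 9000.0, None], alpha, beta, random.Random(0))
```

Under this mapping, a positive correlation combined with a higher overdue rate in the missing group pushes the filled values toward the upper end of the observed range, and the opposite combination pushes them toward the lower end.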

Optionally, the data padding module 503 is configured to: if the data missing rate is less than or equal to 5%, determine whether the non-missing value in the sample data is significantly correlated with the target variable; if the non-missing value is not significantly correlated with the target variable, fill the missing values in the sample data corresponding to the independent variable by means of mean padding; and if the non-missing value is significantly correlated with the target variable, fill the missing values in the sample data corresponding to the independent variable by means of logistic regression padding.
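Taken together, the interval rules of module 503 amount to a small decision procedure, sketched below. The function name, the boolean inputs, and the returned strings are hypothetical; the sketch treats exactly 5% as part of the lowest interval, following the claims, and returns `None` for rates of 99% or more because the text assigns no padding manner to that range.

```python
from typing import Optional


def choose_padding_manner(missing_rate: float,
                          target_differs: bool,
                          correlated_with_target: bool) -> Optional[str]:
    """Map a missing rate (0.0 to 1.0) and two test outcomes to a padding manner.

    target_differs: the target variable differs significantly between records
        with a missing value and records with a non-missing value.
    correlated_with_target: the non-missing values are significantly
        correlated with the target variable.
    """
    if not 0.0 <= missing_rate <= 1.0:
        raise ValueError("missing_rate must be between 0 and 1")
    if 0.70 < missing_rate < 0.99:
        return "label grouping padding"
    if 0.05 < missing_rate <= 0.70:
        if not target_differs:
            return "random extraction padding"
        if correlated_with_target:
            return "BETA distribution padding"
        return "label grouping padding"
    if missing_rate <= 0.05:
        if correlated_with_target:
            return "logistic regression padding"
        return "mean padding"
    return None  # 99% or more: no padding manner is specified in the text


manner = choose_padding_manner(0.30, target_differs=True, correlated_with_target=True)
```

The two boolean inputs are only consulted in the intervals that need them, so a caller could defer the significance tests until the interval is known.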

Optionally, the device may further include a padding result evaluation module, configured to: after the missing values in the sample data corresponding to the independent variable have been filled using the data padding manner that corresponds to the missing rate interval to which the data missing rate belongs, calculate the weight value of the independent variable in the objective function, and determine, according to the weight value and the padded data, a trust index for the results of subsequent calculations based on the padded data. That is, when the padded data is used to calculate the user's credit overdue probability, the accuracy of the calculation result is evaluated.

The embodiment further provides a storage medium comprising computer executable instructions for performing a data padding method when executed by a computer processor, the method comprising:

Obtaining sample data and an objective function, wherein the sample data includes data corresponding to at least one parameter of salary income, working time, and repayment record, the objective function takes the at least one parameter as an independent variable, and the output target variable of the objective function is the user's overdue probability;

Filling, according to the missing rate interval to which the data missing rate belongs, the missing values in the sample data corresponding to the independent variable by using the corresponding data padding manner, wherein different missing rate intervals correspond to different data padding manners, and the data padding manners include at least two of label grouping padding, BETA distribution padding, random extraction padding, logistic regression padding, and mean padding.

The computer-executable instructions can also perform any of the data padding methods provided by the foregoing embodiments when executed by the computer processor. For details, refer to the flow of the methods provided by the foregoing embodiments.

The present embodiment further provides a data processing device, which may be a data padding device. FIG. 6 is a hardware structure diagram of the data processing device provided by this embodiment. As shown in FIG. 6, the data processing device may include a processor 610 and a memory 620, and may further include a communication interface 630 and a bus 640.

The processor 610, the memory 620, and the communication interface 630 can complete communication with each other through the bus 640. Communication interface 630 can be used for information transmission. Processor 610 can invoke logic instructions in memory 620 to perform any of the methods of the above-described embodiments.

The memory 620 can include a storage program area and a storage data area. The storage program area can store an operating system and an application required for at least one function. The storage data area can store data and the like created according to the use of the data processing device. Further, the memory may include volatile memory, such as random access memory, and may also include non-volatile memory, for example at least one disk storage device, flash memory device, or other non-transitory solid-state storage device.

Moreover, when the logic instructions in the memory 620 described above are implemented in the form of software functional units and sold or used as separate products, the logic instructions can be stored in a computer readable storage medium. The technical solution of the present disclosure may be embodied in the form of a computer software product, which may be stored in a storage medium and includes a plurality of instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method of the embodiment.

All or part of the processes in the foregoing embodiments may be completed by a computer program instructing related hardware. The program may be stored in a non-transitory computer readable storage medium, and when executed, the program may include the flow of the embodiments of the method described above.

The above storage medium may be any of a plurality of types of memory devices or storage devices, and may include: an installation medium such as a CD-ROM, a floppy disk, or a tape device; computer system memory or random access memory such as DRAM, DDR RAM, SRAM, EDO RAM, Rambus RAM, etc.; non-volatile memory, such as flash memory, magnetic media (such as a hard disk), or optical storage; registers or similar types of memory components; and the like. The storage medium may also include multiple types of memory or a combination of memories. Additionally, the storage medium may be located in a first computer system in which the program is executed, or may be located in a different second computer system that is coupled to the first computer system via a network, such as the Internet. The second computer system can provide program instructions to the first computer system for execution. The storage medium may also include two or more storage media that reside in different locations (e.g., in different computer systems connected through a network). A storage medium may store program instructions, such as a computer program, executable by one or more processors.

Industrial applicability

The present disclosure provides a data padding method and apparatus that can improve the efficiency of filling missing values and can ensure the validity of the padded data, so that when the padded data is used for modeling or machine learning calculation, for example when a machine learning model calculates the user's credit overdue probability, the accuracy of the overdue probability result can be improved, thereby providing the user with a highly matched service.

Claims

A data padding method, comprising:

Obtaining sample data and an objective function, wherein the sample data includes data corresponding to at least one parameter of salary income, working time, and repayment record, the objective function takes the at least one parameter as an independent variable, and the output target variable of the objective function is the user's overdue probability;

Traversing the sample data according to the independent variable included in the objective function to obtain a traversal result;

Calculating a data missing rate corresponding to the independent variable according to the traversal result;

Filling, according to the missing rate interval to which the data missing rate belongs, the missing values in the sample data corresponding to the independent variable by using the corresponding data padding manner, wherein different missing rate intervals correspond to different data padding manners, and the data padding manners include at least two of label grouping padding, BETA distribution padding, random extraction padding, logistic regression padding, and mean padding.
The method according to claim 1, wherein filling the missing values in the sample data corresponding to the independent variable by using the corresponding data padding manner according to the missing rate interval to which the data missing rate belongs comprises:

When the data missing rate is greater than 70% and less than 99%, the missing values in the sample data corresponding to the independent variable are filled by means of label grouping padding.
The method according to claim 1, wherein filling the missing values in the sample data corresponding to the independent variable by using the corresponding data padding manner according to the missing rate interval to which the data missing rate belongs comprises:

When the data missing rate is greater than 5% and less than or equal to 70%, it is determined whether the target variable corresponding to the missing value in the sample data has a significant difference from the target variable corresponding to the non-missing value;

When there is no significant difference between the target variable corresponding to the missing value in the sample data and the target variable corresponding to the non-missing value, data randomly extracted from the non-missing values is used to fill the missing values in the sample data corresponding to the independent variable.
The method according to claim 3, wherein after determining whether the target variable corresponding to the missing value in the sample data has a significant difference from the target variable corresponding to the non-missing value, the method further comprises:

When the target variable corresponding to the missing value in the sample data has a significant difference from the target variable corresponding to the non-missing value, determining whether the non-missing value is significantly correlated with the target variable;

When the non-missing value is significantly correlated with the target variable, a left-skewed or right-skewed BETA distribution is constructed according to the direction of the correlation and the degree of difference, and the missing values in the sample data corresponding to the independent variable are filled by using the BETA distribution.
The method according to claim 4, wherein after determining whether the non-missing value is significantly correlated with the target variable, the method further comprises:

When the non-missing value is not significantly correlated with the target variable, the missing value is filled in the sample data corresponding to the independent variable by using a label grouping padding method.
The method according to claim 1, wherein filling the missing values in the sample data corresponding to the independent variable by using the corresponding data padding manner according to the missing rate interval to which the data missing rate belongs comprises:

When the data missing rate is less than or equal to 5%, it is determined whether the non-missing value in the sample data is significantly correlated with the target variable;

When the non-missing value is significantly correlated with the target variable, the missing value is filled in the sample data corresponding to the independent variable by means of logistic regression.
The method according to claim 6, wherein after determining whether the non-missing value in the sample data is significantly correlated with the target variable, the method further comprises:

When the non-missing value is not significantly correlated with the target variable, the missing value is filled in the sample data corresponding to the independent variable by means of mean padding.
The method according to any one of claims 1 to 7, wherein after filling the missing values in the sample data corresponding to the independent variable by using the corresponding data padding manner according to the missing rate interval to which the data missing rate belongs, the method further comprises:

Calculating a weight value of the independent variable in the objective function, and determining a trust index of the subsequent calculation result according to the weight value and the padded data.
A data filling device comprising:

An acquisition module, configured to acquire sample data and an objective function, wherein the sample data includes data corresponding to at least one parameter of salary income, working time, and repayment record, the objective function takes the at least one parameter as an independent variable, and the output target variable of the objective function is a user's overdue probability;

A missing rate calculation module, configured to traverse the sample data according to the independent variable included in the objective function to obtain a traversal result, and to calculate, according to the traversal result, a data missing rate corresponding to the independent variable;

A data padding module, configured to fill, according to the missing rate interval to which the data missing rate belongs, the missing values in the sample data corresponding to the independent variable by using the corresponding data padding manner, wherein different missing rate intervals correspond to different data padding manners, and the data padding manners include at least two of label grouping padding, BETA distribution padding, random extraction padding, logistic regression padding, and mean padding.
The apparatus of claim 9 wherein said data padding module is configured to:

When the data missing rate is greater than 70% and less than 99%, fill the missing values in the sample data corresponding to the independent variable by means of label grouping padding.
The apparatus of claim 9 wherein said data padding module is configured to:

When the data missing rate is greater than 5% and less than or equal to 70%, determine whether the target variable corresponding to the missing value in the sample data has a significant difference from the target variable corresponding to the non-missing value;

When there is no significant difference between the target variable corresponding to the missing value in the sample data and the target variable corresponding to the non-missing value, randomly extract data from the non-missing values and use it to fill the missing values in the sample data corresponding to the independent variable.
The apparatus according to claim 11, wherein the data padding module is further configured to: after determining whether the target variable corresponding to the missing value in the sample data has a significant difference from the target variable corresponding to the non-missing value, when there is a significant difference between the target variable corresponding to the missing value in the sample data and the target variable corresponding to the non-missing value, determine whether the non-missing value is significantly correlated with the target variable;

When the non-missing value is significantly correlated with the target variable, construct a left-skewed or right-skewed BETA distribution according to the direction of the correlation and the degree of difference, and fill the missing values in the sample data corresponding to the independent variable by using the BETA distribution.
The apparatus according to claim 12, wherein said data padding module is further configured to: after determining whether said non-missing value is significantly correlated with the target variable, when said non-missing value is not significantly correlated with the target variable, fill the missing values in the sample data corresponding to the independent variable by using a label grouping padding manner.
The apparatus of claim 9, wherein the data padding module is further configured to:

When the data missing rate is less than or equal to 5%, determine whether the non-missing value in the sample data is significantly correlated with the target variable;

When the non-missing value is significantly correlated with the target variable, the missing value is filled in the sample data corresponding to the independent variable by means of logistic regression.
The apparatus of claim 14, wherein the data padding module is further configured to: after determining whether the non-missing value in the sample data is significantly correlated with the target variable, when the non-missing value is not significantly correlated with the target variable, fill the missing values in the sample data corresponding to the independent variable by means of mean padding.
A computer readable storage medium storing computer executable instructions, the computer executable instructions being used to perform the method of any of claims 1-8.