WO2019075599A1 - Data filling method and device - Google Patents

Data filling method and device

Info

Publication number
WO2019075599A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
missing
padding
missing value
sample data
Prior art date
Application number
PCT/CN2017/106280
Other languages
French (fr)
Chinese (zh)
Inventor
赵敏
林磊
Original Assignee
深圳乐信软件技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳乐信软件技术有限公司
Priority to CN201780039488.0A priority Critical patent/CN109564641B/en
Priority to PCT/CN2017/106280 priority patent/WO2019075599A1/en
Publication of WO2019075599A1 publication Critical patent/WO2019075599A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00: Computer-aided design [CAD]
    • G06F30/20: Design optimisation, verification or simulation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00: Computer-aided design [CAD]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N99/00: Subject matter not provided for in other groups of this subclass

Definitions

  • the present disclosure relates to the field of data processing technologies, for example, to a data padding method and apparatus.
  • Missing data may carry useful or critical information. If missing values are not properly processed, they may affect the construction of subsequent models, such as logistic regression and neural network models, and reduce the training effect of the machine learning model.
  • In the field of e-commerce, when evaluating the credit of users, a corresponding machine learning model is usually used to calculate the overdue probability of a user, and the user's credit is then evaluated. If data is missing from the user sample data during training, the trained machine learning model may be unable to calculate the user's overdue probability accurately, which makes it impossible to provide the user with a well-matched service, such as adjusting the user's credit limit. In the related art, missing values are usually filled manually. Manual filling involves a large workload and low efficiency, and because it relies on human experience it cannot guarantee the validity of the filled data.
  • the present disclosure provides a data padding method and apparatus, which can improve the efficiency of data padding.
  • This embodiment provides a data padding method, which may include:
  • acquiring sample data and an objective function, wherein the sample data includes data corresponding to at least one parameter among wage income, working time, and repayment record, the objective function takes the at least one parameter as an independent variable, and the output target variable of the objective function is a user's overdue probability;
  • The data padding methods include at least two of label group padding, BETA distribution padding, random extraction padding, logistic regression padding, and mean padding.
  • the embodiment further provides a data filling device, which may include:
  • an acquisition module configured to acquire sample data and an objective function, wherein the sample data includes data corresponding to at least one parameter among salary income, working time, and repayment record, the objective function takes the at least one parameter as an independent variable, and the output target variable of the objective function is a user's overdue probability;
  • a missing rate calculation module configured to traverse the sample data according to the independent variable included in the objective function to obtain a traversal result, and to calculate, according to the traversal result, a data missing rate corresponding to the independent variable; and
  • a data padding module configured to adopt a corresponding data padding method according to the missing rate interval to which the data missing rate belongs, and to fill the missing values of the sample data corresponding to the independent variable, wherein different missing rate intervals correspond to different data padding methods, and the data padding methods include at least two of label group padding, BETA distribution padding, random extraction padding, logistic regression padding, and mean padding.
  • the embodiment further provides a computer readable storage medium storing computer executable instructions for performing any of the above methods.
  • This embodiment also provides a data processing device, the data processing device including one or more processors, a memory, and one or more programs, the one or more programs being stored in the memory and, when executed by the one or more processors, performing any of the methods described above.
  • The embodiment further provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform any of the methods described above.
  • This embodiment can improve the efficiency of filling missing data values and helps ensure the validity of the filled data. When the filled data is then used for modeling or machine learning, for example when a machine learning model calculates a user's credit overdue probability, the accuracy of the calculated overdue probability can be improved, so that users can be provided with better-matched services.
  • FIG. 1 is a flowchart of a data padding method according to an embodiment
  • FIG. 3A is a flowchart of another data padding method according to an embodiment
  • FIG. 3B is a graph showing BETA distributions corresponding to different values of the parameters α and β according to an embodiment
  • FIG. 5 is a structural block diagram of a data missing value filling apparatus according to an embodiment
  • FIG. 6 is a schematic structural diagram of hardware of a data processing device according to an embodiment.
  • FIG. 1 is a flowchart of a data padding method provided by this embodiment. This embodiment is applicable to the case where missing data is filled.
  • the method may be performed by a computing device such as a computer, and the method may be performed by a data padding device.
  • the data filling device can be implemented in at least one of software and hardware. As shown in FIG. 1, the method provided in this embodiment may include the following steps:
  • step 110 the sample data and the objective function are acquired, wherein the sample data includes data corresponding to at least one of a salary income, a working time, and a repayment record, and the objective function takes the at least one parameter as an independent variable.
  • the output target variable of the objective function is the user's overdue probability.
  • the sample data may also be called original data, and the objective function may include a logistic regression model function and a neural network model function.
  • The target variable output by the objective function may be a user's repayment overdue probability, referred to as the overdue probability, and the original data may be the sample data used to predict the user's overdue probability.
  • For example, the sample data may include information such as the user's salary income, working years, and repayment record, and each such item of the sample data may be referred to as an independent variable.
  • Missing data can be referred to as missing values; a missing value represents a piece of data content that is absent from the acquired raw data (such as big data). If the original data contains missing values, a model built or trained with the corresponding objective function may be biased, and the training effect may be unsatisfactory.
  • the missing value may be caused by mechanical reasons (such as data loss caused by data collection or preservation) or human reasons (such as staff's subjective errors or historical limitations).
  • By missing mechanism, the missing values can be divided into missing completely at random (the missingness is random and does not depend on any incomplete or complete variable), missing at random (the missingness is not completely random, that is, it depends on other complete variables), and missing not at random (the missingness depends on the incomplete variable itself).
  • By the attributes involved, the missing values can be classified as single-value missing (all missing values belong to the same attribute) and arbitrary missing (the missing values belong to different attributes).
  • step 120 the sample data is traversed according to the independent variable included in the objective function to obtain a traversal result; and according to the traversal result, a data missing rate corresponding to the independent variable is calculated.
  • the data missing rate in the raw data can be determined by a code program.
  • For example, a logistic regression model function has seven independent variables, each of which contains multiple data items that are read sequentially by the program. When the returned value is null, the data item is missing and the count of missing data is increased by 1. After all the data has been traversed in turn, the data missing rate of the original data can be computed.
  • For example, the sample data includes information on 100 users; salary information is present for 70 of them and missing for the remaining 30.
  • The data missing rate corresponding to the independent variable "salary information" is therefore 30%, and the salary information of these 30 users needs to be filled.
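The traversal described above can be illustrated with a minimal sketch. It assumes each user record is a dictionary and that a missing value is represented by `None`; the function name `missing_rates` and the record layout are hypothetical and are not taken from the disclosure.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[float]]


def missing_rates(records: List[Record], variables: List[str]) -> Dict[str, float]:
    """Return, per independent variable, the fraction of records whose value is None."""
    if not records:
        raise ValueError("records must not be empty")
    rates = {}
    for name in variables:
        # Count a record as missing when the key is absent or its value is None.
        missing = sum(1 for record in records if record.get(name) is None)
        rates[name] = missing / len(records)
    return rates


# 100 users: 70 with salary information, 30 without (the salary figure is made up).
records = [{"salary": 5000.0} for _ in range(70)] + [{"salary": None} for _ in range(30)]
rates = missing_rates(records, ["salary"])
```

With the 100-user example from the text, `rates["salary"]` is 0.3, that is, a 30% missing rate.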
  • In step 130, according to the missing rate interval to which the data missing rate belongs, a corresponding data filling method is adopted to fill the missing values in the sample data corresponding to the independent variable, wherein different missing rate intervals correspond to different data filling methods.
  • The data filling methods include at least two of label group filling, BETA distribution filling, random extraction filling, logistic regression filling, and mean filling.
  • The missing data can be filled automatically according to the data missing rate determined in step 120.
  • For example, the objective function may be a function for calculating the overdue probability of the user. The original data containing the user information is traversed according to the variables involved in the objective function, such as the user's salary income, working years, and repayment record; the data missing rate of each variable is calculated from the traversal result; and a data filling method is selected according to that missing rate to fill the missing sample data of the variable, which ensures the integrity of the sample data.
  • When the data missing rate is extremely high (for example, greater than 99%), a data abnormality alarm may be issued, with alarm content such as "manual detection recommended", or the original data may be discarded directly. When the data missing rate is low, that is, most of the data is complete and only a small part is missing (for example, a missing rate of less than 5%), the data can be filled by logistic regression. When the data missing rate is in the (70%, 99%) interval, the missing values can be filled by label group filling; when the missing rate is in the (5%, 70%) interval, the missing values can be filled by BETA distribution filling.
  • In this way the original data is reasonably retained, which avoids the drop in data volume caused by deleting an entire record because one or some of its variables are missing.
  • Different data filling methods are adopted for different data missing rates. The original information and attributes of the missing-value part are retained, the data distribution and the attributes of the missing data are restored, and data filling can be performed automatically, which improves filling efficiency and reduces the manual workload.
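The mapping from missing rate interval to filling method can be sketched as a small dispatch function. This is only an illustration: the function name and the returned labels are hypothetical, and the handling of rates at exactly 99% or above (treated here as the alarm/discard case) is an assumption, since the text only gives "greater than 99%" as an example.

```python
def choose_strategy(missing_rate: float) -> str:
    """Map a data missing rate in [0, 1] to a family of filling methods."""
    if not 0.0 <= missing_rate <= 1.0:
        raise ValueError("missing_rate must be between 0 and 1")
    if missing_rate == 0.0:
        return "nothing_to_fill"
    if missing_rate <= 0.05:
        # Low missing rate: mean filling or logistic regression filling.
        return "mean_or_logistic_regression"
    if missing_rate <= 0.70:
        # Middle interval: random extraction, BETA distribution, or label grouping.
        return "random_extraction_or_beta_or_label_group"
    if missing_rate < 0.99:
        return "label_group"
    # Assumption: rates of 99% and above trigger an alarm or discarding the variable.
    return "alarm_or_discard"
```

For the 30% salary example, `choose_strategy(0.30)` selects the middle interval.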
  • In the related art, missing data values can be handled by deleting data records, by mean padding, or by manual padding.
  • Each of these has drawbacks. With record deletion, when the sample size is small and the training data is already insufficient, the overall training effect of the model is seriously affected. With mean padding, a high data missing rate seriously distorts the distribution of the original non-missing values, concentrating the distribution at a single point; for values that are not missing at random, the information carried by the missingness is hidden after filling. With manual padding, the workload is heavy and the efficiency is low in a big data environment, and the method relies heavily on human experience and is not suitable for machine learning environments.
  • The embodiment provides a data padding method, which acquires the original data containing missing values and the objective function, determines the data missing rate of the sample data, and adopts a corresponding data padding method to fill the missing data values according to the size of the data missing rate.
  • The data padding methods include at least one of label group padding, BETA distribution padding, random extraction padding, logistic regression padding, and mean padding. This improves the efficiency of filling missing data values and helps ensure the validity of the filled data. Therefore, when the filled data is used for modeling or machine learning, for example when a machine learning model calculates a user's credit overdue probability, the accuracy of the calculated overdue probability can be improved, so that the user can be provided with a well-matched service.
  • FIG. 2 is a flowchart of another data padding method provided by this embodiment. As shown in FIG. 2, the method may include the following steps:
  • step 210 sample data and an objective function are obtained.
  • The sample data includes data corresponding to at least one parameter among wage income, working time, and repayment history; the objective function takes the at least one parameter as an independent variable, and the output target variable of the objective function is the user's overdue probability.
  • step 220 the sample data is traversed according to the independent variable included in the objective function to obtain a traversal result; and according to the traversal result, a data missing rate corresponding to the independent variable is calculated.
  • In step 230, when the data missing rate is greater than 70% and less than 99%, the missing values of the sample data corresponding to the independent variable are filled by label group padding.
  • A data missing rate greater than 70% and less than 99% indicates serious data loss.
  • In this case, the label grouping method can improve the data filling efficiency.
  • the data can be padded by two sets of markings (1/0), as shown in Table 1:
  • Take variable X1 as an example. If the X1 data of users 001 and 004 is missing, a corresponding dummy variable (X11) can be added, and users 001 and 004 are assigned the value 1 in X11. The X1 values of users 002 and 003 are not missing, so users 002 and 003 are both assigned 0 in X11. This completes the padding of the missing values.
  • variables with a high deletion rate eg, a deletion rate greater than 99%
  • In this embodiment, missing data values are filled by label group padding when the data missing rate is high, which improves the efficiency of data filling.
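Label group padding can be sketched as adding a 1/0 indicator column next to the original variable. The sketch below assumes missing values are represented by `None`; the function name, the sample X1 values, and the choice to write a constant placeholder into the original column are illustrative assumptions, not details given in the disclosure.

```python
from typing import List, Optional, Tuple


def label_group_fill(
    values: List[Optional[float]], placeholder: float = 0.0
) -> Tuple[List[float], List[int]]:
    """Return (filled column, dummy indicator) where the indicator is 1 for missing entries."""
    indicator = [1 if value is None else 0 for value in values]
    # Assumption: the original column receives a constant so downstream code sees no None.
    filled = [placeholder if value is None else value for value in values]
    return filled, indicator


# Users 001..004: X1 is missing for users 001 and 004 (the present values are made up).
x1 = [None, 3200.0, 4100.0, None]
x1_filled, x11 = label_group_fill(x1)
```

Here `x11` is `[1, 0, 0, 1]`, matching the assignment described for users 001 to 004.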
  • FIG. 3A is a flowchart of another data padding method provided by this embodiment. As shown in FIG. 3A, the method provided in this embodiment may include the following steps:
  • step 310 sample data and an objective function are obtained.
  • The sample data includes data corresponding to at least one parameter among salary income, working time, and repayment history; the objective function takes the at least one parameter as an independent variable, and the output target variable of the objective function is a user's overdue probability.
  • step 320 the sample data is traversed according to the independent variable included in the objective function to obtain a traversal result; and according to the traversal result, a data missing rate corresponding to the independent variable is calculated.
  • In step 330, when the data missing rate is greater than 5% and less than or equal to 70%, it is determined whether there is a significant difference between the target variable corresponding to the missing values in the sample data and the target variable corresponding to the non-missing values. If not, step 340 is performed; if so, step 350 is performed.
  • Correlation refers to the monotonic relationship between the variable and the target variable.
  • The Spearman correlation function can be used to judge correlation.
  • Spearman correlation is a non-parametric statistical method that does not depend on the distribution of the variables. That is, whether the non-missing values are normally distributed or not, it can find the degree and direction of the relationship between the non-missing values and the target variable.
  • Specifically, Spearman's rank correlation coefficient is calculated.
  • The Spearman coefficient reflects the degree of correlation between the non-missing values (that is, the above variable) and the target variable. The closer the coefficient is to 1 or -1, the greater the degree of correlation; a positive coefficient indicates positive correlation and a negative coefficient indicates negative correlation.
  • A threshold range can be set for the Spearman coefficient. If the Spearman coefficient between the variable and the target variable falls within the set threshold range, the two are significantly correlated; otherwise, they are not significantly correlated.
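The correlation judgment can be illustrated with a self-contained sketch that computes Spearman's rank correlation coefficient as the Pearson correlation of average ranks. The threshold of 0.5 and all function names are hypothetical; the disclosure does not state which threshold range is used.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Assign 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


def is_significantly_correlated(
    x: Sequence[float], y: Sequence[float], threshold: float = 0.5
) -> bool:
    # Assumption: "significant" means |coefficient| >= threshold.
    return abs(spearman(x, y)) >= threshold
```

Because only ranks are used, any monotonic relationship between the variable and the target variable yields a coefficient of 1 or -1, regardless of the distribution of the values.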
  • In step 340, data is randomly extracted from the non-missing values and used to fill the missing values of the sample data corresponding to the independent variable.
  • That is, when there is no significant difference, values drawn at random from the non-missing values are used for padding.
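Random extraction padding can be sketched as sampling, with replacement, from the observed values of the same variable. The function name and the optional `rng` parameter are illustrative assumptions; missing values are again represented by `None`.

```python
import random
from typing import List, Optional


def random_extraction_fill(
    values: List[Optional[float]], rng: Optional[random.Random] = None
) -> List[float]:
    """Replace each missing entry with a value drawn at random from the non-missing entries."""
    rng = rng or random.Random()
    observed = [value for value in values if value is not None]
    if not observed:
        raise ValueError("no non-missing values to draw from")
    return [rng.choice(observed) if value is None else value for value in values]
```

Passing a seeded `random.Random` makes the filling reproducible, which can be useful when the filled data is later used for model training.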
  • In step 350, it is determined whether the non-missing values are significantly correlated with the target variable. If so, step 360 is performed; if not, step 370 is performed.
  • In step 360, a left-skewed or right-skewed BETA distribution is constructed according to the direction of correlation and the degree of difference, and the BETA distribution is used to fill the missing values of the sample data corresponding to the independent variable.
  • The degree of difference refers to the degree of difference between the target variable corresponding to the missing values and the target variable corresponding to the non-missing values, and it may be determined by analysis of variance. For example, the variance is calculated separately for the overdue probabilities of the users who have salary information and for the overdue probabilities of the users who have no salary information, and the degree of difference is determined from the results of the variance calculation.
  • A left-skewed or right-skewed distribution can be formed by adjusting the parameters α and β of the BETA distribution, that is, a left-skewed or right-skewed BETA distribution is constructed within the value range of the non-missing values.
  • The skew of the BETA distribution is related to the missing-value part and the non-missing part of the variable and to the target variable.
  • The skewness of the BETA distribution is determined by the correlation between the missing-value part and the target variable. For example, the greater the degree of correlation, the greater the left or right skewness of the BETA distribution, and the higher the probability that a randomly generated value used to fill a missing value is an extreme value. An extreme value can be understood as the maximum or minimum value, or a value within a data range containing the maximum or minimum value.
  • The mean of the BETA distribution is $\mathrm{AVG} = \dfrac{\alpha}{\alpha + \beta}$.
  • The variance of the BETA distribution is $\mathrm{VAR} = \dfrac{\alpha\beta}{(\alpha + \beta)^{2}\,(\alpha + \beta + 1)}$, from which the following is derived (where r is the intermediate variable):
  • α and β are parameters that jointly determine the shape of the BETA distribution.
  • When β > α, the probability that the missing value takes a small value is large, that is, the distribution is right-skewed; when β < α, the probability that the missing value takes a large value is large.
  • FIG. 3B is a graph of BETA distributions corresponding to different values of the parameters α and β provided by the present embodiment.
  • The correlation ρ between the non-missing values and the target variable, together with P50, MAX and MIN of the non-missing values, jointly determines α and β in the estimated value distribution corresponding to the missing values, and thus determines the shape of the BETA distribution.
  • A new average value New_AVG is constructed from P50, MAX and MIN of the non-missing values, and α and β are calculated from New_AVG and the VAR of the non-missing part, wherein New_AVG is calculated as follows:
  • New_AVG (MAX-P50)*
  • New_AVG P50-(P50-MIN)*
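How α and β might be obtained from a mean and a variance, and then used for filling, can be illustrated with the standard method-of-moments inversion of the BETA mean and variance: with r = AVG(1 - AVG)/VAR - 1, α = AVG·r and β = (1 - AVG)·r. This is a sketch under stated assumptions, not necessarily the derivation used in the disclosure: New_AVG is taken as an input because its weighting factor is not reproduced here, and the rescaling of values to the [MIN, MAX] range of the non-missing values is an assumption.

```python
import random
from typing import List, Optional, Tuple


def beta_params(mean: float, var: float) -> Tuple[float, float]:
    """Method-of-moments alpha and beta for a BETA distribution on [0, 1]."""
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    if not 0.0 < var < mean * (1.0 - mean):
        raise ValueError("variance must lie in (0, mean * (1 - mean))")
    r = mean * (1.0 - mean) / var - 1.0  # r equals alpha + beta
    return mean * r, (1.0 - mean) * r


def beta_fill(
    values: List[Optional[float]],
    new_avg: float,
    var: float,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Fill None entries with BETA draws rescaled to the range of the non-missing values."""
    rng = rng or random.Random()
    observed = [value for value in values if value is not None]
    low, high = min(observed), max(observed)
    span = high - low
    if span <= 0:
        raise ValueError("non-missing values must not all be equal")
    # Assumption: map the target mean and variance onto [0, 1] before solving.
    alpha, beta = beta_params((new_avg - low) / span, var / span ** 2)
    return [
        low + span * rng.betavariate(alpha, beta) if value is None else value
        for value in values
    ]
```

For instance, a mean of 0.25 and a variance of 0.0375 give α = 1 and β = 3, a right-skewed shape in which small values are more likely.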
  • In step 370, the missing values of the sample data corresponding to the independent variable are filled by label grouping.
  • The embodiment provides a data filling method that improves the efficiency of filling missing data values and helps ensure the validity of the filled data. When the filled data is used for modeling or machine learning, for example when a machine learning model calculates a user's credit overdue probability, the accuracy of the calculated overdue probability can be improved, so that the user can be provided with a well-matched service.
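The branching of steps 330 to 370 for the (5%, 70%] interval can be summarized in a short sketch. The function name, parameter names, and returned labels are hypothetical; the two boolean inputs stand for the outcomes of the significant-difference test and the Spearman correlation test described above.

```python
def choose_mid_interval_fill(target_differs: bool, correlated: bool) -> str:
    """Select the filling method for a missing rate in the (5%, 70%] interval."""
    if not target_differs:
        # Step 340: no significant difference in the target variable.
        return "random_extraction"
    if correlated:
        # Step 360: non-missing values significantly correlated with the target.
        return "beta_distribution"
    # Step 370: significant difference but no significant correlation.
    return "label_group"
```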
  • FIG. 4 is a flowchart of another data padding method provided by this embodiment. As shown in FIG. 4, the method provided in this embodiment may include the following steps:
  • step 410 sample data and an objective function are obtained.
  • The sample data includes data corresponding to at least one parameter among salary income, working time, and repayment history; the objective function takes the at least one parameter as an independent variable, and the output target variable of the objective function is a user's overdue probability.
  • step 420 the sample data is traversed according to the independent variable included in the objective function to obtain a traversal result; and according to the traversal result, a data missing rate corresponding to the independent variable is calculated.
  • In step 430, when the data missing rate is less than or equal to 5%, it is determined whether the non-missing values in the sample data are significantly correlated with the target variable. If not, step 440 is performed; if so, step 450 is performed.
  • In step 440, the missing values of the sample data corresponding to the independent variable are filled by mean padding.
  • the mean padding refers to calculating the mean value of the non-missing part of the variable, and filling the mean value into the missing value part.
  • the mean can also be replaced by a median or a mode.
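Mean padding, with the median or mode as alternatives, can be sketched with the standard `statistics` module. The function name and the `method` parameter are illustrative assumptions; missing values are represented by `None`.

```python
import statistics
from typing import List, Optional


def central_fill(values: List[Optional[float]], method: str = "mean") -> List[float]:
    """Fill None entries with the mean, median, or mode of the non-missing entries."""
    functions = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }
    if method not in functions:
        raise ValueError("method must be 'mean', 'median', or 'mode'")
    observed = [value for value in values if value is not None]
    if not observed:
        raise ValueError("no non-missing values available")
    fill_value = functions[method](observed)
    return [fill_value if value is None else value for value in values]
```

The median is less sensitive to outliers than the mean, which may matter for skewed variables such as income.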
  • In step 450, the missing values of the sample data corresponding to the independent variable are filled by logistic regression.
  • This embodiment provides a data padding method, which improves the filling efficiency of data missing values, so that the filled data is more accurate when performing modeling or machine learning calculations.
  • The method further includes: calculating weight values of the variables in the original data, and determining, according to the weight values and the filled data, a trust index for the results of subsequent calculations based on the data whose missing values have been filled.
  • The filled missing values are recorded accordingly, so that after a subsequent computing model produces a result from the filled data, a trust index can be given for that result.
  • a logistic regression model has seven independent variables X1-X7, and the weight value (% of importance) of each independent variable can be estimated indirectly by Wald statistic (Wald ChiSq).
  • the trust index may be the sum of the weight values of the individual independent variables that are not missing, and the statistical process and statistical results are shown in Table 2:
  • whether the data is discarded may be determined according to the level of the trust index.
  • For example, machine learning is performed only on padded data with a trust index greater than 60%, which improves learning efficiency while achieving better learning outcomes.
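The trust index can be sketched as the share of total variable weight carried by the variables that were not filled. The weights below are made-up stand-ins for importance values derived from Wald statistics, and the function names and the normalization by the total weight are assumptions; only the 60% cut-off comes from the text.

```python
from typing import Dict, Iterable


def trust_index(weights: Dict[str, float], filled_variables: Iterable[str]) -> float:
    """Sum the weights of the non-missing variables, normalized by the total weight."""
    filled = set(filled_variables)
    unknown = filled - set(weights)
    if unknown:
        raise ValueError(f"unknown variables: {sorted(unknown)}")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w for name, w in weights.items() if name not in filled) / total


def keep_for_training(index: float, threshold: float = 0.60) -> bool:
    """Keep a record only when its trust index is greater than the threshold."""
    return index > threshold


# Hypothetical importance weights (in percent) for the seven independent variables X1..X7.
weights = {"X1": 30.0, "X2": 20.0, "X3": 15.0, "X4": 15.0, "X5": 10.0, "X6": 5.0, "X7": 5.0}
index = trust_index(weights, ["X2", "X6"])  # X2 and X6 were filled for this record
```

With these illustrative weights, a record whose X2 and X6 values were filled has a trust index of 0.75 and would be kept under the 60% rule.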
  • FIG. 5 is a structural block diagram of a data missing value padding apparatus according to the embodiment.
  • the device can perform the data padding method provided by the foregoing embodiment, and has the corresponding functional modules and beneficial effects of the execution method.
  • the apparatus may specifically include: an obtaining module 501, a missing rate calculating module 502, and a data filling module 503.
  • The obtaining module 501 is configured to acquire sample data and an objective function, wherein the sample data includes data corresponding to at least one parameter among salary income, working time, and repayment record, the objective function takes the at least one parameter as an independent variable, and the output target variable of the objective function is the user's overdue probability.
  • the missing rate calculation module 502 is configured to traverse the sample data according to the independent variable included in the objective function to obtain a traversal result; and calculate a data missing rate corresponding to the independent variable according to the traversal result.
  • The data padding module 503 is configured to adopt a corresponding data padding method according to the missing rate interval to which the data missing rate belongs, and to fill the missing values of the sample data corresponding to the independent variable, wherein different missing rate intervals correspond to different data padding methods.
  • The data padding methods include at least two of label group padding, BETA distribution padding, random extraction padding, logistic regression padding, and mean padding.
  • In this embodiment, the original data containing missing values and the objective function are acquired, the data missing rate of the original data is determined, and a corresponding data padding method is used to fill the missing data values according to the size of the data missing rate.
  • The data padding methods include at least one of label group padding, BETA distribution padding, random extraction padding, logistic regression padding, and mean padding. This improves the efficiency of filling missing data values and helps ensure the validity of the filled data.
  • When the filled data is used for modeling or machine learning, for example when a machine learning model calculates a user's credit overdue probability, the accuracy of the calculated overdue probability can be improved, so that the user can be provided with a well-matched service.
  • the data padding module 503 is configured to: if the data missing rate is greater than 70% and less than 99%, perform data missing value padding by using a label grouping padding manner.
  • The data padding module 503 is configured to: if the data missing rate is greater than 5% and less than or equal to 70%, determine whether there is a significant difference between the target variable corresponding to the missing values in the sample data and the target variable corresponding to the non-missing values; and when there is no significant difference, randomly extract data from the non-missing values to fill the missing values of the sample data corresponding to the independent variable.
  • The data padding module 503 is further configured to: when there is a significant difference between the target variable corresponding to the missing values in the sample data and the target variable corresponding to the non-missing values, determine whether the non-missing values are significantly correlated with the target variable; when the non-missing values are significantly correlated with the target variable, construct a left-skewed or right-skewed BETA distribution according to the direction of correlation and the degree of difference, and use the BETA distribution to fill the missing values of the sample data corresponding to the independent variable; and when the non-missing values are not significantly correlated with the target variable, fill the missing values of the sample data corresponding to the independent variable by label group padding.
  • The data padding module 503 is configured to: if the data missing rate is less than 5%, determine whether the non-missing values in the sample data are significantly correlated with the target variable; if they are not significantly correlated, fill the missing values of the sample data corresponding to the independent variable by mean padding; and if they are significantly correlated, fill the missing values of the sample data corresponding to the independent variable by logistic regression.
  • the device may further include a padding result evaluation module, configured to adopt a corresponding data padding manner according to the missing rate interval to which the data missing rate belongs, and perform missing value on the sample data corresponding to the independent variable.
  • the embodiment further provides a storage medium comprising computer executable instructions for performing a data padding method when executed by a computer processor, the method comprising:
  • acquiring sample data and an objective function, wherein the sample data includes data corresponding to at least one parameter among salary income, working time, and repayment record, the objective function takes the at least one parameter as an independent variable, and the output target variable of the objective function is the user's overdue probability;
  • The data padding methods include at least two of label group padding, BETA distribution padding, random extraction padding, logistic regression padding, and mean padding.
  • the computer-executable instructions can also perform any of the data padding methods provided by the foregoing embodiments when executed by the computer processor.
  • For details, reference may be made to the flow of the method provided by the foregoing embodiments.
  • The present embodiment further provides a data processing device, which may be a filling device. FIG. 6 is a hardware structure diagram of the data processing device provided by this embodiment. The data processing device may include a processor 610 and a memory 620, and may further include a communication interface 630 and a bus 640.
  • the processor 610, the memory 620, and the communication interface 630 can complete communication with each other through the bus 640.
  • Communication interface 630 can be used for information transmission.
  • Processor 610 can invoke logic instructions in memory 620 to perform any of the methods of the above-described embodiments.
  • the memory 620 can include a storage program area and a storage data area, and the storage program area can store an operating system and an application required for at least one function.
  • the storage data area can store data and the like created according to the use of the data processing device.
  • the memory may include volatile memory, such as random access memory, and may also include non-volatile memory, such as at least one disk storage device, flash memory device, or other non-transitory solid-state storage device.
  • when the logic instructions in the memory 620 described above are implemented in the form of software functional units and sold or used as separate products, the logic instructions can be stored in a computer-readable storage medium.
  • the technical solution of the present disclosure may be embodied in the form of a computer software product, which may be stored in a storage medium and includes a plurality of instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods of the embodiments.
  • all or part of the processes in the foregoing embodiments may be completed by a computer program instructing related hardware; the program may be stored in a non-transitory computer-readable storage medium and, when executed, may include the flows of the embodiments of the methods described above.
  • the above storage medium may be any of a plurality of types of memory devices or storage devices, and may include: installation media, such as a CD-ROM, a floppy disk, or a tape device; computer system memory or random access memory, such as DRAM, DDR RAM, SRAM, EDO RAM, Rambus RAM, etc.; non-volatile memory, such as flash memory, magnetic media (such as a hard disk or optical storage); registers or similar types of memory components, and the like.
  • the storage medium may also include multiple types of memory or a combination of memories. Additionally, the storage medium may be located in a first computer system in which the program is executed, or may be located in a different second computer system, the second computer system being coupled to the first computer system via a network, such as the Internet.
  • the second computer system can provide program instructions to the first computer for execution.
  • the storage medium may also include two or more storage media that can reside in different locations (e.g., in different computer systems connected through a network).
  • a storage medium may store program instructions, such as a computer program, executable by one or more processors.
  • the present disclosure provides a data padding method and apparatus that can improve the efficiency of filling missing data values and can ensure the validity of the filled data, so that when the filled data is used for modeling or machine learning calculations, for example when a machine learning model calculates a user's credit overdue probability, the accuracy of the overdue probability calculation result can be improved, thereby providing the user with a better-matched service.

Abstract

A data filling method and device. The method may comprise: acquiring sample data and an objective function, wherein the sample data comprises data corresponding to at least one parameter of salary income, working time and repayment records, the objective function takes the at least one parameter as an independent variable, and the output target variable of the objective function is a user's probability of being overdue; traversing the sample data according to the independent variable included in the objective function to obtain a traversal result; calculating, according to the traversal result, a data missing rate corresponding to the independent variable; and filling in, according to a missing rate interval that the data missing rate belongs to, a missing value in the sample data corresponding to the independent variable by using the corresponding data filling method, wherein different missing rate intervals correspond to different data filling methods, and the data filling methods comprise at least two of tag grouping filling, BETA distribution filling, random drawing filling, logistic regression filling and mean filling.

Description

Data filling method and device
Technical field
The present disclosure relates to the field of data processing technologies, for example, to a data padding method and apparatus.
Background
In a big data environment, because data sources and data generation methods are diverse, values may be missing from the data in many application scenarios, and the missing data may carry useful or critical information. If the missing values are not handled properly, data with missing values may affect the construction of subsequent models, such as logistic regression and neural network models, and reduce the training effect of the machine learning model.
In the field of e-commerce, a user's credit is usually evaluated by using a machine learning model to calculate the user's overdue probability. If the user sample data used for training has missing values, the trained machine learning model may be unable to calculate the user's overdue probability accurately, which makes it impossible to provide the user with a well-matched service, such as adjusting the user's credit limit. In the related art, missing values are usually filled manually, which involves a heavy workload and low efficiency, depends on human experience, and cannot guarantee the validity of the filled data.
Summary
The present disclosure provides a data padding method and apparatus that can improve the efficiency of data padding. This embodiment provides a data padding method, which may include:
acquiring sample data and an objective function, wherein the sample data includes data corresponding to at least one parameter among salary income, working time, and repayment records, the objective function takes the at least one parameter as an independent variable, and the output target variable of the objective function is the user's overdue probability;
traversing the sample data according to the independent variable included in the objective function to obtain a traversal result;
calculating, according to the traversal result, a data missing rate corresponding to the independent variable; and
filling, according to the missing rate interval to which the data missing rate belongs, the missing values in the sample data corresponding to the independent variable by using the corresponding data padding method, wherein different missing rate intervals correspond to different data padding methods, and the data padding methods include at least two of label grouping padding, BETA distribution padding, random extraction padding, logistic regression padding, and mean padding.
This embodiment further provides a data padding device, which may include:
an acquisition module, configured to acquire sample data and an objective function, wherein the sample data includes data corresponding to at least one parameter among salary income, working time, and repayment records, the objective function takes the at least one parameter as an independent variable, and the output target variable of the objective function is the user's overdue probability;
a missing rate calculation module, configured to traverse the sample data according to the independent variable included in the objective function to obtain a traversal result, and to calculate, according to the traversal result, a data missing rate corresponding to the independent variable; and
a data padding module, configured to fill, according to the missing rate interval to which the data missing rate belongs, the missing values in the sample data corresponding to the independent variable by using the corresponding data padding method, wherein different missing rate intervals correspond to different data padding methods, and the data padding methods include at least two of label grouping padding, BETA distribution padding, random extraction padding, logistic regression padding, and mean padding.
This embodiment further provides a computer-readable storage medium storing computer-executable instructions for performing any of the above methods.
This embodiment further provides a data processing device including one or more processors, a memory, and one or more programs, the one or more programs being stored in the memory and, when executed by the one or more processors, performing any of the above methods.
This embodiment further provides a computer program product including a computer program stored on a non-transitory computer-readable storage medium, the computer program including program instructions that, when executed by a computer, cause the computer to perform any of the above methods.
This embodiment can improve the efficiency of filling missing data values and can ensure the validity of the filled data, so that when the filled data is used for modeling or machine learning calculations, for example when a machine learning model calculates a user's credit overdue probability, the accuracy of the overdue probability calculation result can be improved, thereby providing the user with a better-matched service.
Brief description of the drawings
FIG. 1 is a flowchart of a data padding method according to an embodiment;
FIG. 2 is a flowchart of another data padding method according to an embodiment;
FIG. 3A is a flowchart of another data padding method according to an embodiment;
FIG. 3B is a graph of the BETA distribution curves corresponding to different values of the parameters α and β according to an embodiment;
FIG. 4 is a flowchart of another data padding method according to an embodiment;
FIG. 5 is a structural block diagram of a data missing value padding device according to an embodiment;
FIG. 6 is a schematic diagram of the hardware structure of a data processing device according to an embodiment.
Detailed description
FIG. 1 is a flowchart of a data padding method provided by this embodiment. This embodiment is applicable to filling in missing data. The method may be performed by a computing device such as a computer, for example by a data padding device, which may be implemented in at least one of software and hardware. As shown in FIG. 1, the method provided by this embodiment may include the following steps:
In step 110, sample data and an objective function are acquired, wherein the sample data includes data corresponding to at least one parameter among salary income, working time, and repayment records, the objective function takes the at least one parameter as an independent variable, and the output target variable of the objective function is the user's overdue probability.
Here, the sample data may also be called raw data. The objective function may include a logistic regression model function, a neural network model function, and the like. The target variable output by the objective function may be the user's probability of overdue repayment, referred to as the overdue probability. The raw data may be sample data used to predict a user's overdue probability; for example, the sample data may include information such as the user's salary income, years of employment, and repayment records, and the sample data may be referred to as independent variables. Missing data may be referred to as missing values; a missing value denotes the content of the part of the acquired raw data (such as big data) that is missing. Missing values in the raw data may bias the model built with the corresponding objective function and lead to unsatisfactory training results.
Missing values may arise from mechanical causes (such as data loss during data collection or storage) or human causes (such as subjective errors by staff or historical limitations). According to their distribution, missing values can be classified as missing completely at random (the missingness is random and does not depend on any incomplete or complete variable), missing at random (the missingness is not completely random; it depends on other complete variables), and missing not at random (the missingness depends on the incomplete variable itself). According to their attributes, missing values can be classified as single-value missing (the missing values share the same attribute) and arbitrary missing (the missing values have different attributes).
In step 120, the sample data is traversed according to the independent variable included in the objective function to obtain a traversal result, and the data missing rate corresponding to the independent variable is calculated according to the traversal result.
The data missing rate of the raw data can be determined by a program. For example, if the logistic regression model function includes 7 independent variables, each containing multiple data items, the program reads the items one by one; an empty return value indicates that the item is missing, and the count of missing items is incremented by 1. After all items have been traversed, the data missing rate of the raw data can be computed.
For example, if the sample data includes information on 100 users, of whom 70 have salary information and the remaining 30 do not, the data missing rate of the independent variable "salary information" is 30%, and the salary information of these 30 users needs to be filled in.
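The traversal and counting described above can be sketched in a few lines of Python. This is only an illustration: the record layout (a list of dictionaries), the use of `None` to mark an empty value, and the function name `missing_rates` are assumptions, not part of the disclosure.

```python
def missing_rates(samples, variables):
    """Return the fraction of missing (None) entries for each independent variable."""
    rates = {}
    for var in variables:
        missing = 0
        for record in samples:           # traverse every record for this variable
            if record.get(var) is None:  # an empty value marks a missing item
                missing += 1
        rates[var] = missing / len(samples) if samples else 0.0
    return rates


# Hypothetical data: 100 users, 30 of whom have no salary information.
users = [{"salary": 5000.0} for _ in range(70)] + [{"salary": None} for _ in range(30)]
print(missing_rates(users, ["salary"]))
```

With the 100-user example from the text, the salary variable yields a missing rate of 0.3, that is, 30%.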
In step 130, according to the missing rate interval to which the data missing rate belongs, the missing values in the sample data corresponding to the independent variable are filled by using the corresponding data padding method, wherein different missing rate intervals correspond to different data padding methods, and the data padding methods include at least two of label grouping padding, BETA distribution padding, random extraction padding, logistic regression padding, and mean padding.
The missing data can be filled automatically by the data padding method that corresponds to the data missing rate determined in step 120. The objective function may be a function for calculating the user's overdue probability. According to the variables involved in the objective function, such as the user's salary income, years of employment, and repayment records, the raw data containing the user information is traversed; the data missing rate of each variable is calculated from the traversal result for that variable; and the missing sample data of that variable is filled by the data padding method corresponding to its data missing rate, to ensure the integrity of the sample data.
Optionally, when the data missing rate is very high, for example 99% or more, a data anomaly alarm may be issued, whose content may be "manual inspection recommended", or this part of the raw data may be discarded directly. When the data missing rate is low, that is, most of the data is complete and only a small part is missing, for example a data missing rate below 5%, logistic regression padding may be used. When the data missing rate is in the interval (70%, 99%], label grouping padding may be used; when the missing rate is in the interval (5%, 70%], BETA distribution padding may be used.
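The interval-based selection above amounts to a small dispatch function. The sketch below is an illustration under stated assumptions: the returned method names are hypothetical labels, and the handling of the exact boundaries (99% falls into label grouping, 5% falls into the low-rate branch) is a choice made here because the text leaves those points ambiguous.

```python
def choose_padding_method(missing_rate):
    """Map a data missing rate in [0, 1] to a padding strategy label."""
    if not 0.0 <= missing_rate <= 1.0:
        raise ValueError("missing_rate must be between 0 and 1")
    if missing_rate > 0.99:
        return "alert_or_discard"      # raise an alarm or drop the variable
    if missing_rate > 0.70:
        return "label_grouping"        # interval (70%, 99%]
    if missing_rate > 0.05:
        return "beta_distribution"     # interval (5%, 70%]
    return "logistic_regression"       # low missing rate (boundary at 5% assumed)


for rate in (0.02, 0.30, 0.85, 0.995):
    print(rate, choose_padding_method(rate))
```

A fuller version would refine the middle and low branches with the correlation checks described in the later embodiments; this sketch only covers the coarse interval lookup.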
In this embodiment, the raw data is reasonably retained, which avoids the reduction in data volume that results when records are deleted entirely because one or some of their variables are missing. Different data padding methods are adopted for different data missing rates, so that the information and attributes originally carried by the missing-value part are retained while damage to the distribution and attributes of the part without missing values is reduced. Data padding can be performed automatically, which improves padding efficiency and reduces the manual burden.
In the related art, missing data values may be handled by deleting data records, mean padding, or manual padding. Deleting data records seriously degrades the overall training effect of the model when the sample size is small and the data available for training is insufficient. Mean padding, when the data missing rate is high, seriously distorts the distribution of the original non-missing values, causing the distribution to concentrate at a single point; moreover, for non-random missingness, the information carried by the missing values is hidden after padding. The drawback of manual padding is that, in a big data environment with large data volumes, it involves a heavy workload and low efficiency and depends heavily on human experience, so it is not suitable for machine learning environments.
This embodiment provides a data padding method that acquires raw data with missing values and an objective function, determines the data missing rate of the sample data, and fills the missing values by a data padding method chosen according to the size of the data missing rate, where the data padding methods include at least one of label grouping padding, BETA distribution padding, random extraction padding, logistic regression padding, and mean padding. This improves the efficiency of filling missing values and can ensure the validity of the filled data, so that when the filled data is used for modeling or machine learning calculations, for example when a machine learning model calculates a user's credit overdue probability, the accuracy of the overdue probability calculation result can be improved, thereby providing the user with a better-matched service.
FIG. 2 is a flowchart of another data padding method provided by this embodiment. As shown in FIG. 2, the method may include the following steps:
In step 210, sample data and an objective function are acquired.
Here, the sample data includes data corresponding to at least one parameter among salary income, working time, and repayment records, the objective function takes the at least one parameter as an independent variable, and the output target variable of the objective function is the user's overdue probability.
In step 220, the sample data is traversed according to the independent variable included in the objective function to obtain a traversal result, and the data missing rate corresponding to the independent variable is calculated according to the traversal result.
In step 230, when the data missing rate is greater than 70% and less than 99%, the missing values in the sample data corresponding to the independent variable are filled by label grouping padding.
A data missing rate greater than 70% and less than 99% indicates severely missing data. When data is severely missing, label grouping padding can improve the padding efficiency.
For example, the data can be label-padded by marking two groups (1/0), as shown in Table 1:
Table 1
User ID   X1    X11
001       .     1
002       0.9   0
003       0.8   0
004       .     1
For variable X1, the data of users 001 and 004 is missing, so a corresponding dummy variable (X11) can be added, in which users 001 and 004 are assigned the value 1. The X1 values of users 002 and 003 are not missing, so users 002 and 003 are both assigned 0 in X11, which completes the padding of the missing values. Optionally, variables with a very high missing rate (for example, greater than 99%) may be deleted directly.
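The dummy-variable construction from Table 1 can be reproduced with a short Python sketch. The dictionary-based record layout, the use of `None` for a missing entry, and the function name `add_missing_indicator` are assumptions made for illustration.

```python
def add_missing_indicator(records, var, indicator):
    """Return copies of the records with `indicator` set to 1 where `var` is missing, else 0."""
    return [{**r, indicator: 1 if r.get(var) is None else 0} for r in records]


# The four users of Table 1; None stands for a missing X1 value.
table1 = [
    {"user": "001", "X1": None},
    {"user": "002", "X1": 0.9},
    {"user": "003", "X1": 0.8},
    {"user": "004", "X1": None},
]

for row in add_missing_indicator(table1, "X1", "X11"):
    print(row["user"], row["X1"], row["X11"])
```

The original X1 column is left untouched; only the 1/0 indicator X11 is added, so the fact that a value was missing remains available to the model.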
In the data padding method provided by this embodiment, if the data missing rate is greater than 70% and less than 99%, label grouping padding is used to fill the missing values; that is, label grouping padding is used when the data missing rate is high, which improves data padding efficiency.
FIG. 3A is a flowchart of another data padding method provided by this embodiment. As shown in FIG. 3A, the method provided by this embodiment may include the following steps:
In step 310, sample data and an objective function are acquired.
Here, the sample data includes data corresponding to at least one parameter among salary income, working time, and repayment records, the objective function takes the at least one parameter as an independent variable, and the output target variable of the objective function is the user's overdue probability.
In step 320, the sample data is traversed according to the independent variable included in the objective function to obtain a traversal result, and the data missing rate corresponding to the independent variable is calculated according to the traversal result.
In step 330, when the data missing rate is greater than 5% and less than or equal to 70%, it is determined whether the target variable corresponding to the missing values in the sample data differs significantly from the target variable corresponding to the non-missing values; if not, step 340 is performed, and if so, step 350 is performed.
Correlation refers to the monotonic relationship between a variable and the target variable, and can be judged with the Spearman correlation function. Spearman correlation is a non-parametric statistical method that does not depend on the distribution of the variables; that is, whether or not the non-missing values are normally distributed, the degree and direction of their association with the target variable can be obtained. Spearman's rank correlation coefficient, referred to as the Spearman coefficient, is calculated from the degree of monotonic correlation between the variable and the target variable. The Spearman coefficient reflects how strongly the non-missing values (the variable above) are correlated with the target variable: the closer it is to 1 or -1, the stronger the correlation, where a positive coefficient indicates positive correlation and a negative coefficient indicates negative correlation.
A threshold range can be set for the Spearman coefficient. If the Spearman coefficient between the variable and the target variable falls within the set threshold range, the correlation is significant; otherwise, the correlation is not significant.
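The Spearman check described above can be sketched with the standard library only: rank both series (averaging the ranks of ties) and take the Pearson correlation of the ranks. The threshold value 0.5 in `is_significant` is purely an illustrative assumption; the text only says that a threshold range can be set.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1          # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sx * sy)


def is_significant(rho, threshold=0.5):
    """Hypothetical rule: treat |rho| at or above the threshold as significant."""
    return abs(rho) >= threshold


salary = [3000, 4200, 5100, 8000]       # non-missing values of one variable
overdue = [0.40, 0.31, 0.22, 0.05]      # target variable for the same users
rho = spearman(salary, overdue)
print(rho, is_significant(rho))
```

In this made-up example the overdue probability falls strictly as salary rises, so the coefficient is close to -1 and the relationship counts as significant under the assumed threshold.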
In step 340, data is randomly drawn from the non-missing values to fill the missing values in the sample data corresponding to the independent variable.
When it is determined that the non-missing values in the raw data are not significantly correlated with the target variable, the missing values are filled with data randomly drawn from the non-missing values.
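Random extraction padding can be sketched as drawing, with replacement, from the observed values of the same variable. The function name `random_draw_pad`, the `None` marker for missing entries, and the optional seeded generator are assumptions for illustration.

```python
import random


def random_draw_pad(values, rng=None):
    """Fill each missing entry (None) with a value drawn at random from the observed entries."""
    rng = rng or random.Random()
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to draw from")
    return [rng.choice(observed) if v is None else v for v in values]


work_years = [3, None, 7, 5, None, 2]
print(random_draw_pad(work_years, random.Random(0)))
```

Because the fills are drawn from the observed values themselves, the filled column keeps roughly the same empirical distribution as the non-missing part, which mean padding does not.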
In step 350, it is determined whether the non-missing values are significantly correlated with the target variable; if so, step 360 is performed, and if not, step 370 is performed.
If the non-missing values in the raw data are significantly correlated with the target variable, it is determined whether the non-missing values are significantly correlated with the dependent variable. A univariate regression model can be built from the non-missing values and the target variable, for example Y = β0 + β1X, where Y denotes the target variable and X denotes the non-missing values. The values of β0 and β1 can be calculated from this formula: if β1 is 0, the non-missing values are unrelated to the dependent variable; if β1 is not 0, the non-missing values are related to the target variable.
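The univariate model Y = β0 + β1X can be fitted by ordinary least squares, as in the following sketch. The choice of least squares and the function name `fit_univariate` are assumptions; the text only states that β0 and β1 are computed from the formula.

```python
def fit_univariate(x, y):
    """Least-squares estimates (beta0, beta1) for the model Y = beta0 + beta1 * X."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("X is constant, so the slope is undefined")
    beta1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    beta0 = my - beta1 * mx
    return beta0, beta1


x = [1.0, 2.0, 3.0, 4.0]   # non-missing values of the variable
y = [3.0, 5.0, 7.0, 9.0]   # target variable for the same users
beta0, beta1 = fit_univariate(x, y)
print(beta0, beta1)
```

On real data an estimated slope is rarely exactly 0, so a practical check would compare β1 against a tolerance or use a significance test; that refinement is not spelled out in the text.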
In step 360, a left-skewed or right-skewed BETA distribution is constructed according to the direction of correlation and the degree of difference, and the BETA distribution is used to fill the missing values in the sample data corresponding to the independent variable.
Here, the degree of difference refers to the degree of difference between the target variable corresponding to the missing values and the target variable corresponding to the non-missing values, and can be judged by analysis of variance. For example, variances are computed separately for the overdue probabilities of the users who have salary information and of the users who have no salary information, and the degree of difference is judged from the results.
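One way to quantify the degree of difference by analysis of variance is the one-way ANOVA F statistic for the two groups (users with and without the variable). The sketch below assumes this reading; the function name `anova_f` and the sample figures are hypothetical, and turning F into a significance decision would additionally need an F-distribution critical value, which is omitted here.

```python
def anova_f(group_a, group_b):
    """One-way ANOVA F statistic for two groups of target-variable values."""
    groups = [group_a, group_b]
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum(sum((v - m) ** 2 for v in g) for g, m in zip(groups, means))
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    if df_within <= 0 or ss_within == 0:
        raise ValueError("not enough within-group variation to compute F")
    return (ss_between / df_between) / (ss_within / df_within)


with_salary = [0.10, 0.20, 0.30]      # overdue probabilities, salary present
without_salary = [0.20, 0.30, 0.40]   # overdue probabilities, salary missing
print(anova_f(with_salary, without_salary))
```

A larger F means the two groups' overdue probabilities differ more relative to the spread inside each group.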
For example, a left-skewed or right-skewed distribution can be formed by adjusting the parameters α and β of the BETA distribution, that is, a left-skewed or right-skewed BETA distribution is constructed within the value range of the non-missing values.
Optionally, in the extreme part of the non-missing value distribution, missing values may be filled by random dispersion. The extreme part can be understood as the data range containing the maximum or minimum of the non-missing values.
The direction of skew of the BETA distribution is related to the missing-value part of the variable, the non-missing part, and the target variable. The degree of skewness is determined by the correlation between the missing-value part and the target variable: for example, the stronger the correlation, the greater the left or right skew of the BETA distribution, and the more likely it is that a randomly generated fill value is an extreme value, where an extreme value can be understood as the maximum or minimum value, or a value within a data range containing the maximum or minimum.
For example, the mean of the BETA distribution is AVG = α/(α+β) and its variance is VAR = α*β/((α+β)^2*(α+β+1)), from which the following can be derived (where r is an intermediate variable):
r = AVG*(1-AVG)/VAR - 1
α = AVG*r
β = (1-AVG)*r
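The derivation above can be checked numerically. The sketch below assumes that AVG and VAR are expressed on the [0, 1] scale of a standard Beta distribution (the text does not say how data in other units would be rescaled); the function name `beta_params` and the example numbers are illustrative only.

```python
import random


def beta_params(avg, var):
    """Recover alpha and beta of a Beta distribution on [0, 1] from its mean and variance."""
    if not 0.0 < avg < 1.0:
        raise ValueError("avg must lie strictly between 0 and 1")
    if not 0.0 < var < avg * (1.0 - avg):
        raise ValueError("var must be positive and below avg * (1 - avg)")
    r = avg * (1.0 - avg) / var - 1.0  # intermediate variable r
    return avg * r, (1.0 - avg) * r


alpha, beta = beta_params(0.25, 0.0375)  # approximately alpha = 1, beta = 3

# Round trip: the recovered parameters reproduce the mean and variance.
mean_back = alpha / (alpha + beta)
var_back = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))

# One random fill candidate drawn from the fitted distribution.
sample = random.Random(0).betavariate(alpha, beta)
```

Here β > α, so small values are the more likely draws, which matches the right-skewed case in the text.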
That is, α and β jointly determine the shape of the BETA distribution. When β>α, the missing value is more likely to take a small value, that is, the distribution is right-skewed; when β<α, the missing value is more likely to take a large value, that is, the distribution is left-skewed. The shape of the BETA distribution therefore depends on AVG: when AVG lies between the minimum MIN and the median P50 of the non-missing values, the BETA distribution is more likely to take large values, that is, it is left-skewed; when AVG lies between the median P50 and the maximum MAX of the non-missing values, the BETA distribution is more likely to take small values, that is, it is right-skewed. By way of example, FIG. 3B shows the BETA distribution curves corresponding to different values of the parameters α and β in this embodiment.
In this embodiment, α and β of the estimated-value distribution for the missing values may be determined jointly from the correlation ρ among the missing values, the non-missing values and the target variable, together with P50, MAX and MIN of the non-missing values, which in turn fixes the shape of the BETA distribution. A new mean New_AVG is constructed from P50, MAX and MIN of the non-missing values, and α and β are then calculated from New_AVG and the variance VAR of the non-missing part. New_AVG is calculated as follows:
When the missing value is more likely to take a small value (that is, when the distribution is right-skewed):
New_AVG = (MAX-P50)*|ρ| + P50;
When the missing value is more likely to take a large value (that is, when the distribution is left-skewed):
New_AVG = P50 - (P50-MIN)*|ρ|.
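The two New_AVG formulas can be transcribed directly. In this sketch the function name `new_avg`, the boolean flag `right_skewed` and the sample values are hypothetical. The code only evaluates New_AVG; how New_AVG and VAR are brought onto the [0, 1] scale of the BETA distribution before α and β are computed is not spelled out in the text and is left out here.

```python
from statistics import median


def new_avg(non_missing, rho, right_skewed):
    """Construct New_AVG from P50, MAX and MIN of the non-missing values."""
    p50 = median(non_missing)
    if right_skewed:
        # Right-skewed case: New_AVG = (MAX - P50) * |rho| + P50
        return (max(non_missing) - p50) * abs(rho) + p50
    # Left-skewed case: New_AVG = P50 - (P50 - MIN) * |rho|
    return p50 - (p50 - min(non_missing)) * abs(rho)


values = [1.0, 2.0, 3.0, 4.0, 10.0]  # hypothetical non-missing values
right_case = new_avg(values, rho=-0.5, right_skewed=True)
left_case = new_avg(values, rho=-0.5, right_skewed=False)
```

Because only |ρ| enters the formulas, the sign of the correlation affects which branch is chosen but not the distance of New_AVG from P50.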
In step 370, the missing values in the sample data corresponding to the independent variable are filled by label group filling.
This embodiment provides a data filling method that improves the efficiency of filling missing data values and ensures the validity of the filled data, so that when the filled data is used for modeling or machine learning calculations, for example when a machine learning model calculates a user's credit overdue probability, the accuracy of the calculated overdue probability is improved and the user can be offered better-matched services.
FIG. 4 is a flowchart of another data filling method provided by this embodiment. As shown in FIG. 4, the method may include the following steps:
In step 410, sample data and an objective function are acquired.
The sample data includes data corresponding to at least one parameter among salary income, working time and repayment record; the objective function takes the at least one parameter as an independent variable, and the output target variable of the objective function is the user's overdue probability.
In step 420, the sample data is traversed according to the independent variable contained in the objective function to obtain a traversal result, and the data missing rate corresponding to the independent variable is calculated from the traversal result.
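The traversal in step 420 can be illustrated as follows. This is a sketch under stated assumptions: each sample is represented as a dictionary, a missing value is represented by `None` or an absent key, and the names `missing_rate`, `salary` and `work_years` are hypothetical.

```python
def missing_rate(samples, variable):
    """Fraction of samples in which `variable` is missing (None or absent)."""
    if not samples:
        raise ValueError("samples must not be empty")
    missing = sum(1 for row in samples if row.get(variable) is None)
    return missing / len(samples)


samples = [
    {"salary": 5000, "work_years": 3},
    {"salary": None, "work_years": 1},
    {"salary": 8000, "work_years": None},
    {"work_years": 7},  # salary key absent, counted as missing
]
salary_rate = missing_rate(samples, "salary")
work_rate = missing_rate(samples, "work_years")
```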
In step 430, when the data missing rate is less than or equal to 5%, it is determined whether the non-missing values in the sample data are significantly correlated with the target variable; if not, step 440 is performed, and if so, step 450 is performed.
In step 440, the missing values in the sample data corresponding to the independent variable are filled by mean filling.
Mean filling means calculating the mean of the non-missing part of the variable and filling that mean into the missing part. Optionally, the median or the mode may be used in place of the mean.
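Mean filling as just described can be sketched in a few lines. The function name `fill_with_center` and the use of `None` to mark missing values are assumptions made for illustration; passing `median` or `mode` in place of `mean` gives the optional variants.

```python
from statistics import mean, median


def fill_with_center(values, center=mean):
    """Replace each missing entry (None) with a central value of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no non-missing values to compute a fill value from")
    fill = center(observed)
    return [fill if v is None else v for v in values]


mean_filled = fill_with_center([1.0, None, 3.0])
median_filled = fill_with_center([1, None, 3, 100], center=median)
```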
In step 450, the missing values in the sample data corresponding to the independent variable are filled by logistic regression filling.
For example, a univariate regression model is built from the non-missing values and the target variable: a univariate logistic regression model is fitted with the non-missing values X and the target variable Y (log(P/(1-P)) in logistic regression), yielding β0 (Intercept) and β1 (Estimate). Then, from the target variable Y of the missing part (the mean of Y over the missing part) and the β0 and β1 obtained in the previous step, the estimate of the missing value is derived as X̂ = (Y-β0)/β1, and X̂ is filled in as the missing value.
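The inversion step of logistic regression filling can be sketched as below. The fitting of β0 and β1 is not shown; the coefficients in the example are hypothetical. The sketch also assumes that the mean of Y over the missing part is obtained as the log-odds of the mean overdue rate among the samples with missing values, which is one reading of the text rather than something it states.

```python
import math


def logit(p):
    """Log-odds log(P / (1 - P)) of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def regression_fill_value(beta0, beta1, overdue_rate_missing):
    """Solve Y = beta0 + beta1 * X for X at the missing part's mean target."""
    if beta1 == 0:
        raise ValueError("beta1 must be non-zero to invert the regression")
    y = logit(overdue_rate_missing)
    return (y - beta0) / beta1


# Hypothetical coefficients from a model fitted on the non-missing values.
fill_value = regression_fill_value(beta0=1.0, beta1=-0.5, overdue_rate_missing=0.5)
```

With a negative β1, a higher overdue rate in the missing part maps to a lower fill value, which is the behaviour the inversion is meant to capture.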
This embodiment provides a data filling method that improves the efficiency of filling missing data values, so that modeling or machine learning calculations performed on the filled data yield more accurate results.
On the basis of the foregoing, after the missing data values are filled with the filling method selected according to the data missing rate, the method further includes: calculating weight values of the variables in the original data and, according to the weight values and the filled data, determining a trust index for the results of subsequent calculations performed on the filled data.
While the missing values in the original data are being filled by the data filling method of the present disclosure, each filled missing value is recorded. After a downstream calculation model produces a prediction from the filled data, a trust index for that prediction can be given.
For example, suppose a logistic regression model has seven independent variables X1 to X7, where the weight value (percentage importance) of each independent variable can be estimated indirectly from the Wald statistic (Wald ChiSq). Optionally, the trust index may be the sum of the weight values of the independent variables that are not missing. The statistical process and results are shown in Table 2:
Table 2
Figure PCTCN2017106280-appb-000001
Optionally, before the filled data is fed into machine learning, whether to discard the data may be decided according to its trust index.
Optionally, only filled data with a trust index greater than 60% is used for machine learning, which improves learning efficiency and yields better learning results.
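The trust index can be sketched as follows. The text says only that weights are estimated indirectly from the Wald statistic, so normalising each variable's Wald chi-square by the total is an assumption, as are the function names and the numbers, which are not taken from Table 2.

```python
def variable_weights(wald_chisq):
    """Assumed weighting: each variable's share of the total Wald chi-square."""
    total = sum(wald_chisq.values())
    return {name: value / total for name, value in wald_chisq.items()}


def trust_index(weights, filled_variables):
    """Sum of the weights of the variables that were not missing (not filled)."""
    return sum(w for name, w in weights.items() if name not in filled_variables)


# Hypothetical Wald chi-square values for four independent variables.
weights = variable_weights({"X1": 40.0, "X2": 30.0, "X3": 20.0, "X4": 10.0})
index = trust_index(weights, filled_variables={"X3"})
keep_for_training = index > 0.60  # the optional 60% threshold
```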
FIG. 5 is a structural block diagram of a data missing value filling apparatus provided by this embodiment. The apparatus can execute the data filling method provided by the foregoing embodiments and has the functional modules and beneficial effects corresponding to that method. As shown in FIG. 5, the apparatus may include an acquisition module 501, a missing rate calculation module 502 and a data filling module 503.
The acquisition module 501 is configured to acquire sample data and an objective function, wherein the sample data includes data corresponding to at least one parameter among salary income, working time and repayment record, the objective function takes the at least one parameter as an independent variable, and the output target variable of the objective function is the user's overdue probability.
The missing rate calculation module 502 is configured to traverse the sample data according to the independent variable contained in the objective function to obtain a traversal result, and to calculate the data missing rate corresponding to the independent variable from the traversal result.
The data filling module 503 is configured to fill the missing values in the sample data corresponding to the independent variable with the data filling method that corresponds to the missing rate interval to which the data missing rate belongs, wherein different missing rate intervals correspond to different data filling methods, and the data filling methods include at least two of label group filling, BETA distribution filling, random extraction filling, logistic regression filling and mean filling.
In this embodiment, the original data with missing values and the objective function are acquired, the data missing rate of the original data is determined, and the missing data values are filled with a filling method selected according to the data missing rate, the filling methods including at least one of label group filling, BETA distribution filling, random extraction filling, logistic regression filling and mean filling. This improves the efficiency of filling missing data values and ensures the validity of the filled data, so that when the filled data is used for modeling or machine learning calculations, for example when a machine learning model calculates a user's credit overdue probability, the accuracy of the calculated overdue probability is improved and the user can be offered better-matched services.
Optionally, the data filling module 503 is configured to: if the data missing rate is greater than 70% and less than 99%, fill the missing data values by label group filling.
Optionally, the data filling module 503 is configured to: if the data missing rate is greater than 5% and less than or equal to 70%, determine whether the target variable corresponding to the missing values in the sample data differs significantly from the target variable corresponding to the non-missing values; and, when there is no significant difference, randomly extract data from the non-missing values to fill the missing values in the sample data corresponding to the independent variable.
Optionally, the data filling module 503 is further configured to: when the target variable corresponding to the missing values in the sample data differs significantly from the target variable corresponding to the non-missing values, determine whether the non-missing values are significantly correlated with the target variable; when the non-missing values are significantly correlated with the target variable, construct a left-skewed or right-skewed BETA distribution according to the direction of correlation and the degree of difference, and use the BETA distribution to fill the missing values in the sample data corresponding to the independent variable; and, if the non-missing values are not significantly correlated with the target variable, fill the missing values in the sample data corresponding to the independent variable by label group filling.
Optionally, the data filling module 503 is configured to: if the data missing rate is less than 5%, determine whether the non-missing values in the sample data are significantly correlated with the target variable; if they are not significantly correlated, fill the missing values in the sample data corresponding to the independent variable by mean filling; and if they are significantly correlated, fill the missing values in the sample data corresponding to the independent variable by logistic regression filling.
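The interval-based selection performed by the data filling module can be summarised as a small dispatch function. This is an illustrative sketch: the function name, the string labels and the two boolean inputs are hypothetical, the significance tests themselves are not shown, and a missing rate of exactly 5% is assigned to the lowest interval, following step 430. Rates of 99% and above are not covered by this section, so the sketch returns `None` for them.

```python
def choose_fill_method(missing_rate, target_differs=False, correlated=False):
    """Pick a filling method from the missing rate and two significance-test results."""
    if not 0.0 <= missing_rate <= 1.0:
        raise ValueError("missing_rate must be between 0 and 1")
    if 0.70 < missing_rate < 0.99:
        return "label_group"
    if 0.05 < missing_rate <= 0.70:
        if not target_differs:
            # Targets of missing and non-missing parts do not differ significantly.
            return "random_extraction"
        return "beta_distribution" if correlated else "label_group"
    if missing_rate <= 0.05:
        return "logistic_regression" if correlated else "mean"
    return None  # 99% and above: not specified in this section


method_high = choose_fill_method(0.80)
method_mid = choose_fill_method(0.30, target_differs=True, correlated=True)
method_low = choose_fill_method(0.02)
```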
Optionally, the apparatus may further include a filling result evaluation module, configured to, after the missing values in the sample data corresponding to the independent variable are filled with the data filling method corresponding to the missing rate interval to which the data missing rate belongs, calculate the weight values of the independent variables in the objective function and, according to the weight values and the filled data, determine a trust index for the results of subsequent calculations performed on the filled data. In other words, when the filled data is used to calculate a user's credit overdue probability, the accuracy of the calculation result is evaluated.
This embodiment further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, perform a data filling method, the method including:
acquiring sample data and an objective function, wherein the sample data includes data corresponding to at least one parameter among salary income, working time and repayment record, the objective function takes the at least one parameter as an independent variable, and the output target variable of the objective function is the user's overdue probability;
traversing the sample data according to the independent variable contained in the objective function to obtain a traversal result;
calculating, according to the traversal result, the data missing rate corresponding to the independent variable;
filling the missing values in the sample data corresponding to the independent variable with the data filling method that corresponds to the missing rate interval to which the data missing rate belongs, wherein different missing rate intervals correspond to different data filling methods, and the data filling methods include at least two of label group filling, BETA distribution filling, random extraction filling, logistic regression filling and mean filling.
When executed by a computer processor, the computer-executable instructions may also perform any of the data filling methods provided by the foregoing embodiments; for details, refer to the method flows described in those embodiments.
This embodiment further provides a data processing device, which may be a filler. FIG. 6 is a schematic diagram of the hardware structure of a data processing device provided by this embodiment. As shown in FIG. 6, the data processing device may include a processor 610 and a memory 620, and may further include a communications interface 630 and a bus 640.
The processor 610, the memory 620 and the communications interface 630 may communicate with one another through the bus 640. The communications interface 630 may be used for information transmission. The processor 610 may invoke logic instructions in the memory 620 to perform any of the methods of the foregoing embodiments.
The memory 620 may include a program storage area and a data storage area. The program storage area may store an operating system and the application program required by at least one function. The data storage area may store data created through use of the data processing device, and the like. In addition, the memory may include volatile memory, for example random access memory, and may also include non-volatile memory, for example at least one magnetic disk storage device, flash memory device or other non-transitory solid-state storage device.
In addition, when the logic instructions in the memory 620 are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. The technical solution of the present disclosure may be embodied in the form of a computer software product, which may be stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to perform all or some of the steps of the method described in this embodiment.
All or some of the flows in the methods of the foregoing embodiments may be completed by a computer program instructing the relevant hardware. The program may be stored in a non-transitory computer-readable storage medium and, when executed, may include the flows of the foregoing method embodiments.
The storage medium may be any of several types of memory device or storage device, and may include: installation media, such as a CD-ROM, floppy disk or tape device; computer system memory or random access memory, such as DRAM, DDR RAM, SRAM, EDO RAM or Rambus RAM; non-volatile memory, such as flash memory or magnetic media (for example a hard disk or optical storage); and registers or other similar types of memory element. The storage medium may also include multiple types of memory or a combination of them. In addition, the storage medium may be located in a first computer system in which the program is executed, or in a different, second computer system that is connected to the first computer system through a network such as the Internet. The second computer system may provide program instructions to the first computer system for execution. The storage medium may also include two or more storage media residing in different locations, for example in different computer systems connected through a network. The storage medium may store program instructions, such as a computer program, executable by one or more processors.
Industrial Applicability
The present disclosure provides a data filling method and apparatus that improve the efficiency of filling missing data values and ensure the validity of the filled data, so that when the filled data is used for modeling or machine learning calculations, for example when a machine learning model calculates a user's credit overdue probability, the accuracy of the calculated overdue probability is improved and the user can be offered better-matched services.

Claims (16)

  1. A data filling method, comprising:
    acquiring sample data and an objective function, wherein the sample data comprises data corresponding to at least one parameter among salary income, working time and repayment record, the objective function takes the at least one parameter as an independent variable, and the output target variable of the objective function is a user's overdue probability;
    traversing the sample data according to the independent variable contained in the objective function to obtain a traversal result;
    calculating, according to the traversal result, a data missing rate corresponding to the independent variable;
    filling missing values in the sample data corresponding to the independent variable with a data filling method corresponding to the missing rate interval to which the data missing rate belongs, wherein different missing rate intervals correspond to different data filling methods, and the data filling methods comprise at least two of label group filling, BETA distribution filling, random extraction filling, logistic regression filling and mean filling.
  2. The method according to claim 1, wherein filling missing values in the sample data corresponding to the independent variable with a data filling method corresponding to the missing rate interval to which the data missing rate belongs comprises:
    when the data missing rate is greater than 70% and less than 99%, filling the missing values in the sample data corresponding to the independent variable by label group filling.
  3. The method according to claim 1, wherein filling missing values in the sample data corresponding to the independent variable with a data filling method corresponding to the missing rate interval to which the data missing rate belongs comprises:
    when the data missing rate is greater than 5% and less than or equal to 70%, determining whether the target variable corresponding to the missing values in the sample data differs significantly from the target variable corresponding to the non-missing values;
    when the target variable corresponding to the missing values in the sample data does not differ significantly from the target variable corresponding to the non-missing values, randomly extracting data from the non-missing values to fill the missing values in the sample data corresponding to the independent variable.
  4. The method according to claim 3, further comprising, after determining whether the target variable corresponding to the missing values in the sample data differs significantly from the target variable corresponding to the non-missing values:
    when the target variable corresponding to the missing values in the sample data differs significantly from the target variable corresponding to the non-missing values, determining whether the non-missing values are significantly correlated with the target variable;
    when the non-missing values are significantly correlated with the target variable, constructing a left-skewed or right-skewed BETA distribution according to the direction of correlation and the degree of difference, and using the BETA distribution to fill the missing values in the sample data corresponding to the independent variable.
  5. The method according to claim 4, further comprising, after determining whether the non-missing values are significantly correlated with the target variable:
    when the non-missing values are not significantly correlated with the target variable, filling the missing values in the sample data corresponding to the independent variable by label group filling.
  6. The method according to claim 1, wherein filling missing values in the sample data corresponding to the independent variable with a data filling method corresponding to the missing rate interval to which the data missing rate belongs comprises:
    when the data missing rate is less than or equal to 5%, determining whether the non-missing values in the sample data are significantly correlated with the target variable;
    when the non-missing values are significantly correlated with the target variable, filling the missing values in the sample data corresponding to the independent variable by logistic regression filling.
  7. The method according to claim 6, further comprising, after determining whether the non-missing values in the sample data are significantly correlated with the target variable:
    when the non-missing values are not significantly correlated with the target variable, filling the missing values in the sample data corresponding to the independent variable by mean filling.
  8. The method according to any one of claims 1 to 7, further comprising, after filling missing values in the sample data corresponding to the independent variable with a data filling method corresponding to the missing rate interval to which the data missing rate belongs:
    calculating weight values of the independent variables in the objective function, and determining a trust index of subsequent calculation results according to the weight values and the filled data.
  9. A data filling apparatus, comprising:
    an acquisition module, configured to acquire sample data and an objective function, wherein the sample data comprises data corresponding to at least one parameter among salary income, working time and repayment record, the objective function takes the at least one parameter as an independent variable, and the output target variable of the objective function is a user's overdue probability;
    a missing rate calculation module, configured to traverse the sample data according to the independent variable contained in the objective function to obtain a traversal result, and to calculate, according to the traversal result, a data missing rate corresponding to the independent variable;
    a data filling module, configured to fill missing values in the sample data corresponding to the independent variable with a data filling method corresponding to the missing rate interval to which the data missing rate belongs, wherein different missing rate intervals correspond to different data filling methods, and the data filling methods comprise at least two of label group filling, BETA distribution filling, random extraction filling, logistic regression filling and mean filling.
  10. The apparatus according to claim 9, wherein the data filling module is configured to:
    when the data missing rate is greater than 70% and less than 99%, fill the missing values in the sample data corresponding to the independent variable by label group filling.
  11. The apparatus according to claim 9, wherein the data filling module is configured to:
    when the data missing rate is greater than 5% and less than or equal to 70%, determine whether the target variable corresponding to the missing values in the sample data differs significantly from the target variable corresponding to the non-missing values;
    when the target variable corresponding to the missing values in the sample data does not differ significantly from the target variable corresponding to the non-missing values, randomly extract data from the non-missing values to fill the missing values in the sample data corresponding to the independent variable.
  12. The apparatus of claim 11, wherein the data padding module is further configured to: after determining whether the target variable corresponding to the missing values in the sample data differs significantly from the target variable corresponding to the non-missing values, when the two differ significantly, determine whether the non-missing values are significantly correlated with the target variable; and
    when the non-missing values are significantly correlated with the target variable, construct a left-skewed or right-skewed BETA distribution according to the direction of the correlation and the degree of difference, and use the BETA distribution to fill the missing values in the sample data corresponding to the independent variable.
  13. The apparatus of claim 12, wherein the data padding module is further configured to: after determining whether the non-missing values are significantly correlated with the target variable, when the non-missing values are not significantly correlated with the target variable, fill the missing values in the sample data corresponding to the independent variable by label grouping padding.
  14. The apparatus of claim 9, wherein the data padding module is further configured to:
    when the data missing rate is less than or equal to 5%, determine whether the non-missing values in the sample data are significantly correlated with the target variable; and
    when the non-missing values are significantly correlated with the target variable, fill the missing values in the sample data corresponding to the independent variable by logistic regression padding.
  15. The apparatus of claim 14, wherein the data padding module is further configured to: after determining whether the non-missing values in the sample data are significantly correlated with the target variable, when the non-missing values are not significantly correlated with the target variable, fill the missing values in the sample data corresponding to the independent variable by mean padding.
  16. A computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being used to perform the method of any one of claims 1-8.
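
The interval-based selection recited in claims 10 to 15 can be read as a small decision tree. The following Python sketch is one illustrative reading of that tree, not the applicant's implementation. The function name `choose_padding_method`, the two boolean flags, and the returned strings are hypothetical. The significance and correlation tests are assumed to have been run elsewhere, and a missing rate of 99% or more returns `None` because the claims above do not assign a padding manner to that interval.

```python
from typing import Optional


def choose_padding_method(missing_rate: float,
                          target_differs: bool,
                          correlated: bool) -> Optional[str]:
    """Pick a padding manner for one independent variable.

    missing_rate   -- fraction of missing sample values, in [0, 1]
    target_differs -- hypothetical flag: the target variable differs
                      significantly between rows where the value is
                      missing and rows where it is present
    correlated     -- hypothetical flag: the non-missing values are
                      significantly correlated with the target variable
    """
    if not 0.0 <= missing_rate <= 1.0:
        raise ValueError("missing_rate must lie in [0, 1]")
    if missing_rate >= 0.99:
        # Not covered by claims 10-15; left to the caller (assumption).
        return None
    if missing_rate > 0.70:
        # Claim 10: greater than 70% and less than 99%.
        return "label_grouping"
    if missing_rate > 0.05:
        # Claims 11-13: greater than 5% and at most 70%.
        if not target_differs:
            return "random_sampling"
        return "beta_distribution" if correlated else "label_grouping"
    # Claims 14-15: at most 5%.
    return "logistic_regression" if correlated else "mean"
```

For example, `choose_padding_method(0.30, True, True)` returns `"beta_distribution"`, while `choose_padding_method(0.70, False, True)` returns `"random_sampling"` because 70% falls in the middle interval.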
PCT/CN2017/106280 2017-10-16 2017-10-16 Data filling method and device WO2019075599A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201780039488.0A CN109564641B (en) 2017-10-16 2017-10-16 Data filling method and device
PCT/CN2017/106280 WO2019075599A1 (en) 2017-10-16 2017-10-16 Data filling method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/106280 WO2019075599A1 (en) 2017-10-16 2017-10-16 Data filling method and device

Publications (1)

Publication Number Publication Date
WO2019075599A1 (en) 2019-04-25

Family

ID=65863683

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/106280 WO2019075599A1 (en) 2017-10-16 2017-10-16 Data filling method and device

Country Status (2)

Country Link
CN (1) CN109564641B (en)
WO (1) WO2019075599A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113704697A (en) * 2021-08-31 2021-11-26 平安科技(深圳)有限公司 Medical data missing processing method, device and equipment based on multiple regression model
CN117453696A (en) * 2023-12-07 2024-01-26 深圳拓安信物联股份有限公司 Method and device for supplementing missing data of water meter

Families Citing this family (8)

Publication number Priority date Publication date Assignee Title
CN110276412A (en) * 2019-06-28 2019-09-24 中煤科工集团重庆研究院有限公司 A kind of unordered complementing method of gas-monitoring data
CN111061999B (en) * 2019-11-19 2023-08-22 平安科技(深圳)有限公司 Data sample acquisition method, device and storage medium
CN112365070A (en) * 2020-11-18 2021-02-12 深圳供电局有限公司 Power load prediction method, device, equipment and readable storage medium
CN113468152A (en) * 2021-06-04 2021-10-01 国网上海市电力公司 High-frequency user electricity consumption data cleaning method, system, equipment and storage medium
CN113672871A (en) * 2021-08-23 2021-11-19 广东电网有限责任公司 High-proportion missing data filling method and related device
CN113742326B (en) * 2021-09-01 2024-04-12 阳光电源股份有限公司 Power optimizer and power missing value filling method and device thereof
CN113851191A (en) * 2021-09-06 2021-12-28 中科曙光国际信息产业有限公司 Gene filling method, apparatus, computer device and storage medium
CN113850523A (en) * 2021-09-29 2021-12-28 平安科技(深圳)有限公司 ESG index determining method based on data completion and related product

Citations (5)

Publication number Priority date Publication date Assignee Title
US20060218468A1 (en) * 2005-03-09 2006-09-28 Matsushita Electric Industrial Co., Ltd. Memory initialization device, memory initialization method, and error correction device
US20130226838A1 (en) * 2012-02-23 2013-08-29 International Business Machines Corporation Missing value imputation for predictive models
CN103440283A (en) * 2013-08-13 2013-12-11 江苏华大天益电力科技有限公司 Vacancy filling system for measured point data and vacancy filling method
CN104392400A (en) * 2014-12-10 2015-03-04 国家电网公司 Electric power marketing missing data completion method
CN105468594A (en) * 2014-08-11 2016-04-06 中兴通讯股份有限公司 Method and system for optimizing data collection and server

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
CN105786860B (en) * 2014-12-23 2020-07-07 华为技术有限公司 Data processing method and device in data modeling
CN105488736A (en) * 2015-12-02 2016-04-13 国家电网公司 Data processing method for photovoltaic power station data acquisition system
CN106919957B (en) * 2017-03-10 2020-03-10 广州视源电子科技股份有限公司 Method and device for processing data
CN107193876B (en) * 2017-04-21 2020-10-09 美林数据技术股份有限公司 Missing data filling method based on nearest neighbor KNN algorithm

Cited By (4)

Publication number Priority date Publication date Assignee Title
CN113704697A (en) * 2021-08-31 2021-11-26 平安科技(深圳)有限公司 Medical data missing processing method, device and equipment based on multiple regression model
CN113704697B (en) * 2021-08-31 2023-12-26 平安科技(深圳)有限公司 Medical data missing processing method, device and equipment based on multiple regression model
CN117453696A (en) * 2023-12-07 2024-01-26 深圳拓安信物联股份有限公司 Method and device for supplementing missing data of water meter
CN117453696B (en) * 2023-12-07 2024-04-12 深圳拓安信物联股份有限公司 Method and device for supplementing missing data of water meter

Also Published As

Publication number Publication date
CN109564641B (en) 2023-08-25
CN109564641A (en) 2019-04-02

Similar Documents

Publication Publication Date Title
CN109564641B (en) Data filling method and device
US20220075670A1 (en) Systems and methods for replacing sensitive data
WO2017133492A1 (en) Risk assessment method and system
EP3591586A1 (en) Data model generation using generative adversarial networks and fully automated machine learning system which generates and optimizes solutions given a dataset and a desired outcome
WO2019136990A1 (en) Network data detection method, apparatus, computer device and storage medium
WO2015135321A1 (en) Method and device for mining social relationship based on financial data
US20140314311A1 (en) System and method for classification with effective use of manual data input
CN110175697B (en) Adverse event risk prediction system and method
US20230091402A1 (en) Systems and methods for expanding data classification using synthetic data generation in machine learning models
CN112420187B (en) Medical disease analysis method based on migratory federal learning
CN111612041A (en) Abnormal user identification method and device, storage medium and electronic equipment
WO2022247955A1 (en) Abnormal account identification method, apparatus and device, and storage medium
WO2017071474A1 (en) Method and device for processing language data items and method and device for analyzing language data items
Satyanarayana Intelligent sampling for big data using bootstrap sampling and chebyshev inequality
US20190220924A1 (en) Method and device for determining key variable in model
WO2021174699A1 (en) User screening method, apparatus and device, and storage medium
Wang et al. Discovering truths from distributed data
JP2021068448A5 (en)
WO2020233067A1 (en) User behavior-based data sharing method and apparatus, and computer device
WO2022022042A1 (en) Monitoring data reporting method and apparatus, computer device, and storage medium
JPWO2019168599A5 (en)
CN110083517B (en) User image confidence optimization method and device
CN112784168A (en) Information push model training method and device, and information push method and device
CN110837847A (en) User classification method and device, storage medium and server
CN110837459A (en) Big data-based operation performance analysis method and system

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17928937

Country of ref document: EP

Kind code of ref document: A1