WO2019075599A1 - Data padding method and device - Google Patents

Data padding method and device

Info

Publication number
WO2019075599A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
missing
padding
missing value
sample data
Prior art date
Application number
PCT/CN2017/106280
Other languages
English (en)
Chinese (zh)
Inventor
赵敏
林磊
Original Assignee
深圳乐信软件技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳乐信软件技术有限公司
Priority to CN201780039488.0A priority Critical patent/CN109564641B/zh
Priority to PCT/CN2017/106280 priority patent/WO2019075599A1/fr
Publication of WO2019075599A1 publication Critical patent/WO2019075599A1/fr

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N99/00 Subject matter not provided for in other groups of this subclass

Definitions

  • The present disclosure relates to the field of data processing technologies, for example, to a data padding method and apparatus.
  • Missing data may carry useful or critical information. If missing values are not properly processed, they may affect the construction of subsequent models, such as logistic regression and neural network models, and reduce the training effect of the machine learning model.
  • In the field of e-commerce, when evaluating a user's credit, a machine learning model is usually used to calculate the user's overdue probability, and the user's credit is then evaluated from that probability. If data is missing from the user sample data during training, the trained machine learning model may be unable to calculate the user's overdue probability accurately, which makes it impossible to provide the user with a well-matched service, such as adjusting the user's credit limit. In the related art, missing values are usually filled manually. With large data volumes this is inefficient, and reliance on human experience cannot guarantee the validity of the filled data.
  • The present disclosure provides a data padding method and apparatus, which can improve the efficiency of data padding.
  • This embodiment provides a data padding method, which may include:
  • acquiring sample data and an objective function, wherein the sample data includes data corresponding to at least one parameter among salary income, working time, and repayment record, the objective function takes the at least one parameter as an independent variable, and the output target variable of the objective function is a user's overdue probability;
  • wherein the data padding manners include at least two of label grouping padding, BETA distribution padding, random extraction padding, logistic regression padding, and mean padding.
  • This embodiment further provides a data padding device, which may include:
  • an acquisition module configured to acquire sample data and an objective function, wherein the sample data includes data corresponding to at least one parameter among salary income, working time, and repayment record, the objective function takes the at least one parameter as an independent variable, and the output target variable of the objective function is a user's overdue probability;
  • a missing rate calculation module configured to traverse the sample data according to the independent variable included in the objective function to obtain a traversal result, and to calculate, according to the traversal result, a data missing rate corresponding to the independent variable;
  • a data padding module configured to adopt a corresponding data padding manner according to the missing rate interval to which the data missing rate belongs, and to fill the missing values of the sample data corresponding to the independent variable, wherein different missing rate intervals correspond to different data padding manners, and the data padding manners include at least two of label grouping padding, BETA distribution padding, random extraction padding, logistic regression padding, and mean padding.
  • This embodiment further provides a computer readable storage medium storing computer executable instructions for performing any of the above methods.
  • This embodiment also provides a data processing device, including one or more processors, a memory, and one or more programs, the one or more programs being stored in the memory and, when executed by the one or more processors, performing any of the methods described above.
  • This embodiment further provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform any of the methods described above.
  • This embodiment can improve the efficiency of filling missing data values and ensure the validity of the filled data, so that when the filled data is used for modeling or machine learning, for example when a machine learning model calculates a user's credit overdue probability, the accuracy of the calculated overdue probability is improved and users can be provided with better-matched services.
  • FIG. 1 is a flowchart of a data padding method according to an embodiment
  • FIG. 3A is a flowchart of another data padding method according to an embodiment
  • FIG. 3B is a graph showing BETA distributions corresponding to different values of the parameters α and β according to an embodiment
  • FIG. 5 is a structural block diagram of a data missing value filling apparatus according to an embodiment
  • FIG. 6 is a schematic structural diagram of hardware of a data processing device according to an embodiment.
  • FIG. 1 is a flowchart of a data padding method provided by this embodiment. This embodiment is applicable to cases where missing data needs to be padded.
  • The method may be performed by a computing device such as a computer, or by a data padding device.
  • The data padding device can be implemented in at least one of software and hardware. As shown in FIG. 1, the method provided in this embodiment may include the following steps:
  • In step 110, sample data and an objective function are acquired, wherein the sample data includes data corresponding to at least one parameter among salary income, working time, and repayment record, the objective function takes the at least one parameter as an independent variable, and the output target variable of the objective function is the user's overdue probability.
  • The sample data may also be called original data, and the objective function may include a logistic regression model function and a neural network model function.
  • The target variable output by the objective function may be the user's repayment overdue probability, referred to as the overdue probability, and the original data may be the sample data used to predict the user's overdue probability. For example, the sample data may include information such as the user's salary income, working years, and repayment record, and each such item in the sample data may be referred to as an independent variable.
  • Missing data can be referred to as missing values; a missing value represents data content that is absent from the acquired raw data (such as big data). If missing values exist in the original data, modeling or training with the corresponding objective function may yield a biased model and an unsatisfactory training effect.
  • Missing values may be caused by mechanical reasons (such as data loss during data collection or storage) or human reasons (such as subjective errors by staff or historical limitations).
  • Missing values can be divided into missing completely at random (the missingness is random and does not depend on any incomplete or complete variable), missing at random (the missingness is not completely random, that is, it depends on other complete variables), and missing not at random (the missingness depends on the incomplete variable itself).
  • Missing values can also be classified as single-valued missing (all missing values belong to the same attribute) and arbitrary missing (the missing values belong to different attributes).
  • In step 120, the sample data is traversed according to the independent variable included in the objective function to obtain a traversal result, and a data missing rate corresponding to the independent variable is calculated according to the traversal result.
  • The data missing rate in the raw data can be determined by a program.
  • For example, a logistic regression model function has seven independent variables, each of which contains multiple data items that are read sequentially by the program. When a read returns null, the data item is missing and the missing count is increased by 1. After all the data has been traversed in turn, the data missing rate of the original data can be computed.
  • For example, if the sample data includes information on 100 users, of whom 70 have salary information and the remaining 30 do not, the data missing rate corresponding to the salary independent variable is 30%, and the salary information of these 30 users needs to be filled.
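  • The traversal and counting described in step 120 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the record layout (a list of dictionaries), the use of `None` for a null read, and the name `missing_rates` are all assumptions made for the example.

```python
from typing import Dict, List, Optional

# Hypothetical record layout: one dict per user, None marks a missing value.
Record = Dict[str, Optional[float]]


def missing_rates(records: List[Record], variables: List[str]) -> Dict[str, float]:
    """Return, for each variable, the fraction of records where it is missing."""
    if not records:
        raise ValueError("records must not be empty")
    rates: Dict[str, float] = {}
    for var in variables:
        # A value counts as missing when the key is absent or the value is None.
        missing = sum(1 for rec in records if rec.get(var) is None)
        rates[var] = missing / len(records)
    return rates


# Mirror the example in the text: 100 users, 30 of them without salary data.
records: List[Record] = [{"salary": 5000.0}] * 70 + [{"salary": None}] * 30
rates = missing_rates(records, ["salary"])
print(rates)  # {'salary': 0.3}
```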
  • In step 130, a corresponding data padding manner is adopted according to the missing rate interval to which the data missing rate belongs, and the missing values of the sample data corresponding to the independent variable are filled, wherein different missing rate intervals correspond to different data padding manners, and the data padding manners include at least two of label grouping padding, BETA distribution padding, random extraction padding, logistic regression padding, and mean padding.
  • The missing data can be filled automatically according to the data missing rate determined in step 120.
  • The objective function may be a function for calculating the user's overdue probability. The original data containing the user information is traversed according to the variables involved in the objective function, such as the user's salary income, working years, and repayment record. The data missing rate of each variable is calculated from its traversal result, a data padding manner is chosen according to that rate, and the missing sample data of the variable is filled to ensure the integrity of the sample data.
  • When the data missing rate is too high, a data abnormality alarm may be issued, with alarm content such as "manual inspection recommended", or the original data may be discarded directly. When the data missing rate is low, that is, most of the data is complete and only a small part is missing, for example when the data missing rate is less than 5%, the data can be filled by logistic regression. When the data missing rate is in the (70%, 99%) interval, the data can be filled by label grouping. When the data missing rate is in the (5%, 70%) interval, the missing values can be filled from a BETA distribution.
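  • The interval-based choice can be pictured as a small dispatch function. The sketch below is illustrative only: the function name and the string labels are hypothetical, and the treatment of rates of 99% and above as "alarm or discard" is an assumption based on the alarm case mentioned in the text. The interval boundaries follow the step descriptions of the later embodiments (greater than 70% and less than 99%; greater than 5% and at most 70%; at most 5%).

```python
from typing import List


def candidate_paddings(missing_rate: float) -> List[str]:
    """Map a missing rate in [0, 1] to the padding manners named for its interval."""
    if not 0.0 <= missing_rate <= 1.0:
        raise ValueError("missing_rate must be between 0 and 1")
    if missing_rate == 0.0:
        return []  # nothing to fill
    if missing_rate >= 0.99:
        # Assumption: very high rates trigger an alarm or discarding the data.
        return ["alarm_or_discard"]
    if missing_rate > 0.70:
        return ["label_grouping"]
    if missing_rate > 0.05:
        # The choice among these depends on the significance tests in steps 330-370.
        return ["random_extraction", "beta_distribution", "label_grouping"]
    # The choice here depends on the correlation test in step 430.
    return ["mean", "logistic_regression"]


print(candidate_paddings(0.30))
```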
  • The original data is thereby reasonably retained, which avoids the loss of data volume that would result from deleting an entire record because one or several of its variables are missing.
  • Different data padding manners are adopted for different data missing rates. While the original information and attributes of the missing-value portion are retained, the impact on the data distribution and on the attributes of the missing data is reduced; the data can be filled automatically, which improves padding efficiency and reduces the manual burden.
  • Missing data values can also be handled by deleting data records, by mean padding, or by manual padding.
  • With the method of deleting data records, when the sample size is small and the training data is insufficient, the overall training effect of the model is seriously affected. With mean padding, a high data missing rate seriously distorts the distribution of the original non-missing values, concentrating it at a single point, and for non-random missingness the information carried by the missing values is hidden after filling. The defect of manual padding is that, in a big data environment, the workload is heavy and efficiency is low, and the approach relies heavily on human experience and is not suitable for machine learning environments.
  • This embodiment provides a data padding method, which determines the data missing rate of the sample data from the acquired original data containing missing data and the objective function, and adopts a corresponding data padding manner to fill the missing values according to the size of the data missing rate.
  • The data padding manner includes at least one of label grouping padding, BETA distribution padding, random extraction padding, logistic regression padding, and mean padding. This improves the efficiency of filling missing data values and ensures the validity of the filled data, so that when the filled data is used for modeling or machine learning, for example when a machine learning model calculates a user's credit overdue probability, the accuracy of the calculated overdue probability is improved and the user can be provided with a well-matched service.
  • FIG. 2 is a flowchart of another data padding method provided by this embodiment. As shown in FIG. 2, the method may include the following steps:
  • In step 210, sample data and an objective function are obtained.
  • The sample data includes data corresponding to at least one parameter among salary income, working time, and repayment record, the objective function takes the at least one parameter as an independent variable, and the output target variable of the objective function is the user's overdue probability.
  • In step 220, the sample data is traversed according to the independent variable included in the objective function to obtain a traversal result, and a data missing rate corresponding to the independent variable is calculated according to the traversal result.
  • In step 230, when the data missing rate is greater than 70% and less than 99%, the missing values of the sample data corresponding to the independent variable are filled by label grouping padding.
  • A data missing rate greater than 70% and less than 99% indicates serious data loss.
  • Label grouping padding can improve the data filling efficiency.
  • The data can be padded by labeling it into two groups (1/0), as shown in Table 1:
  • For example, for the variable X1, if the data of users 001 and 004 is missing, a corresponding dummy variable (X11) can be added, and users 001 and 004 are assigned the value 1 in X11. The X1 values of users 002 and 003 are not missing, so users 002 and 003 are both assigned 0 in X11, which completes the padding of the missing values.
  • variables with a very high missing rate (e.g., a missing rate greater than 99%)
  • In this embodiment, missing values are filled by label grouping padding, that is, label grouping is used when the data missing rate is high, which improves the efficiency of data filling.
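  • The dummy-variable construction described for X1 and X11 can be sketched as below. The function name and the dictionary layout are hypothetical, and the non-missing X1 values for users 002 and 003 are made-up placeholders, since the text does not give them.

```python
from typing import Dict, Optional


def missing_indicator(values: Dict[str, Optional[float]]) -> Dict[str, int]:
    """Build the dummy variable: 1 where the original value is missing, else 0."""
    return {user_id: 1 if value is None else 0 for user_id, value in values.items()}


# X1 keyed by user number; the values for 002 and 003 are illustrative only.
x1 = {"001": None, "002": 3.2, "003": 1.5, "004": None}
x11 = missing_indicator(x1)
print(x11)  # {'001': 1, '002': 0, '003': 0, '004': 1}
```

  • The indicator keeps the fact that a value was missing as a usable signal for the model instead of discarding the record.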
  • FIG. 3A is a flowchart of another data padding method provided by this embodiment. As shown in FIG. 3A, the method provided in this embodiment may include the following steps:
  • In step 310, sample data and an objective function are obtained.
  • The sample data includes data corresponding to at least one parameter among salary income, working time, and repayment record, the objective function takes the at least one parameter as an independent variable, and the output target variable of the objective function is the user's overdue probability.
  • In step 320, the sample data is traversed according to the independent variable included in the objective function to obtain a traversal result, and a data missing rate corresponding to the independent variable is calculated according to the traversal result.
  • In step 330, when the data missing rate is greater than 5% and less than or equal to 70%, it is determined whether there is a significant difference between the target variable corresponding to the missing values in the sample data and the target variable corresponding to the non-missing values; if not, step 340 is performed; if so, step 350 is performed.
  • Correlation refers to the monotonic relationship between the variable and the target variable.
  • The Spearman correlation function can be used to judge correlation.
  • Spearman correlation is a non-parametric statistical method that does not depend on the distribution of the variables; that is, whether the non-missing values follow a normal or a non-normal distribution, the degree and direction of the relationship between the non-missing values and the target variable can be found.
  • Spearman's rank correlation coefficient is calculated.
  • The Spearman coefficient reflects the degree of correlation between the non-missing values (that is, the above variable) and the target variable. The closer the coefficient is to 1 or -1, the greater the degree of correlation; a positive coefficient indicates positive correlation and a negative coefficient indicates negative correlation.
  • A threshold range can be set for the Spearman coefficient. If the Spearman coefficient between the variable and the target variable falls within the set threshold range, the two are significantly correlated; otherwise, the correlation is not significant.
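  • A minimal standard-library sketch of the Spearman check is given below: it ranks both series (ties receive their average rank) and computes the Pearson correlation of the ranks. The threshold of 0.5 and the sample numbers are hypothetical; the text only says that a threshold range can be set.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank correlation: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sx * sy)


def is_significantly_correlated(x: Sequence[float], y: Sequence[float],
                                threshold: float = 0.5) -> bool:
    """Hypothetical rule: |coefficient| at or above the threshold counts as significant."""
    return abs(spearman(x, y)) >= threshold


# Illustrative data: overdue probability falls steadily as salary rises.
salary = [3000, 4500, 6000, 8000, 12000]
overdue = [0.40, 0.35, 0.20, 0.15, 0.05]
rho = spearman(salary, overdue)  # close to -1: strong negative correlation
```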
  • In step 340, data randomly extracted from the non-missing values is used to fill the missing values of the sample data corresponding to the independent variable.
  • That is, data is randomly drawn from the non-missing values for padding.
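  • Random extraction padding can be sketched as drawing, for each missing entry, one value at random from the observed values of the same variable. The function name, the `None` convention, and the optional seed are assumptions for illustration.

```python
import random
from typing import List, Optional


def random_extraction_fill(values: List[Optional[float]],
                           seed: Optional[int] = None) -> List[float]:
    """Replace each None with a value drawn at random from the non-missing values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("at least one non-missing value is required")
    rng = random.Random(seed)  # a seed makes the fill reproducible
    return [v if v is not None else rng.choice(observed) for v in values]


filled = random_extraction_fill([5000.0, None, 7200.0, None, 6100.0], seed=0)
```

  • Because the fill values are drawn from the observed values themselves, the filled variable keeps roughly the same distribution as the non-missing part.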
  • In step 350, it is determined whether the non-missing values are significantly correlated with the target variable; if so, step 360 is performed; if not, step 370 is performed.
  • In step 360, a left-skewed or right-skewed BETA distribution is constructed according to the direction of correlation and the degree of difference, and the BETA distribution is used to fill the missing values of the sample data corresponding to the independent variable.
  • The degree of difference refers to the degree of difference between the target variable corresponding to the missing values and the target variable corresponding to the non-missing values, and may be determined by analysis of variance. For example, the variance is calculated separately for the overdue probabilities of the users who have salary information and of the users who lack salary information, and the degree of difference is determined from the results of the variance calculation.
  • A left-skewed or right-skewed distribution can be formed by adjusting the parameters α and β of the BETA distribution, that is, by constructing a left-skewed or right-skewed BETA distribution over the value range of the non-missing values.
  • The skew direction of the BETA distribution is related to the missing-value portion and the non-missing portion of the variable and to the target variable.
  • The skewness of the BETA distribution is determined by the correlation between the missing-value portion and the target variable. For example, the greater the degree of correlation, the greater the left or right skewness of the BETA distribution, and the higher the probability that a randomly generated fill value is an extreme value. An extreme value can be understood as the maximum or minimum value, or a value within a data range containing the maximum or minimum value.
  • The mean of the BETA distribution is $\mathrm{AVG} = \alpha / (\alpha + \beta)$,
  • and the variance of the BETA distribution is $\mathrm{VAR} = \alpha\beta / \left((\alpha + \beta)^2 (\alpha + \beta + 1)\right)$, from which the following is derived (where r is the intermediate variable):
  • α and β are the parameters that jointly determine the shape of the BETA distribution.
  • When β > α, the probability that the missing value takes a small value is high, that is, the distribution is right-skewed; when β < α, the probability that the missing value takes a large value is high, that is, the distribution is left-skewed.
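  • The derived expressions do not survive in this text. One standard way to invert the mean and variance formulas above is the method of moments, sketched below under the assumption that the intermediate variable r stands for α + β; the patent's own definition of r may differ. Both inputs must be expressed on the unit interval, and the function name is hypothetical.

```python
from typing import Tuple


def beta_params_from_moments(avg: float, var: float) -> Tuple[float, float]:
    """Solve AVG = a/(a+b) and VAR = a*b/((a+b)^2 * (a+b+1)) for (a, b)."""
    if not 0.0 < avg < 1.0:
        raise ValueError("avg must lie strictly between 0 and 1")
    if not 0.0 < var < avg * (1.0 - avg):
        raise ValueError("var must lie strictly between 0 and avg * (1 - avg)")
    # Substituting a = avg * r and b = (1 - avg) * r into VAR gives
    # VAR = avg * (1 - avg) / (r + 1), hence the expression for r below.
    r = avg * (1.0 - avg) / var - 1.0  # assumed meaning: r = alpha + beta
    return avg * r, (1.0 - avg) * r


# A mean of 0.25 with variance 0.0375 gives approximately alpha = 1, beta = 3.
alpha, beta = beta_params_from_moments(0.25, 0.0375)
```

  • In this example β is larger than α, so the resulting distribution places most of its mass on small values.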
  • FIG. 3B is a graph of BETA distributions corresponding to different values of the parameters α and β provided by this embodiment.
  • The correlation between the non-missing values and the target variable, together with P50 (the 50th percentile), MAX and MIN of the non-missing values, jointly determines α and β of the estimated value distribution corresponding to the missing values, and thus determines the shape of the BETA distribution.
  • A new average value New_AVG is constructed from P50, MAX and MIN of the non-missing values, and α and β are calculated from New_AVG and the VAR of the non-missing part, where New_AVG is calculated as follows:
  • New_AVG (MAX-P50)*
  • New_AVG P50-(P50-MIN)*
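  • Because the New_AVG expressions above are incomplete in this text, the following sketch takes α and β as given inputs and only illustrates the final filling step: drawing from a BETA distribution and rescaling the draw onto the range of the non-missing values. The function name and the rescaling onto [MIN, MAX] are assumptions for illustration; `random.betavariate` is the Python standard-library sampler.

```python
import random
from typing import List, Optional


def beta_fill(values: List[Optional[float]], alpha: float, beta: float,
              seed: Optional[int] = None) -> List[float]:
    """Fill each None with a BETA(alpha, beta) draw rescaled to [MIN, MAX]."""
    if alpha <= 0 or beta <= 0:
        raise ValueError("alpha and beta must be positive")
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("at least one non-missing value is required")
    low, high = min(observed), max(observed)
    rng = random.Random(seed)
    return [
        # betavariate returns a number between 0 and 1; stretch it onto [low, high].
        v if v is not None else low + (high - low) * rng.betavariate(alpha, beta)
        for v in values
    ]


# With alpha < beta the draws lean toward the low end of the observed range.
filled_beta = beta_fill([3000.0, None, 9000.0, None, 6000.0], alpha=1.0, beta=3.0, seed=42)
```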
  • In step 370, the missing values of the sample data corresponding to the independent variable are filled by label grouping.
  • This embodiment provides a data padding method that improves the efficiency of filling missing data values and ensures the validity of the filled data, so that when the filled data is used for modeling or machine learning, for example when a machine learning model calculates a user's credit overdue probability,
  • the accuracy of the calculated overdue probability can be improved, thereby providing the user with a well-matched service.
  • FIG. 4 is a flowchart of another data padding method provided by this embodiment. As shown in FIG. 4, the method provided in this embodiment may include the following steps:
  • In step 410, sample data and an objective function are obtained.
  • The sample data includes data corresponding to at least one parameter among salary income, working time, and repayment record, the objective function takes the at least one parameter as an independent variable, and the output target variable of the objective function is the user's overdue probability.
  • In step 420, the sample data is traversed according to the independent variable included in the objective function to obtain a traversal result, and a data missing rate corresponding to the independent variable is calculated according to the traversal result.
  • In step 430, when the data missing rate is less than or equal to 5%, it is determined whether the non-missing values in the sample data are significantly correlated with the target variable; if not, step 440 is performed; if so, step 450 is performed.
  • In step 440, the missing values of the sample data corresponding to the independent variable are filled by mean padding.
  • Mean padding refers to calculating the mean of the non-missing part of the variable and filling that mean into the missing part.
  • The mean can also be replaced by the median or the mode.
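  • Mean padding, with the median and mode as alternatives, can be sketched with the Python `statistics` module. The function name and the `method` parameter are hypothetical.

```python
import statistics
from typing import List, Optional


def central_fill(values: List[Optional[float]], method: str = "mean") -> List[float]:
    """Fill each None with the mean, median, or mode of the non-missing values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("at least one non-missing value is required")
    estimators = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }
    if method not in estimators:
        raise ValueError("method must be 'mean', 'median', or 'mode'")
    fill_value = estimators[method](observed)
    return [fill_value if v is None else v for v in values]


print(central_fill([1.0, None, 3.0]))  # [1.0, 2.0, 3.0]
```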
  • In step 450, the missing values of the sample data corresponding to the independent variable are filled by logistic regression.
  • This embodiment provides a data padding method that improves the efficiency of filling missing data values, so that modeling or machine learning calculations on the filled data are more accurate.
  • The method further includes: calculating a weight value for each variable in the original data, and determining, according to the weight values and the filled data, a trust index for the results of subsequent calculations based on the data whose missing values have been filled.
  • The filled missing values are recorded accordingly, so that after a subsequent computing model produces a result from the filled data, a trust index for that result can be given.
  • For example, a logistic regression model has seven independent variables X1-X7, and the weight value (% of importance) of each independent variable can be estimated indirectly from the Wald statistic (Wald ChiSq).
  • The trust index may be the sum of the weight values of the individual independent variables that are not missing, and the statistical process and statistical results are shown in Table 2:
  • Whether the data is discarded may be determined according to the level of the trust index.
  • For example, machine learning is performed only on padded data with a trust index greater than 60%, to improve learning efficiency while achieving better learning outcomes.
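  • The trust index calculation can be sketched as follows. Two points are assumptions made for the example: each weight is taken as the variable's share of the total Wald chi-square (the text only says the weights are estimated indirectly from the Wald statistic), and the Wald values themselves are made-up numbers, since Table 2 is not reproduced here. The function names are hypothetical.

```python
from typing import Dict, Set


def weights_from_wald(wald_chisq: Dict[str, float]) -> Dict[str, float]:
    """Assumed weighting: each variable's share of the total Wald chi-square."""
    total = sum(wald_chisq.values())
    if total <= 0:
        raise ValueError("the Wald statistics must sum to a positive number")
    return {name: value / total for name, value in wald_chisq.items()}


def trust_index(weights: Dict[str, float], originally_missing: Set[str]) -> float:
    """Sum the weights of the variables that were NOT missing before padding."""
    return sum(w for name, w in weights.items() if name not in originally_missing)


# Made-up Wald chi-square values for the seven variables X1-X7.
wald = {"X1": 40.0, "X2": 25.0, "X3": 15.0, "X4": 10.0, "X5": 5.0, "X6": 3.0, "X7": 2.0}
weights = weights_from_wald(wald)

# A record whose X2 and X6 were filled in: the index is about 0.72.
index = trust_index(weights, {"X2", "X6"})
keep_for_training = index > 0.60  # the 60% threshold mentioned in the text
```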
  • FIG. 5 is a structural block diagram of a data missing value padding apparatus according to this embodiment.
  • The device can perform the data padding method provided by the foregoing embodiments, and has the functional modules and beneficial effects corresponding to that method.
  • The apparatus may include: an obtaining module 501, a missing rate calculation module 502, and a data padding module 503.
  • The obtaining module 501 is configured to acquire sample data and an objective function, wherein the sample data includes data corresponding to at least one parameter among salary income, working time, and repayment record, the objective function takes the at least one parameter as an independent variable, and the output target variable of the objective function is the user's overdue probability.
  • The missing rate calculation module 502 is configured to traverse the sample data according to the independent variable included in the objective function to obtain a traversal result, and to calculate a data missing rate corresponding to the independent variable according to the traversal result.
  • The data padding module 503 is configured to adopt a corresponding data padding manner according to the missing rate interval to which the data missing rate belongs, and to fill the missing values of the sample data corresponding to the independent variable, wherein different missing rate intervals correspond to different data padding manners,
  • and the data padding manners include at least two of label grouping padding, BETA distribution padding, random extraction padding, logistic regression padding, and mean padding.
  • In this embodiment, the data missing rate of the original data is determined from the acquired original data containing missing data and the objective function, and a corresponding data padding manner is used to fill the missing values according to the size of the data missing rate.
  • The data padding manner includes at least one of label grouping padding, BETA distribution padding, random extraction padding, logistic regression padding, and mean padding. This improves the efficiency of filling missing data values and ensures the validity of the filled data,
  • so that when the filled data is used for modeling or machine learning, for example when a machine learning model calculates a user's credit overdue probability, the accuracy of the calculated overdue probability is improved and the user can be provided with a well-matched service.
  • The data padding module 503 is configured to: if the data missing rate is greater than 70% and less than 99%, fill the missing values by label grouping padding.
  • The data padding module 503 is configured to: if the data missing rate is greater than 5% and less than or equal to 70%, determine whether there is a significant difference between the target variable corresponding to the missing values in the sample data and the target variable corresponding to the non-missing values; and, when there is no significant difference, fill the missing values of the sample data corresponding to the independent variable with data randomly extracted from the non-missing values.
  • The data padding module 503 is further configured to: when there is a significant difference between the target variable corresponding to the missing values in the sample data and the target variable corresponding to the non-missing values, determine whether the non-missing values are significantly correlated with the target variable; when the non-missing values are significantly correlated with the target variable, construct a left-skewed or right-skewed BETA distribution according to the direction of correlation and the degree of difference, and fill the missing values of the sample data corresponding to the independent variable by using the BETA distribution; and when the non-missing values are not significantly correlated with the target variable, fill the missing values of the sample data corresponding to the independent variable by label grouping padding.
  • The data padding module 503 is configured to: if the data missing rate is less than or equal to 5%, determine whether the non-missing values in the sample data are significantly correlated with the target variable; if they are not, fill the missing values of the sample data corresponding to the independent variable by mean padding; and if they are, fill the missing values of the sample data corresponding to the independent variable by logistic regression.
  • The device may further include a padding result evaluation module, configured to operate after the corresponding data padding manner has been adopted according to the missing rate interval to which the data missing rate belongs and the missing values of the sample data corresponding to the independent variable have been filled.
  • the embodiment further provides a storage medium comprising computer executable instructions for performing a data padding method when executed by a computer processor, the method comprising:
  • sample data and an objective function wherein the sample data includes data corresponding to at least one of a salary income, a working time, and a repayment record, the objective function having the at least one parameter as an independent variable, the objective function
  • the output target variable is the user's overdue probability
  • Data padding methods include at least two of tag grouping padding, beta BETA distribution padding, random padding padding, logistic regression padding, and mean padding.
  • the computer-executable instructions can also perform any of the data padding methods provided by the foregoing embodiments when executed by the computer processor.
  • the flow of the method provided by the foregoing embodiments may be referred to.
  • the present embodiment further provides a data processing device, which may be a filler, as shown in FIG. 6 , which is a hardware structure diagram of a data processing device provided by this embodiment, where the data processing device may include: A processor 610 and a memory 620; may further include a communication interface 630 and a bus 640.
  • the processor 610, the memory 620, and the communication interface 630 can complete communication with each other through the bus 640.
  • Communication interface 630 can be used for information transmission.
  • Processor 610 can invoke logic instructions in memory 620 to perform any of the methods of the above-described embodiments.
  • the memory 620 can include a storage program area and a storage data area, and the storage program area can store an operating system and an application required for at least one function.
  • the storage data area can store data and the like created according to the use of the data processing device.
  • the memory may include volatile memory, such as random access memory, and may also include non-volatile memory, for example at least one disk storage device, flash memory device, or other non-transitory solid state storage device.
  • when the logic instructions in the memory 620 described above are implemented in the form of software functional units and sold or used as separate products, the logic instructions can be stored in a computer readable storage medium.
  • the technical solution of the present disclosure may be embodied in the form of a computer software product, which may be stored in a storage medium and includes a plurality of instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method of the embodiments.
  • all or part of the processes in the foregoing embodiments may be completed by a computer program instructing related hardware, and the program may be stored in a non-transitory computer readable storage medium; when the program is executed, it may include the flow of the embodiments of the above method.
  • the above storage medium may be any of a plurality of types of memory devices or storage devices, and may include: an installation medium such as a CD-ROM, a floppy disk, or a tape device; computer system memory or random access memory such as DRAM, DDR RAM, SRAM, EDO RAM, Rambus RAM, etc.; non-volatile memory, such as flash memory or magnetic media (such as a hard disk or optical storage); registers or similar types of memory components, and the like.
  • the storage medium may also include multiple types of memory or a combination of memories. Additionally, the storage medium may be located in a first computer system in which the program is executed, or may be located in a different second computer system, the second computer system being coupled to the first computer system via a network, such as the Internet.
  • the second computer system can provide program instructions to the first computer for execution.
  • the storage medium may also include two or more storage media that can reside in different locations (e.g., in different computer systems connected through a network).
  • a storage medium may store program instructions, such as a computer program, executable by one or more processors.
  • the present disclosure provides a data padding method and apparatus, which can improve the efficiency of filling missing data values and ensure the validity of the filled data, so that when modeling or machine learning calculation is performed on the filled data, for example when the overdue probability is calculated by a machine learning model, the accuracy of the calculation result can be improved, thereby providing the user with a highly matched service.
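The branch described for data padding module 503 (missing rate below 5%) can be illustrated with a minimal Python sketch. The text does not say how "significantly correlated" is tested, so the sketch assumes a Pearson correlation coefficient compared against a hypothetical threshold of 0.3. The function names `pearson`, `mean_pad`, and `choose_low_missing_method` are illustrative and do not come from the disclosure. Logistic regression padding itself is not implemented here; the sketch only reports which method would be selected.

```python
import math
from typing import List, Optional, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either side is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


def mean_pad(values: Sequence[Optional[float]]) -> List[float]:
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to average")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def choose_low_missing_method(
    values: Sequence[Optional[float]],
    target: Sequence[float],
    threshold: float = 0.3,  # hypothetical cut-off, not taken from the disclosure
) -> str:
    """Pick the padding method for a variable whose missing rate is below 5%."""
    # Only rows where the independent variable is observed take part in the test.
    pairs = [(v, t) for v, t in zip(values, target) if v is not None]
    if len(pairs) < 2:
        return "mean"
    r = pearson([p[0] for p in pairs], [p[1] for p in pairs])
    return "logistic_regression" if abs(r) >= threshold else "mean"
```

For example, calling `choose_low_missing_method` on a salary column and a 0/1 overdue flag returns `"mean"` when the observed salaries show no linear relationship with the flag, in which case `mean_pad` can be applied directly to the column.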

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Geometry (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)

Abstract

Disclosed are a data padding method and device. The method may comprise the steps of: acquiring sample data and an objective function, the sample data including data corresponding to at least one parameter among salary, working time, and repayment records, the objective function taking the at least one parameter as an independent variable, and the target variable output by the objective function being the user's overdue probability; traversing the sample data according to the independent variable included in the objective function to obtain a traversal result; calculating, according to the traversal result, a data missing rate corresponding to the independent variable; and filling, according to the missing rate interval to which the data missing rate belongs, missing values in the sample data corresponding to the independent variable by means of the corresponding data padding method, different missing rate intervals corresponding to different data padding methods, and the data padding methods being at least two of tag grouping padding, beta distribution padding, random selection padding, logistic regression padding, and mean padding.
PCT/CN2017/106280 2017-10-16 2017-10-16 Data padding method and device WO2019075599A1 (fr)
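The traversal, missing-rate calculation, and interval-based selection described in the abstract can be sketched in Python as follows. This is an illustration under stated assumptions, not the claimed implementation: rows are modeled as dictionaries with `None` marking a missing value, and the interval table `HYPOTHETICAL_INTERVALS` is invented for the example. Only the 5% boundary appears in the text of this document; the other boundary and the assignment of methods to intervals are hypothetical.

```python
from typing import Dict, Optional, Sequence, Tuple

Row = Dict[str, Optional[float]]

# Hypothetical interval table: (exclusive upper bound of the missing rate, method).
# Only the 0.05 boundary is stated in the text; everything else is illustrative.
HYPOTHETICAL_INTERVALS: Tuple[Tuple[float, str], ...] = (
    (0.05, "mean_or_logistic_regression"),
    (0.50, "random_selection"),
    (1.01, "tag_grouping"),
)


def missing_rate(rows: Sequence[Row], variable: str) -> float:
    """Traverse the sample data and return the share of rows lacking `variable`."""
    if not rows:
        raise ValueError("sample data is empty")
    # A key that is absent or mapped to None counts as a missing value.
    missing = sum(1 for row in rows if row.get(variable) is None)
    return missing / len(rows)


def select_padding_method(
    rate: float,
    intervals: Sequence[Tuple[float, str]] = HYPOTHETICAL_INTERVALS,
) -> str:
    """Return the padding method whose missing rate interval contains `rate`."""
    for upper_bound, method in intervals:
        if rate < upper_bound:
            return method
    raise ValueError(f"missing rate {rate} is outside every interval")


if __name__ == "__main__":
    sample = [
        {"salary": 3000.0, "working_time": 2.0},
        {"salary": None, "working_time": 5.0},
        {"salary": 4500.0, "working_time": None},
        {"salary": None, "working_time": 1.0},
    ]
    for name in ("salary", "working_time"):
        rate = missing_rate(sample, name)
        print(name, rate, select_padding_method(rate))
```

Keeping the interval table as a parameter means the boundaries can be swapped for whatever intervals an embodiment actually uses without touching the selection logic.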

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201780039488.0A CN109564641B (zh) 2017-10-16 2017-10-16 Data padding method and device
PCT/CN2017/106280 WO2019075599A1 (fr) 2017-10-16 2017-10-16 Data padding method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/106280 WO2019075599A1 (fr) 2017-10-16 2017-10-16 Data padding method and device

Publications (1)

Publication Number Publication Date
WO2019075599A1 true WO2019075599A1 (fr) 2019-04-25

Family

ID=65863683

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/106280 WO2019075599A1 (fr) 2017-10-16 2017-10-16 Data padding method and device

Country Status (2)

Country Link
CN (1) CN109564641B (fr)
WO (1) WO2019075599A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
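CN113704697A (zh) * 2021-08-31 2021-11-26 平安科技(深圳)有限公司 Method, apparatus and device for handling missing medical data based on a multiple regression model
CN117453696A (zh) * 2023-12-07 2024-01-26 深圳拓安信物联股份有限公司 Method and apparatus for completing missing water meter data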

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
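CN110276412A (zh) * 2019-06-28 2019-09-24 中煤科工集团重庆研究院有限公司 Method for unordered padding of gas monitoring data
CN111061999B (zh) * 2019-11-19 2023-08-22 平安科技(深圳)有限公司 Data sample acquisition method, apparatus and storage medium
CN113468152A (zh) * 2021-06-04 2021-10-01 国网上海市电力公司 Method, system, device and storage medium for cleaning high-frequency user electricity consumption data
CN113672871A (zh) * 2021-08-23 2021-11-19 广东电网有限责任公司 Method for padding data with a high missing proportion and related apparatus
CN113742326B (zh) * 2021-09-01 2024-04-12 阳光电源股份有限公司 Power optimizer and method and apparatus for filling its missing power values
CN113851191A (zh) * 2021-09-06 2021-12-28 中科曙光国际信息产业有限公司 Gene imputation method, apparatus, computer device and storage medium
CN113850523A (zh) * 2021-09-29 2021-12-28 平安科技(深圳)有限公司 ESG index determination method based on data completion and related products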

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060218468A1 (en) * 2005-03-09 2006-09-28 Matsushita Electric Industrial Co., Ltd. Memory initialization device, memory initialization method, and error correction device
US20130226838A1 (en) * 2012-02-23 2013-08-29 International Business Machines Corporation Missing value imputation for predictive models
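CN103440283A (zh) * 2013-08-13 2013-12-11 江苏华大天益电力科技有限公司 Gap-filling system and gap-filling method for measurement point data
CN104392400A (zh) * 2014-12-10 2015-03-04 国家电网公司 Method for completing missing power marketing data
CN105468594A (zh) * 2014-08-11 2016-04-06 中兴通讯股份有限公司 Optimization method, system and server for collected data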

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
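CN105786860B (zh) * 2014-12-23 2020-07-07 华为技术有限公司 Data processing method and apparatus in data modeling
CN105488736A (zh) * 2015-12-02 2016-04-13 国家电网公司 Data processing method for a photovoltaic power station data acquisition system
CN106919957B (zh) * 2017-03-10 2020-03-10 广州视源电子科技股份有限公司 Method and apparatus for processing data
CN107193876B (zh) * 2017-04-21 2020-10-09 美林数据技术股份有限公司 Missing data padding method based on the k-nearest-neighbor (KNN) algorithm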

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060218468A1 (en) * 2005-03-09 2006-09-28 Matsushita Electric Industrial Co., Ltd. Memory initialization device, memory initialization method, and error correction device
US20130226838A1 (en) * 2012-02-23 2013-08-29 International Business Machines Corporation Missing value imputation for predictive models
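CN103440283A (zh) * 2013-08-13 2013-12-11 江苏华大天益电力科技有限公司 Gap-filling system and gap-filling method for measurement point data
CN105468594A (zh) * 2014-08-11 2016-04-06 中兴通讯股份有限公司 Optimization method, system and server for collected data
CN104392400A (zh) * 2014-12-10 2015-03-04 国家电网公司 Method for completing missing power marketing data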

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
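CN113704697A (zh) * 2021-08-31 2021-11-26 平安科技(深圳)有限公司 Method, apparatus and device for handling missing medical data based on a multiple regression model
CN113704697B (zh) * 2021-08-31 2023-12-26 平安科技(深圳)有限公司 Method, apparatus and device for handling missing medical data based on a multiple regression model
CN117453696A (zh) * 2023-12-07 2024-01-26 深圳拓安信物联股份有限公司 Method and apparatus for completing missing water meter data
CN117453696B (zh) * 2023-12-07 2024-04-12 深圳拓安信物联股份有限公司 Method and apparatus for completing missing water meter data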

Also Published As

Publication number Publication date
CN109564641B (zh) 2023-08-25
CN109564641A (zh) 2019-04-02

Similar Documents

Publication Publication Date Title
CN109564641B (zh) 数据填补方法和装置
US11256555B2 (en) Automatically scalable system for serverless hyperparameter tuning
WO2017133492A1 (fr) Procédé et système d'évaluation de risque
EP3591586A1 (fr) Génération de modèles de données à l'aide de réseaux contradictoires génératifs et système d'apprentissage machine totalement automatique qui génère et solutions optimise des solutions données d'un ensemble de données et un résultat souhaité
WO2019136990A1 (fr) Procédé de détection de données de réseau, appareil, dispositif informatique et support de stockage
US20140314311A1 (en) System and method for classification with effective use of manual data input
US11810000B2 (en) Systems and methods for expanding data classification using synthetic data generation in machine learning models
CN112420187B (zh) 一种基于迁移联邦学习的医疗疾病分析方法
CN111612041A (zh) 异常用户识别方法及装置、存储介质、电子设备
WO2022247955A1 (fr) Procédé, appareil et dispositif d'identification de compte anormal, et support de stockage
CN107633257B (zh) 数据质量评估方法及装置、计算机可读存储介质、终端
WO2017071474A1 (fr) Procédé et dispositif pour traiter des éléments de données de langue et procédé et dispositif pour analyser des éléments de données de langue
Satyanarayana Intelligent sampling for big data using bootstrap sampling and chebyshev inequality
CN114936168B (zh) 一种真实用户智能感知系统中的测试用例自动生成方法
US20190220924A1 (en) Method and device for determining key variable in model
WO2021174699A1 (fr) Procédé, appareil et dispositif de criblage d'utilisateur, et support de stockage
Wang et al. Discovering truths from distributed data
JP2021068448A5 (fr)
WO2020233067A1 (fr) Procédé et appareil de partage de données fondés sur le comportement d'utilisateur , et dispositif informatique
WO2022022042A1 (fr) Procédé et appareil de rapport de données de surveillance, dispositif informatique et support d'enregistrement
JPWO2019168599A5 (fr)
WO2023035526A1 (fr) Procédé de tri d'objets, dispositif associé et support
CN110083517B (zh) 一种用户画像置信度的优化方法及装置
CN110837847A (zh) 用户分类方法及装置、存储介质、服务器
CN110837459A (zh) 一种基于大数据的运行绩效分析方法及系统

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17928937

Country of ref document: EP

Kind code of ref document: A1