WO2023173548A1 - Data equalization method and apparatus, and electronic device and storage medium - Google Patents

Data equalization method and apparatus, and electronic device and storage medium

Info

Publication number
WO2023173548A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
class
majority
minority
cluster
Prior art date
Application number
PCT/CN2022/090170
Other languages
French (fr)
Chinese (zh)
Inventor
王彦
谢淋
马骏
王少军
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2023173548A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03 Credit; Loans; Processing thereof

Definitions

  • This application relates to the field of big data technology, specifically to a data equalization method, device, electronic equipment and storage medium.
  • Table data is a very common data format.
  • A table generally contains multiple fields, and each field has a clear meaning.
  • For example, the customer information table in a customer loan fraud prediction scenario can record variables such as the customer's "age", "gender", "education level" and "loan amount". These variables are called independent variables, and "whether there is a default" is the target variable.
  • The value of the target variable is predicted from the values of the independent variables.
  • When modeling tabular data, many models assume that the target variable is evenly distributed, but in practice tabular data is often imbalanced. For example, in customer loan fraud prediction, customers who commit loan fraud often account for less than 10% of the data, while non-fraudulent customers account for more than 90%. The inventors therefore recognized that equalizing tabular data is necessary.
  • The purpose of this application is to propose a data equalization method, device, electronic equipment and storage medium that address the above shortcomings of the prior art. This purpose is achieved through the following technical solutions.
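  • The first aspect of this application proposes a data equalization method, which includes:
  • Convert the variable values contained in each piece of data in the table into numerical values, where each piece of data includes the value of the independent variable and the value of the target variable;
  • Divide the data in the table into a majority class and a minority class according to the value of the target variable;
  • Cluster the data in the majority class, and perform under-sampling extraction on the majority class data according to the data proportion of each cluster to obtain the under-sampling result of the majority class;
  • Oversample the data in the minority class using a preset random perturbation strategy to obtain the oversampling result of the minority class;
  • Combine the undersampling result and the oversampling result to obtain equalized data.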
  • The second aspect of this application proposes a data equalization device, which includes:
  • A data processing module used to convert the variable values contained in each piece of data in the table into numerical values, where each piece of data includes the value of the independent variable and the value of the target variable;
  • A division module used to divide the data in the table into a majority class and a minority class according to the value of the target variable;
  • An under-sampling module used to cluster the data in the majority class, and perform under-sampling extraction of the majority class data according to the data proportion of each cluster to obtain the under-sampling result of the majority class;
  • An oversampling module used to oversample the data in the minority class using a preset random perturbation strategy to obtain the oversampling result of the minority class;
  • A merging module used to combine the undersampling result and the oversampling result to obtain equalized data.
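  • The third aspect of this application proposes an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the program:
  • Convert the variable values contained in each piece of data in the table into numerical values, where each piece of data includes the value of the independent variable and the value of the target variable;
  • Divide the data in the table into a majority class and a minority class according to the value of the target variable;
  • Cluster the data in the majority class, and perform under-sampling extraction on the majority class data according to the data proportion of each cluster to obtain the under-sampling result of the majority class;
  • Oversample the data in the minority class using a preset random perturbation strategy to obtain the oversampling result of the minority class;
  • Combine the undersampling result and the oversampling result to obtain equalized data.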
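  • The fourth aspect of this application proposes a computer-readable storage medium on which a computer program is stored, wherein the following steps are implemented when the program is executed by a processor:
  • Convert the variable values contained in each piece of data in the table into numerical values, where each piece of data includes the value of the independent variable and the value of the target variable;
  • Divide the data in the table into a majority class and a minority class according to the value of the target variable;
  • Cluster the data in the majority class, and perform under-sampling extraction on the majority class data according to the data proportion of each cluster to obtain the under-sampling result of the majority class;
  • Oversample the data in the minority class using a preset random perturbation strategy to obtain the oversampling result of the minority class;
  • Combine the undersampling result and the oversampling result to obtain equalized data.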
  • When this scheme undersamples the majority class data, it first clusters the data and then samples according to the data proportion of each cluster, so that the extracted data is more representative of the majority class and gives better results than random sampling.
  • When this scheme oversamples the minority class data, it applies a random perturbation strategy to the extracted data, which avoids the over-fitting in subsequent model training that simple repetition of minority class data can cause. Random perturbation is also efficient: perturbing tens of thousands of pieces of data completes within seconds. The scheme is particularly effective for equalizing large-scale, severely imbalanced tabular data, and can significantly improve the recall and precision on minority class data.
  • Figure 1 is a flow chart of an embodiment of a data equalization method according to an exemplary embodiment of this application;
  • Figure 2 is a schematic diagram of the elbow rule according to an exemplary embodiment of this application;
  • Figure 3 is a schematic structural diagram of a data equalization device according to an exemplary embodiment of this application;
  • Figure 4 is a schematic diagram of the hardware structure of an electronic device according to an exemplary embodiment of this application;
  • Figure 5 is a schematic structural diagram of a storage medium according to an exemplary embodiment of this application.
  • Although the terms first, second, third, etc. may be used in this application to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from each other.
  • For example, first information may also be called second information, and similarly, second information may also be called first information.
  • Depending on the context, the word "if" as used herein may be interpreted as "upon", "when" or "in response to determining".
  • Common balancing strategies used to address imbalance in tabular data include:
  • (1) Random oversampling, in which some minority class samples are randomly repeated.
  • (2) Random undersampling, in which some majority class samples are randomly deleted.
  • (3) Generating pseudo data by interpolating between nearest-neighbor samples, as in the SMOTE algorithm.
  • (4) Under-sampling based on each sample's contribution to classification, such as One-Sided Selection, which holds that points on the classification boundary are often more important for building a classification model.
  • Each of these strategies has drawbacks:
  • (1) Random oversampling balances the samples by randomly repeating some minority class samples. Although it runs quickly, directly replicating samples increases the risk of model overfitting.
  • (2) Random undersampling, which randomly deletes some majority class samples, risks losing information carried by the majority class samples, and the model may underfit.
  • (3) Generating pseudo data by interpolating between nearest-neighbor samples is inefficient when the data dimension is high or the sample size is huge. In addition, each field in tabular data has a specific meaning, and the meaning of the generated pseudo values is difficult to interpret.
  • (4) Under-sampling based on each sample's contribution to classification can effectively remove redundant and noisy points from the majority class and make the classification boundary clearer. However, the algorithm is complex, so it runs relatively slowly on large data sets.
  • The data equalization method proposed in this patent achieves better modeling results by undersampling the majority class samples and oversampling the minority class samples at the same time.
  • The specific implementation process is as follows. First, convert the variable values contained in each piece of data in the table into numerical values, where each piece of data includes the value of the independent variable and the value of the target variable, and divide the data in the table into a majority class and a minority class.
  • Then cluster the data in the majority class, and undersample the majority class data according to the data proportion of each cluster to obtain the undersampling result of the majority class.
  • Next, oversample the data in the minority class using a preset random perturbation strategy to obtain the oversampling result of the minority class, and finally combine the undersampling result and the oversampling result to obtain balanced data.
  • When this scheme undersamples the majority class data, it first clusters the data and then samples according to the data proportion of each cluster, so that the extracted data is more representative of the majority class and gives better results than random sampling.
  • When this scheme oversamples the minority class data, it applies a random perturbation strategy to the extracted data, which avoids the over-fitting in subsequent model training that simple repetition of minority class data can cause. Random perturbation is also efficient: perturbing tens of thousands of pieces of data completes within seconds. The scheme is particularly effective for equalizing large-scale, severely imbalanced tabular data, and can significantly improve the recall and precision on minority class data.
  • Figure 1 is a flow chart of an embodiment of a data equalization method according to an exemplary embodiment of this application.
  • In this embodiment, the data in the table is used as model training data for illustration.
  • Each piece of data in the table is treated as one sample, and the data in the entire table forms the training set.
  • The data equalization method includes the following steps:
  • Step 101: Convert the variable values contained in each piece of data in the table into numerical values.
  • Each piece of data includes the value of the independent variable and the value of the target variable.
  • Each table contains multiple fields, and each field has a clear meaning.
  • For example, the customer information table in a customer loan fraud prediction scenario can record variables such as the customer's "age", "gender", "education level" and "loan amount", which are called independent variables, while "whether there is a default" is the target variable. Both independent variables and target variables may be non-numeric. To support subsequent data processing, the values of non-numeric variables need to be converted into numerical values.
  • The variable values contained in each piece of data are encoded to convert them into numerical values.
  • One-hot encoding is applied to discrete variables whose values have no ordinal relationship.
  • For example, the "gender" variable takes the values "male" and "female".
  • After one-hot encoding, two new variables are formed, "gender_male" and "gender_female", each of which takes the value 0 or 1.
  • Discrete variables whose values do have an order are mapped to ordered integers. For example, the "education level" variable takes the values "Doctorate", "Master's", "Bachelor's degree", "College degree" and "High school and below". After conversion, the correspondence is: High school and below: 0, College degree: 1, Bachelor's degree: 2, Master's: 3, Doctorate: 4.
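The two encodings described in Step 101 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`age`, `gender`, `education`, `loan_amount`, `default`) and the helper functions `one_hot`, `ordinal` and `encode_row` are hypothetical and do not come from the application.

```python
# Minimal sketch of Step 101. All names below are hypothetical illustrations.
EDUCATION_ORDER = [
    "High school and below",
    "College degree",
    "Bachelor's degree",
    "Master's",
    "Doctorate",
]


def one_hot(value, categories, prefix):
    """Encode an unordered discrete value as one 0/1 variable per category."""
    return {f"{prefix}_{c}": int(value == c) for c in categories}


def ordinal(value, ordered_categories):
    """Encode an ordered discrete value as its rank (0 = lowest)."""
    return ordered_categories.index(value)


def encode_row(row):
    """Convert one record of the example customer table to numeric values."""
    encoded = {"age": row["age"], "loan_amount": row["loan_amount"]}
    encoded.update(one_hot(row["gender"], ["male", "female"], "gender"))
    encoded["education"] = ordinal(row["education"], EDUCATION_ORDER)
    encoded["default"] = int(row["default"])  # target variable as 0/1
    return encoded


row = {"age": 35, "gender": "female", "education": "Master's",
       "loan_amount": 20000, "default": False}
encoded = encode_row(row)
```

Here `encoded` contains `gender_male = 0`, `gender_female = 1` and `education = 3`, matching the correspondence given above.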
  • Step 102: Divide the data in the table into a majority class and a minority class according to the value of the target variable.
  • The division works as follows: count the number of pieces of data for each target variable value, compare the counts of the two values, assign the data with the more frequent target variable value to the majority class, and assign the data with the less frequent target variable value to the minority class.
  • For example, the target variable is "whether there is a default".
  • The variable value 1 indicates a default, and the variable value 0 indicates no default.
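The counting and comparison in Step 102 could look like the sketch below. The function name `split_by_target` and the `default` field are assumptions made for illustration; the application only specifies the counting logic, not an interface.

```python
from collections import Counter


def split_by_target(rows, target="default"):
    """Split rows into (majority, minority) by the frequency of the target value.

    Assumes a binary target, as in the example in the text.
    """
    counts = Counter(row[target] for row in rows)
    if len(counts) != 2:
        raise ValueError("expected exactly two target variable values")
    # most_common sorts by count, so the first entry is the majority value.
    (major_value, _), (minor_value, _) = counts.most_common(2)
    majority = [row for row in rows if row[target] == major_value]
    minority = [row for row in rows if row[target] == minor_value]
    return majority, minority


# Toy table: nine non-default records (0) and one default record (1).
rows = [{"default": 0} for _ in range(9)] + [{"default": 1}]
majority, minority = split_by_target(rows)
```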
  • Step 103: Cluster the data in the majority class, and perform under-sampling extraction on the majority class data according to the data proportion of each cluster to obtain the under-sampling result of the majority class.
  • When clustering the data in the majority class, the elbow rule can be used to search for the number of clusters, and the data in the majority class is then clustered using the number of clusters found.
  • The clustering algorithm can be the K-Means clustering algorithm.
  • The key is to select an appropriate value of K.
  • The sum of squared distances (SSE) from the sample points in the K clusters to their cluster centroids is used as the measure of clustering quality. The smaller the SSE, the more compact each cluster is. However, a smaller SSE is not always better: in the extreme case where every sample point is treated as its own cluster, the SSE is 0, but the clustering is meaningless.
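As a sketch of the measure just described, the SSE for K clusters can be written as follows. The symbols are notation assumed here rather than taken from the original text: C_k is the set of sample points in cluster k and \mu_k is the centroid of that cluster.

```latex
\mathrm{SSE}(K) = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^{2},
\qquad
\mu_k = \frac{1}{\lvert C_k \rvert} \sum_{x \in C_k} x
```

When every sample point forms its own cluster, each x coincides with its centroid, so every term is zero and the SSE is 0, which is the degenerate case mentioned above.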
  • The appropriate number of clusters K is selected through the elbow rule.
  • Let N be the maximum possible number of clusters. Clustering is run with the number of clusters going from 1 to N, which yields N SSE values.
  • While the number of clusters is still below the appropriate value, SSE declines rapidly as the number of clusters increases.
  • Beyond that point SSE continues to decline, but the rate of decline slows, which forms an elbow in the SSE curve.
  • The number of clusters K at the elbow can be used directly for subsequent clustering.
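The elbow search can be illustrated with a small, dependency-free K-Means. Several choices below are assumptions and not part of the application: the deterministic farthest-point seeding in `init_centroids`, the toy two-dimensional data, and the rule in `elbow_k` that picks the K with the largest change in the SSE decline. The application only says that K is read off at the elbow.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def init_centroids(points, k):
    """Deterministic farthest-point seeding (an assumption, for repeatability)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    return centroids


def kmeans(points, k, max_iter=100):
    """Plain Lloyd iterations; returns (labels, SSE)."""
    centroids = init_centroids(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = mean(members)
    sse = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, sse


def elbow_k(sse_by_k):
    """Pick K where the SSE decline slows the most; sse_by_k[i] is SSE for K = i + 1."""
    best_k, best_bend = 1, float("-inf")
    for i in range(1, len(sse_by_k) - 1):
        bend = (sse_by_k[i - 1] - sse_by_k[i]) - (sse_by_k[i] - sse_by_k[i + 1])
        if bend > best_bend:
            best_k, best_bend = i + 1, bend
    return best_k


# Toy majority-class data: three tight groups of four points each.
centers = [(0, 0), (10, 0), (0, 10)]
points = [(cx + dx, cy + dy) for cx, cy in centers
          for dx in (-0.5, 0.5) for dy in (-0.5, 0.5)]

N = 6  # maximum number of clusters to try
sse_values = [kmeans(points, k)[1] for k in range(1, N + 1)]
best_k = elbow_k(sse_values)
labels, _ = kmeans(points, best_k)
```

On this toy data the SSE falls steeply up to three clusters and only marginally afterwards, so `best_k` is 3. In practice the elbow is often read from a plot of SSE against K instead of being computed by a fixed rule.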
  • After clustering, each sample is assigned a cluster number. The number of samples in each cluster is then counted and divided by the total to obtain the data proportion of each cluster.
  • In the under-sampling process, the proportion of the majority class data contained in each cluster is determined, and the number of under-sampling extractions is determined based on the preset equilibrium ratio and the total amount of data in the majority class. Data is then extracted from each cluster according to the number of under-sampling extractions and the data proportion of that cluster.
  • The data extracted from all clusters is determined to be the under-sampling result of the majority class.
  • The preset equilibrium ratio is the desired ratio between the majority class and the minority class. Suppose the training set has 400 majority class samples and 10 minority class samples, so the current ratio between the majority class and the minority class is 40:1. To reduce this ratio to 10:1 after equalization, 100 samples must be extracted from the majority class samples in the training set, that is, the number of under-sampling extractions is 100.
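The worked example above (400 majority samples, 10 minority samples, target ratio 10:1) can be sketched as follows. The cluster sizes of 200, 120 and 80, the function names `allocate` and `undersample`, and the largest-remainder rounding used to make the per-cluster quotas add up exactly are all assumptions for illustration; the application does not specify how fractional quotas are rounded.

```python
import random
from collections import defaultdict


def allocate(cluster_sizes, n_extract):
    """Split n_extract across clusters in proportion to their sizes.

    Uses largest-remainder rounding (an assumption) so quotas sum to n_extract.
    """
    total = sum(cluster_sizes.values())
    exact = {c: n_extract * size / total for c, size in cluster_sizes.items()}
    quota = {c: int(v) for c, v in exact.items()}
    leftover = n_extract - sum(quota.values())
    by_remainder = sorted(exact, key=lambda c: exact[c] - quota[c], reverse=True)
    for c in by_remainder[:leftover]:
        quota[c] += 1
    return quota


def undersample(rows, labels, n_extract, seed=0):
    """Draw n_extract rows without replacement, cluster by cluster."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for row, label in zip(rows, labels):
        clusters[label].append(row)
    quota = allocate({c: len(m) for c, m in clusters.items()}, n_extract)
    sample = []
    for c, members in clusters.items():
        sample.extend(rng.sample(members, quota[c]))
    return sample, quota


# 400 majority samples in three hypothetical clusters of 200, 120 and 80.
rows = list(range(400))
labels = [0] * 200 + [1] * 120 + [2] * 80
n_minority, target_ratio = 10, 10          # desired majority:minority = 10:1
n_extract = n_minority * target_ratio      # 100, as in the example above
sample, quota = undersample(rows, labels, n_extract)
```

With these sizes the clusters hold 50%, 30% and 20% of the majority class, so the quotas are 50, 30 and 20 samples.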
  • Step 104: Use a preset random perturbation strategy to oversample the data in the minority class to obtain an oversampling result of the minority class.
  • In the oversampling process, data is extracted from the minority class according to a preset oversampling ratio, and a preset random perturbation strategy is applied to the extracted data. The perturbed data together with the original data in the minority class is then determined to be the oversampling result of the minority class.
  • To prevent the model from overfitting on the minority class samples, the extracted data is randomly perturbed as follows: for each piece of extracted data, some of its variable values are replaced with a preset perturbation value according to a preset perturbation ratio.
  • The perturbation value, referred to as MASK, is represented by a large number (such as 999).
  • For example, suppose the tabular data has 10 features (independent variables) and 1 target variable (Y). If the MASK ratio is set to 20%, 2 randomly chosen features are masked, that is, their original values are replaced with 999.
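A minimal sketch of the MASK-based perturbation follows, using the numbers from the example (10 features, MASK ratio 20%, MASK value 999). Two points are assumptions: the oversampling ratio is interpreted as a multiple of the minority class size, and rows are drawn with replacement. The names `perturb`, `oversample` and the feature names `f0` to `f9` are hypothetical.

```python
import random

MASK = 999  # preset perturbation value from the example in the text


def perturb(row, feature_names, mask_ratio, rng):
    """Return a copy of row with a random subset of features set to MASK."""
    n_mask = round(len(feature_names) * mask_ratio)
    masked = dict(row)
    for name in rng.sample(feature_names, n_mask):
        masked[name] = MASK
    return masked


def oversample(minority, feature_names, oversample_ratio, mask_ratio, seed=0):
    """Original minority rows plus perturbed copies of randomly drawn rows.

    oversample_ratio is read as a multiple of the minority size (assumption).
    """
    rng = random.Random(seed)
    n_extract = round(len(minority) * oversample_ratio)
    drawn = [rng.choice(minority) for _ in range(n_extract)]  # with replacement
    perturbed = [perturb(r, feature_names, mask_ratio, rng) for r in drawn]
    return minority + perturbed


# Ten minority rows with 10 features each; y is the target variable.
feature_names = [f"f{i}" for i in range(10)]
minority = [dict({name: i for name in feature_names}, y=1) for i in range(10)]
oversampled = oversample(minority, feature_names,
                         oversample_ratio=2.0, mask_ratio=0.2)
```

Each perturbed copy differs from its source row in exactly two features, so the model does not see verbatim repeats of the same minority sample, while the target variable is left untouched.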
  • Step 105: Combine the undersampling result and the oversampling result to obtain equalized data.
  • The equalized data is then used as the training data set to train the artificial intelligence model, and the generalization ability of the model is verified on the test set.
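The merge in Step 105 is a simple concatenation. The sketch below also shuffles the combined rows, which is an added assumption (the application does not mention shuffling) that is common before model training; the function name `merge` and the `y` field are hypothetical.

```python
import random


def merge(undersampled, oversampled, seed=0):
    """Concatenate both sampling results and shuffle them (shuffle is an assumption)."""
    balanced = list(undersampled) + list(oversampled)
    random.Random(seed).shuffle(balanced)
    return balanced


# Hypothetical sizes: 100 majority rows kept, 30 minority rows after oversampling.
undersampled = [{"y": 0} for _ in range(100)]
oversampled = [{"y": 1} for _ in range(30)]
balanced = merge(undersampled, oversampled)
minority_share = sum(r["y"] for r in balanced) / len(balanced)
```

In this toy case the minority class rises from 10 of 410 rows (about 2.4%) in the original 400:10 example to 30 of 130 rows (about 23%).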
  • Corresponding to the above method embodiment, this application also provides an embodiment of a data equalization device.
  • Figure 3 is a schematic structural diagram of a data equalization device according to an exemplary embodiment of the present application.
  • The device is used to perform the data equalization method provided in any of the above embodiments.
  • The data equalization device includes:
  • the data processing module 310 is used to convert the variable values contained in each piece of data in the table into numerical values, where each piece of data includes an independent variable value and a target variable value;
  • the dividing module 320 is used to divide the data in the table into majority classes and minority classes according to the value of the target variable;
  • the under-sampling module 330 is used to cluster the data in the majority class, and perform under-sampling extraction on the majority class data according to the data proportion of each cluster to obtain the under-sampling result of the majority class;
  • the oversampling module 340 is configured to use a preset random perturbation strategy to oversample the data in the minority class to obtain the oversampling result of the minority class;
  • the merging module 350 is used to combine the undersampling results and the oversampling results to obtain equalized data.
  • The data processing module 310 is specifically used to fill in the missing variable values in each piece of data, and to encode the variable values contained in each piece of data to convert them into numerical values.
  • There are two target variable values. The dividing module 320 is specifically used to count the number of pieces of data for each target variable value and compare the two counts; assign the data with the more frequent target variable value to the majority class; and assign the data with the less frequent target variable value to the minority class.
  • The under-sampling module 330 is specifically used, when under-sampling the majority class data according to the data proportion of each cluster to obtain the under-sampling result of the majority class, to: determine the proportion of the majority class data contained in each cluster; determine the number of under-sampling extractions based on the preset equilibrium ratio and the total amount of data in the majority class; extract data from each cluster according to the number of under-sampling extractions and the data proportion of that cluster; and determine the data extracted from all clusters as the under-sampling result of the majority class.
  • The oversampling module 340 is specifically used to extract data from the minority class according to a preset oversampling ratio; apply a preset random perturbation strategy to the extracted data; and determine the perturbed data together with the data in the minority class as the oversampling result of the minority class.
  • The oversampling module 340 is specifically used, when applying the preset random perturbation strategy to the extracted data, to replace, for each piece of extracted data, some variable values in that piece of data with a preset perturbation value according to a preset perturbation ratio.
  • The under-sampling module 330 is specifically used, when clustering the data in the majority class, to search for the number of clusters using the elbow rule, and to cluster the data in the majority class according to the number of clusters found.
  • Since the device embodiment basically corresponds to the method embodiment, refer to the description of the method embodiment for the relevant details.
  • The device embodiments described above are only illustrative.
  • The units described as separate components may or may not be physically separated.
  • The components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this application. Persons of ordinary skill in the art can understand and implement the solution without creative effort.
  • An embodiment of the present application also provides an electronic device corresponding to the data equalization method provided in the previous embodiment to execute the above data equalization method.
  • FIG. 4 is a hardware structure diagram of an electronic device according to an exemplary embodiment of the present application.
  • the electronic device includes: a communication interface 601, a processor 602, a memory 603 and a bus 604; wherein the communication interface 601, the processor 602 and the memory 603 complete communication with each other through the bus 604.
  • The processor 602 can execute the above-described data equalization method by reading and executing machine-executable instructions in the memory 603 that correspond to the control logic of the data equalization method. For details of the method, refer to the above embodiments; they are not repeated here.
  • The memory 603 mentioned in this application can be any electronic, magnetic, optical or other physical storage device, and can contain stored information such as executable instructions and data.
  • The memory 603 can be RAM (Random Access Memory), flash memory, a storage drive (such as a hard drive), any type of storage disk (such as an optical disk or DVD), a similar storage medium, or a combination thereof.
  • The communication connection between the system network element and at least one other network element is realized through at least one communication interface 601 (which can be wired or wireless), and can use the Internet, a wide area network, a local network, a metropolitan area network, etc.
  • the bus 604 may be an ISA bus, a PCI bus, an EISA bus, etc.
  • the bus can be divided into address bus, data bus, control bus, etc.
  • the memory 603 is used to store a program, and the processor 602 executes the program after receiving the execution instruction.
  • the processor 602 may be an integrated circuit chip with signal processing capabilities. During the implementation process, each step of the above method can be completed by instructions in the form of hardware integrated logic circuits or software in the processor 602 .
  • The above-mentioned processor 602 can be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it can also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • Each method, step and logical block diagram disclosed in the embodiment of this application can be implemented or executed.
  • a general-purpose processor may be a microprocessor or the processor may be any conventional processor, etc.
  • the steps of the method disclosed in conjunction with the embodiments of the present application can be directly implemented by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the electronic device provided by the embodiments of the present application and the data equalization method provided by the embodiments of the present application are based on the same inventive concept, and have the same beneficial effects as the methods adopted, run or implemented.
  • The embodiment of the present application also provides a computer-readable storage medium corresponding to the data equalization method provided by the previous embodiment. Please refer to Figure 5.
  • The computer-readable storage medium shown is an optical disk 30, on which a computer program (i.e., a program product) is stored. When the computer program is run by a processor, it executes the data equalization method provided by any of the foregoing embodiments.
  • the computer-readable storage medium may be non-volatile or volatile.
  • Examples of the computer-readable storage medium may also include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory, or other optical and magnetic storage media, which are not described in detail here.
  • the computer-readable storage medium provided by the above embodiments of the present application is based on the same inventive concept as the data equalization method provided by the embodiments of the present application, and has the same beneficial effects as the methods used, run or implemented by the applications stored therein.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • General Engineering & Computer Science (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • Strategic Management (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Technology Law (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

Disclosed in the present application are a data equalization method and apparatus, and an electronic device and a storage medium. The method comprises: converting, into a numerical value, a variable value included in each piece of data in a table, wherein each piece of data comprises an independent variable value and a target variable value; dividing the data in the table into a majority class and a minority class according to the target variable value; clustering data in the majority class, and performing undersampling extraction on the data in the majority class according to the data proportion of each cluster, so as to obtain an undersampling result of the majority class; performing oversampling extraction on data in the minority class by using a preset random perturbation policy, so as to obtain an oversampling result of the minority class; and combining the undersampling result and the oversampling result, so as to obtain equalized data. Undersampling extraction is realized by means of clustering data in the majority class, such that the extracted data is strongly representative of the majority class. In addition, oversampling extraction is realized by means of a random perturbation policy, such that the overfitting in subsequent model training that is caused by simple repetition of data in the minority class can be avoided.

Description

一种数据均衡化方法、装置、电子设备及存储介质A data equalization method, device, electronic equipment and storage medium
优先权申明priority statement
本申请要求于2022年3月16日提交中国专利局、申请号为202210258472.1,发明名称为“一种数据均衡化方法、装置、电子设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of the Chinese patent application submitted to the China Patent Office on March 16, 2022, with the application number 202210258472.1, and the invention name is "A data equalization method, device, electronic equipment and storage medium", and its entire content incorporated herein by reference.
技术领域Technical field
本申请涉及大数据技术领域,具体涉及一种数据均衡化方法、装置、电子设备及存储介质。This application relates to the field of big data technology, specifically to a data equalization method, device, electronic equipment and storage medium.
背景技术Background technique
表格数据是一种十分常见的数据格式,一个表格一般包含多个字段,每个字段有明确的含义。比如,客户贷款欺诈预测场景下的客户信息表,该表可以记录客户的“年龄”、“性别”、“教育水平”、“贷款金额”等变量,这些变量称为自变量,“是否违约”为目标变量,目标变量需要依据自变量的取值预测取值。在对表格数据进行建模时,许多模型都假设了目标变量的数据分布是均衡的,但是现实中表格数据分布是不均衡的,例如客户贷款欺诈预测问题中,进行贷款欺诈的客户占比往往不到10%,非欺诈客户占比则有90%以上。因此,发明人意识到对表格数据的均衡化处理显得十分必要。Table data is a very common data format. A table generally contains multiple fields, and each field has a clear meaning. For example, the customer information table in the customer loan fraud prediction scenario can record the customer's "age", "gender", "education level", "loan amount" and other variables. These variables are called independent variables, "whether there is a default" As the target variable, the target variable needs to predict the value based on the value of the independent variable. When modeling tabular data, many models assume that the data distribution of the target variable is balanced, but in reality the tabular data distribution is uneven. For example, in the problem of customer loan fraud prediction, the proportion of customers who commit loan fraud is often Less than 10%, while non-fraudulent customers account for more than 90%. Therefore, the inventor realized that it is very necessary to equalize the table data.
Summary of the Invention
In view of the above shortcomings of the prior art, this application proposes a data equalization method, apparatus, electronic device, and storage medium. This purpose is achieved through the following technical solutions.
A first aspect of this application proposes a data equalization method, the method comprising:
converting the variable values contained in each record of a table into numerical values, each record comprising independent-variable values and a target-variable value;
dividing the records in the table into a majority class and a minority class according to the target-variable value;
clustering the records in the majority class, and undersampling the majority class according to the proportion of records in each cluster, to obtain an undersampling result for the majority class;
oversampling the records in the minority class using a preset random perturbation strategy, to obtain an oversampling result for the minority class;
merging the undersampling result and the oversampling result to obtain equalized data.
A second aspect of this application proposes a data equalization apparatus, the apparatus comprising:
a data processing module, configured to convert the variable values contained in each record of a table into numerical values, each record comprising independent-variable values and a target-variable value;
a dividing module, configured to divide the records in the table into a majority class and a minority class according to the target-variable value;
an undersampling module, configured to cluster the records in the majority class, and to undersample the majority class according to the proportion of records in each cluster, to obtain an undersampling result for the majority class;
an oversampling module, configured to oversample the records in the minority class using a preset random perturbation strategy, to obtain an oversampling result for the minority class;
a merging module, configured to merge the undersampling result and the oversampling result to obtain equalized data.
A third aspect of this application proposes an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the program:
converting the variable values contained in each record of a table into numerical values, each record comprising independent-variable values and a target-variable value;
dividing the records in the table into a majority class and a minority class according to the target-variable value;
clustering the records in the majority class, and undersampling the majority class according to the proportion of records in each cluster, to obtain an undersampling result for the majority class;
oversampling the records in the minority class using a preset random perturbation strategy, to obtain an oversampling result for the minority class;
merging the undersampling result and the oversampling result to obtain equalized data.
A fourth aspect of this application proposes a computer-readable storage medium on which a computer program is stored, wherein the program implements the following steps when executed by a processor:
converting the variable values contained in each record of a table into numerical values, each record comprising independent-variable values and a target-variable value;
dividing the records in the table into a majority class and a minority class according to the target-variable value;
clustering the records in the majority class, and undersampling the majority class according to the proportion of records in each cluster, to obtain an undersampling result for the majority class;
oversampling the records in the minority class using a preset random perturbation strategy, to obtain an oversampling result for the minority class;
merging the undersampling result and the oversampling result to obtain equalized data.
Based on the data equalization method and apparatus described in the first and second aspects above, this application has at least the following beneficial effects or advantages:
When undersampling the majority class, this solution draws records according to the proportion of records in each cluster after clustering, so that the drawn records are more representative of the majority class than records drawn purely at random. When oversampling the minority class, this solution applies a random perturbation strategy to the drawn records, which avoids the overfitting in subsequent model training that simple repetition of minority-class records can cause. Random perturbation is also efficient: perturbing tens of thousands of records completes within seconds. Applied to large-scale, severely imbalanced tabular data, this solution equalizes the data effectively and can noticeably improve the recall and precision for the minority class.
Brief Description of the Drawings
The drawings described here are provided for a further understanding of this application and form a part of this application. The illustrative embodiments of this application and their descriptions are used to explain this application and do not unduly limit it. In the drawings:
Figure 1 is a flowchart of an embodiment of a data equalization method according to an exemplary embodiment of this application;
Figure 2 is a schematic diagram of the elbow method according to an exemplary embodiment of this application;
Figure 3 is a schematic structural diagram of a data equalization apparatus according to an exemplary embodiment of this application;
Figure 4 is a schematic diagram of the hardware structure of an electronic device according to an exemplary embodiment of this application;
Figure 5 is a schematic structural diagram of a storage medium according to an exemplary embodiment of this application.
Detailed Description
Exemplary embodiments are described in detail here, and examples of them are shown in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.
The terminology used in this application is for the purpose of describing particular embodiments only and is not intended to limit this application. The singular forms "a", "said", and "the" used in this application and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, and so on may be used in this application to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of this application, first information may also be called second information, and similarly, second information may also be called first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
At present, the equalization strategies commonly used to address imbalance in tabular data include:
① Random oversampling, that is, randomly repeating some minority-class samples. ② Random undersampling, that is, randomly deleting some majority-class samples. ③ Generating pseudo data by interpolating between neighboring samples, as in the SMOTE algorithm. ④ Undersampling based on each sample's contribution to classification, such as One-Sided Selection, which holds that points on the classification boundary are often more important for building a classification model.
However, experiments show that each of the above methods has shortcomings. Specifically: ① Random oversampling balances the samples by randomly repeating some minority-class samples. Although it runs fast, copying samples directly increases the risk of model overfitting. ② Random undersampling deletes some majority-class samples at random, so it risks losing information about the majority class, and the model may underfit. ③ Generating pseudo data by interpolating between neighboring samples is inefficient when the data dimension is high or the sample size is very large. In addition, each field in tabular data has a specific meaning, and the meaning of the generated pseudo values is difficult to interpret. ④ Undersampling based on each sample's contribution to classification can effectively remove redundant and noisy points from the majority class and make the classification boundary clear, but because of its high algorithmic complexity it runs inefficiently on large data sets.
Based on this, the data equalization method proposed in this patent undersamples the majority-class samples and oversamples the minority-class samples at the same time, in order to achieve a better modeling effect.
The specific implementation process is as follows: the variable values contained in each record of the table are converted into numerical values, each record comprising independent-variable values and a target-variable value; the records in the table are divided into a majority class and a minority class according to the target-variable value; the records in the majority class are clustered, and the majority class is undersampled according to the proportion of records in each cluster to obtain an undersampling result for the majority class; the records in the minority class are oversampled using a preset random perturbation strategy to obtain an oversampling result for the minority class; finally, the undersampling result and the oversampling result are merged to obtain equalized data.
The technical effects achievable based on the above description are:
When undersampling the majority class, this solution draws records according to the proportion of records in each cluster after clustering, so that the drawn records are more representative of the majority class than records drawn purely at random. When oversampling the minority class, this solution applies a random perturbation strategy to the drawn records, which avoids the overfitting in subsequent model training that simple repetition of minority-class records can cause. Random perturbation is also efficient: perturbing tens of thousands of records completes within seconds. Applied to large-scale, severely imbalanced tabular data, this solution equalizes the data effectively and can noticeably improve the recall and precision for the minority class.
To enable those skilled in the art to better understand the solution of this application, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments of this application.
Embodiment 1:
Figure 1 is a flowchart of an embodiment of a data equalization method according to an exemplary embodiment of this application. In this embodiment, the data in the table is used as model training data by way of example: one record in the table is one sample, and all the records in the table form a training set. The data equalization method includes the following steps:
Step 101: Convert the variable values contained in each record of the table into numerical values, each record comprising independent-variable values and a target-variable value.
A table contains multiple fields, and each field has a well-defined meaning. For example, the customer information table in a customer loan fraud prediction scenario may record variables such as the customer's "age", "gender", "education level", and "loan amount". These variables are called independent variables, and "whether the customer defaults" is the target variable. Both independent variables and the target variable may take non-numeric values, and for subsequent data processing these non-numeric values need to be converted into numerical values.
The specific conversion process is as follows:
First, missing variable values in each record are filled in. Specifically, a continuous variable is filled with the median of that variable, and a discrete variable is filled with the mode of that variable.
Then, the variable values contained in each record are encoded so as to convert them into numerical values. Specifically, discrete variables whose values have no ordering are one-hot encoded. For example, the "gender" variable takes the values "male" and "female"; after one-hot encoding it yields two new variables, "gender_male" and "gender_female", each of which takes the value 0 or 1. Discrete variables whose values are ordered are converted into numbers. For example, the "education" variable takes the values "doctorate", "master's", "bachelor's", "junior college", and "high school or below"; after conversion the mapping is: high school or below: 0, junior college: 1, bachelor's: 2, master's: 3, doctorate: 4.
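By way of a non-limiting illustration, the filling and encoding of Step 101 might be sketched as follows. The function names `fill_missing` and `encode`, the column names, and the sample rows are hypothetical and are not taken from this application; records are represented as plain dictionaries, with `None` marking a missing value.

```python
from statistics import median, mode

def fill_missing(rows, continuous, discrete):
    """Fill None values: median for continuous columns, mode for discrete ones."""
    filled = [dict(r) for r in rows]
    for col in continuous + discrete:
        present = [r[col] for r in rows if r[col] is not None]
        fill = median(present) if col in continuous else mode(present)
        for r in filled:
            if r[col] is None:
                r[col] = fill
    return filled

def encode(rows, nominal, ordinal):
    """One-hot encode unordered columns; map ordered columns to integer ranks."""
    categories = {c: sorted({r[c] for r in rows}) for c in nominal}
    encoded = []
    for r in rows:
        new = {}
        for col, val in r.items():
            if col in nominal:
                for cat in categories[col]:
                    new[f"{col}_{cat}"] = 1 if val == cat else 0
            elif col in ordinal:
                new[col] = ordinal[col][val]
            else:
                new[col] = val
        encoded.append(new)
    return encoded

# Rank mapping taken from the "education" example in the text.
EDUCATION_RANK = {"high school or below": 0, "junior college": 1,
                  "bachelor": 2, "master": 3, "doctorate": 4}

# Hypothetical sample rows, for illustration only.
rows = [
    {"age": 30, "gender": "male", "education": "bachelor", "default": 0},
    {"age": None, "gender": "female", "education": "master", "default": 0},
    {"age": 50, "gender": None, "education": "doctorate", "default": 1},
    {"age": 40, "gender": "female", "education": "bachelor", "default": 0},
]
clean = encode(fill_missing(rows, ["age"], ["gender"]),
               nominal=["gender"], ordinal={"education": EDUCATION_RANK})
```

In this sketch the missing age is filled with the median of the observed ages and the missing gender with the most frequent observed gender before encoding, matching the order of operations described above.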
Step 102: Divide the records in the table into a majority class and a minority class according to the target-variable value.
In an optional embodiment, when the target variable takes two values, the division is performed as follows: the number of records for each target-variable value is counted and the two counts are compared; the records belonging to the value with the larger count are assigned to the majority class, and the records belonging to the value with the smaller count are assigned to the minority class.
For example, the target variable is "whether the customer defaults", where the value 1 indicates default and the value 0 indicates no default. Assuming that 400 samples have the target value 1 and 10 samples have the target value 0, the ratio of majority class to minority class is 40:1.
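The division in Step 102 might be sketched as follows, assuming a binary target variable. The function name `split_by_target` and the field name `default` are hypothetical; the counts reuse the 400 and 10 samples from the example above.

```python
from collections import Counter

def split_by_target(rows, target):
    """Split rows into (majority, minority) by the count of each target value."""
    counts = Counter(r[target] for r in rows)
    if len(counts) != 2:
        raise ValueError("this sketch assumes exactly two target values")
    (major_val, _), (minor_val, _) = counts.most_common(2)
    majority = [r for r in rows if r[target] == major_val]
    minority = [r for r in rows if r[target] == minor_val]
    return majority, minority

# 400 samples with target value 1 and 10 samples with target value 0.
rows = [{"default": 1} for _ in range(400)] + [{"default": 0} for _ in range(10)]
majority, minority = split_by_target(rows, "default")
imbalance = len(majority) / len(minority)
```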
Step 103: Cluster the records in the majority class, and undersample the majority class according to the proportion of records in each cluster, to obtain an undersampling result for the majority class.
In an optional embodiment, when clustering the records in the majority class, the elbow method may be used to search for the number of clusters, and the records in the majority class are then clustered using the number of clusters found.
Specifically, the K-Means algorithm may be used for clustering. The key is to choose a suitable value of K. The sum of squared distances (SSE) from the sample points in the K clusters to the centroids of their clusters is generally used as the measure of clustering quality: the smaller the SSE, the more compact the clusters. However, a smaller SSE is not always better. In the extreme case where every sample point is treated as its own cluster, the SSE is 0, but the result is meaningless. A balance therefore has to be found between the number of clusters K and the SSE.
A suitable number of clusters K is therefore selected with the elbow method. First, the maximum possible number of clusters N is specified. The number of clusters is then increased from 1 up to N, and the SSE is computed for each value, giving N SSE values. Experiments show that while the chosen number of clusters approaches the true number of clusters in the data, the SSE drops rapidly; once the chosen number exceeds the true number, the SSE continues to drop, but more and more slowly. The SSE curve is plotted from the N SSE values, and the inflection point in its descent gives a suitable value of K.
As shown in Figure 2, the inflection point of the SSE appears at K=3, so for that data set 3 is a suitable number of clusters for K-Means clustering.
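The elbow search described above might be sketched as follows, under several assumptions that are not part of this application: a minimal pure-Python K-Means with deterministic farthest-point initialization (used here only to keep the sketch reproducible), a toy data set of three well-separated groups, and a "largest second difference of the SSE" rule standing in for reading the inflection point off the plotted curve. The function names are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]

def kmeans_sse(points, k, iters=100):
    """Run a basic K-Means and return the SSE of the final clustering."""
    # Farthest-point initialization: an assumption made for determinism.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(iters):
        labels = assign(points, centers)
        updated = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            # Keep the old center if a cluster happens to be empty.
            updated.append(tuple(sum(c) / len(members) for c in zip(*members))
                           if members else centers[j])
        if updated == centers:
            break
        centers = updated
    labels = assign(points, centers)
    return sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))

def elbow_k(sse):
    """Pick the K with the largest second difference of the SSE curve."""
    ks = sorted(sse)
    return max(ks[1:-1], key=lambda k: sse[k - 1] - 2 * sse[k] + sse[k + 1])

# Toy data: three well-separated groups of four points each.
points = [(x + dx, y + dy)
          for x, y in [(0, 0), (10, 10), (20, 0)]
          for dx in (0, 1) for dy in (0, 1)]
sse = {k: kmeans_sse(points, k) for k in range(1, 7)}  # N = 6
best_k = elbow_k(sse)
```

On this toy data the SSE falls sharply up to K=3 and only marginally afterwards, so the rule selects K=3. In practice a library implementation of K-Means with several random restarts would normally replace the minimal version shown here.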
It should be noted that, to improve equalization efficiency, once the number of clusters K has been determined for the first time, it can be reused directly for subsequent clustering.
Further, after the records in the majority class have been clustered, each sample is assigned the number of the cluster it belongs to. The number of samples in each cluster and the share of each cluster are then counted, which gives the proportion of records in each cluster.
For example, if the training set contains 400 majority-class samples, the number of clusters is set to K=3, and the clusters contain 200, 100, and 100 samples respectively after clustering, then the proportions of the clusters are 0.5, 0.25, and 0.25.
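Computing the proportion of each cluster from the cluster numbers assigned to the samples might look like the following sketch. The function name `cluster_shares` is hypothetical; the label counts reuse the 200, 100, and 100 samples from the example above.

```python
from collections import Counter

def cluster_shares(labels):
    """Map each cluster number to its share of all labelled samples."""
    counts = Counter(labels)
    total = len(labels)
    return {cluster: n / total for cluster, n in counts.items()}

# Cluster numbers for 400 majority-class samples split 200 / 100 / 100.
labels = [0] * 200 + [1] * 100 + [2] * 100
shares = cluster_shares(labels)
```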
In an optional embodiment, the majority class is undersampled according to the proportion of records in each cluster as follows: the proportion of the majority class accounted for by the records in each cluster is determined; the number of records to draw by undersampling is determined from a preset equalization ratio and the total number of records in the majority class; records are then drawn from each cluster according to the undersampling count and that cluster's proportion; and the records drawn from all clusters are taken as the undersampling result for the majority class.
Here, the preset equalization ratio is the desired ratio of majority class to minority class. Suppose the training set contains 400 majority-class samples and 10 minority-class samples, so that the current ratio of majority class to minority class is 40:1. If the ratio is to be reduced to 10:1 after equalization, 100 samples need to be drawn from the majority-class samples of the training set, that is, the undersampling count is 100.
Further, the total number of majority-class samples to be drawn is allocated to the clusters in proportion to the share of records in each cluster. For example, with K=3 clusters containing 200, 100, and 100 samples, the cluster proportions are 0.5, 0.25, and 0.25. A total of 100 majority-class samples need to be drawn from the training set, so after proportional allocation the numbers of samples to draw from the clusters are 0.5*100, 0.25*100, and 0.25*100.
Then, the number of samples allocated to each cluster is drawn from that cluster at random, that is, records are drawn from the corresponding cluster.
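The undersampling count, the proportional allocation, and the random draw from each cluster might be sketched together as follows. The function name `undersample`, the fixed random seed, and the use of rounding for the per-cluster quota are assumptions of this sketch; the figures reuse the example above (400 majority samples, 10 minority samples, desired ratio 10:1).

```python
import random
from collections import defaultdict

def undersample(majority, labels, n_minority, target_ratio, seed=0):
    """Draw target_ratio * n_minority majority rows, spread over clusters
    in proportion to each cluster's share of the majority class."""
    rng = random.Random(seed)
    n_total = min(len(majority), round(target_ratio * n_minority))
    by_cluster = defaultdict(list)
    for row, label in zip(majority, labels):
        by_cluster[label].append(row)
    sampled = []
    for members in by_cluster.values():
        share = len(members) / len(majority)
        quota = min(len(members), round(share * n_total))
        sampled.extend(rng.sample(members, quota))  # draw without replacement
    return sampled

majority = list(range(400))                    # stand-ins for majority rows
labels = [0] * 200 + [1] * 100 + [2] * 100     # cluster number of each row
sampled = undersample(majority, labels, n_minority=10, target_ratio=10)
```

With these figures the quotas are 50, 25, and 25, giving 100 rows in total. When the shares do not divide evenly, rounding can make the total differ slightly from the undersampling count; how to handle that remainder is left open here.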
Step 104: Oversample the records in the minority class using a preset random perturbation strategy, to obtain an oversampling result for the minority class.
In a specific embodiment, the minority class is oversampled as follows: records amounting to a preset oversampling ratio are drawn from the minority class; the drawn records are randomly perturbed using the preset random perturbation strategy; and the perturbed records together with the records in the minority class are taken as the oversampling result for the minority class.
For example, if the training set contains 10 minority-class samples and the oversampling ratio is set to 0.8, then 8 minority-class samples need to be drawn at random from the training set.
Further, a certain proportion of noise is added to the samples to prevent the model from overfitting on the minority-class samples. Accordingly, when the drawn records are randomly perturbed, for each drawn record some of its variable values are replaced with a preset perturbation value according to a preset perturbation ratio.
For example, a MASK ratio such as 10% is set, and 10% of the variable values in each sample are randomly replaced with MASK. To simplify subsequent model computation, MASK is represented by a large number (such as 999).
As shown in Table 1, the tabular data has 10 features (independent variables) and 1 target variable (Y). With the MASK ratio set to 20%, 2 features are randomly masked, that is, their original values are replaced with 999.
Figure PCTCN2022090170-appb-000001
Table 1
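The oversampling of Step 104 might be sketched as follows, assuming each minority-class sample is stored as a pair of a feature list and a target value. The function names `mask_row` and `oversample`, the fixed random seed, and the toy feature values are hypothetical; the ratios (0.8 oversampling, 20% MASK, MASK value 999, 10 features) follow the examples above.

```python
import random

MASK = 999  # large number used to represent a masked value, as in the text

def mask_row(features, mask_ratio, rng):
    """Return a copy of features with a random mask_ratio share set to MASK."""
    n_mask = round(mask_ratio * len(features))
    masked = list(features)
    for i in rng.sample(range(len(features)), n_mask):
        masked[i] = MASK
    return masked

def oversample(minority, sample_ratio, mask_ratio, seed=0):
    """Original minority samples plus perturbed copies of a random subset."""
    rng = random.Random(seed)
    n_draw = round(sample_ratio * len(minority))
    drawn = rng.sample(minority, n_draw)
    perturbed = [(mask_row(x, mask_ratio, rng), y) for x, y in drawn]
    return minority + perturbed

# 10 minority samples with 10 features each; the target value is left untouched.
minority = [([float(i + j) for j in range(10)], 1) for i in range(10)]
oversampled = oversample(minority, sample_ratio=0.8, mask_ratio=0.2)
```

Only the independent variables are masked in this sketch; the target value of each perturbed copy is kept, so the copies remain minority-class samples.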
Step 105: Merge the undersampling result and the oversampling result to obtain equalized data.
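The merge itself is a simple concatenation, as the following sketch suggests. The function name `merge` and the tuple stand-ins are hypothetical; the sizes reuse the running example, namely 100 undersampled majority rows and 18 oversampled minority rows (10 originals plus 8 perturbed copies).

```python
from collections import Counter

def merge(undersampled, oversampled):
    """Concatenate the two sampling results into one equalized data set."""
    return list(undersampled) + list(oversampled)

undersampled = [("majority", i) for i in range(100)]
oversampled = [("minority", i) for i in range(18)]
equalized = merge(undersampled, oversampled)
class_counts = Counter(label for label, _ in equalized)
```

Under these example figures the class ratio drops from 40:1 before equalization to 100:18 afterwards.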
It should be added that the equalized data is used as the training data set to train an artificial intelligence model, and the generalization ability of the model is verified on a test set.
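Recall and precision for the minority class, which this application names as the quantities improved by equalization, might be computed on a test set as in the following sketch. The function name `precision_recall` and the label vectors are hypothetical and do not represent results of this application.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Illustrative labels only: 1 marks the minority class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
```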
This completes the data equalization process shown in Figure 1. When undersampling the majority class, this solution draws records according to the proportion of records in each cluster after clustering, so that the drawn records are more representative of the majority class than records drawn purely at random. When oversampling the minority class, this solution applies a random perturbation strategy to the drawn records, which avoids the overfitting in subsequent model training that simple repetition of minority-class records can cause. Random perturbation is also efficient: perturbing tens of thousands of records completes within seconds. Applied to large-scale, severely imbalanced tabular data, this solution equalizes the data effectively and can noticeably improve the recall and precision for the minority class.
Corresponding to the foregoing embodiments of the data equalization method, this application also provides embodiments of a data equalization apparatus.
Figure 3 is a schematic structural diagram of a data equalization apparatus according to an exemplary embodiment of this application. The apparatus is configured to perform the data equalization method provided in any of the above embodiments. As shown in Figure 3, the data equalization apparatus includes:
a data processing module 310, configured to convert the variable values contained in each record of a table into numerical values, each record comprising independent-variable values and a target-variable value;
a dividing module 320, configured to divide the records in the table into a majority class and a minority class according to the target-variable value;
an undersampling module 330, configured to cluster the records in the majority class, and to undersample the majority class according to the proportion of records in each cluster, to obtain an undersampling result for the majority class;
an oversampling module 340, configured to oversample the records in the minority class using a preset random perturbation strategy, to obtain an oversampling result for the minority class;
a merging module 350, configured to merge the undersampling result and the oversampling result to obtain equalized data.
在一可选的实现方式中,所述数据处理模块310,具体用于对每条数据中缺失的变量取值进行填充;将每条数据包含的变量取值进行数据编码,以转换为数值型取值。In an optional implementation, the data processing module 310 is specifically used to fill in the missing variable values in each piece of data; perform data encoding on the variable values contained in each piece of data to convert them into numerical values. Take value.
在一可选的实现方式中,所述目标变量取值包括两个;所述划分模块320,具体用于统计每个目标变量取值的数量,将两个目标变量取值的数量进行比较;将数量大的目标变量取值所属数据划分为多数类;将数量小的目标变量取值所属数据划分为少数类。In an optional implementation, the target variable values include two; the dividing module 320 is specifically used to count the number of values of each target variable and compare the number of values of the two target variables; Divide the data with a large number of target variable values into the majority class; divide the data with a small number of target variable values into the minority class.
在一可选的实现方式中,所述欠采样模块330,具体用于在根据各个聚类的数据占比对多数类数据进行欠采样抽取,得到多数类的欠采样结果过程中,确定 每个聚类包含的数据在所述多数类中的数据占比;根据预设均衡比例和所述多数类中的数据总量确定欠采样抽取数量;利用所述欠采样抽取数量和每个聚类的数据占比从相应聚类中抽取数据;将从每个聚类中抽取的数据确定为多数类的欠采样结果。In an optional implementation, the under-sampling module 330 is specifically used to under-sample the majority class data according to the data proportion of each cluster to obtain the under-sampling result of the majority class. The data proportion of the data contained in the cluster in the majority class; determine the number of under-sampling extractions based on the preset equilibrium ratio and the total amount of data in the majority class; use the number of under-sampling extractions and the number of each cluster Data proportion extracts data from the corresponding clusters; the data extracted from each cluster is determined as the undersampled result of the majority class.
In an optional implementation, the oversampling module 340 is specifically configured to: extract data from the minority class at a preset oversampling ratio; apply random perturbation to the extracted data using a preset random perturbation strategy; and take the perturbed data together with the data in the minority class as the oversampling result of the minority class.
In an optional implementation, when applying random perturbation to the extracted data using the preset random perturbation strategy, the oversampling module 340 is specifically configured to, for each extracted data record, replace some of the variable values in that record with a preset perturbation value according to a preset perturbation ratio.
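As a non-limiting illustration, the oversampling with random perturbation could be sketched as follows. Several details are assumptions rather than statements of the source: records are drawn with replacement, only independent variable columns are perturbed (the target value is left unchanged), the perturbation ratio is read as the fraction of independent variables replaced per record, and at least one variable is always replaced. The function and parameter names are hypothetical.

```python
import random


def oversample_with_perturbation(minority, features, oversample_ratio,
                                 perturb_ratio, perturb_value=0.0, seed=0):
    """Return the minority records plus perturbed copies of drawn records.

    minority: list of dicts (records of the minority class).
    features: list of independent variable column names.
    """
    rng = random.Random(seed)
    n_draw = round(oversample_ratio * len(minority))
    # Copy each drawn record so the original minority data is not modified.
    drawn = [dict(rng.choice(minority)) for _ in range(n_draw)]
    n_perturb = max(1, round(perturb_ratio * len(features)))
    for row in drawn:
        # Replace a random subset of variable values with the preset value.
        for col in rng.sample(features, n_perturb):
            row[col] = perturb_value
    return minority + drawn
```

The perturbed copies differ from their source records in a few variables only, so the added samples are not exact duplicates of existing minority data.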
In an optional implementation, when clustering the data in the majority class, the undersampling module 330 is specifically configured to search for the number of clusters for the data in the majority class using the elbow rule, and to cluster the data in the majority class according to the number of clusters found.
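As a non-limiting illustration, one common way to apply the elbow rule is to compute the within-cluster sum of squared errors (SSE) for a range of candidate cluster counts and pick the count at which the decrease in SSE slows most sharply. The sketch below assumes the SSE values have already been obtained from some clustering algorithm (the passage above does not name one) for consecutive integer values of k; the function name `elbow_k` and the "largest second difference" criterion are assumptions.

```python
def elbow_k(sse_by_k):
    """Pick the cluster count at the sharpest bend of the SSE curve.

    sse_by_k: dict mapping candidate cluster count k -> within-cluster SSE,
    assumed to cover consecutive integer values of k.
    """
    ks = sorted(sse_by_k)
    if len(ks) < 3:
        raise ValueError("need SSE for at least three candidate k values")
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        drop_before = sse_by_k[prev_k] - sse_by_k[k]
        drop_after = sse_by_k[k] - sse_by_k[next_k]
        bend = drop_before - drop_after  # second difference of the curve
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k
```

The majority class would then be clustered once more with the selected cluster count.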
For details of how the functions and roles of each unit in the above apparatus are implemented, see the implementation of the corresponding steps in the above method; they are not repeated here.
Since the apparatus embodiment basically corresponds to the method embodiment, reference may be made to the relevant description of the method embodiment. The apparatus embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present application. A person of ordinary skill in the art can understand and implement this without creative effort.
An embodiment of the present application further provides an electronic device corresponding to the data equalization method provided in the foregoing embodiments, for performing the above data equalization method.
FIG. 4 is a hardware structure diagram of an electronic device according to an exemplary embodiment of the present application. The electronic device includes a communication interface 601, a processor 602, a memory 603, and a bus 604, where the communication interface 601, the processor 602, and the memory 603 communicate with one another through the bus 604. By reading and executing the machine-executable instructions in the memory 603 that correspond to the control logic of the data equalization method, the processor 602 can perform the data equalization method described above. For details of the method, see the above embodiments; they are not repeated here.
The memory 603 mentioned in the present application may be any electronic, magnetic, optical, or other physical storage device, and may contain or store information such as executable instructions and data. Specifically, the memory 603 may be a RAM (Random Access Memory), a flash memory, a storage drive (such as a hard disk drive), any type of storage disc (such as an optical disc or DVD), a similar storage medium, or a combination thereof. The communication connection between this system network element and at least one other network element is implemented through at least one communication interface 601 (which may be wired or wireless), and may use the Internet, a wide area network, a local area network, a metropolitan area network, or the like.
The bus 604 may be an ISA bus, a PCI bus, an EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. The memory 603 is used to store a program, and the processor 602 executes the program after receiving an execution instruction.
The processor 602 may be an integrated circuit chip with signal processing capability. During implementation, each step of the above method may be completed by an integrated logic circuit of hardware in the processor 602 or by instructions in the form of software. The processor 602 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in the embodiments of the present application may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor.
The electronic device provided by the embodiments of the present application is based on the same inventive concept as the data equalization method provided by the embodiments of the present application, and has the same beneficial effects as the method it adopts, runs, or implements.
An embodiment of the present application further provides a computer-readable storage medium corresponding to the data equalization method provided in the foregoing embodiments. Referring to FIG. 5, the computer-readable storage medium shown is an optical disc 30 on which a computer program (that is, a program product) is stored; when run by a processor, the computer program performs the data equalization method provided in any of the foregoing embodiments. The computer-readable storage medium may be non-volatile or volatile.
It should be noted that examples of the computer-readable storage medium may also include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory, or other optical or magnetic storage media, which are not enumerated here.
The computer-readable storage medium provided by the above embodiments of the present application is based on the same inventive concept as the data equalization method provided by the embodiments of the present application, and has the same beneficial effects as the method adopted, run, or implemented by the application program stored on it.
Other embodiments of the present application will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. The present application is intended to cover any variations, uses, or adaptations of the present application that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed in the present application. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present application being indicated by the following claims.
It should also be noted that the terms "include", "comprise", and any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The above are merely preferred embodiments of the present application and are not intended to limit it. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present application shall fall within the scope of protection of the present application.

Claims (20)

  1. A data equalization method, wherein the method comprises:
    converting the variable values contained in each data record in a table into numerical values, each data record comprising independent variable values and a target variable value;
    dividing the data in the table into a majority class and a minority class according to the target variable value;
    clustering the data in the majority class, and undersampling the majority class data according to the data proportion of each cluster to obtain an undersampling result of the majority class;
    oversampling the data in the minority class using a preset random perturbation strategy to obtain an oversampling result of the minority class;
    merging the undersampling result and the oversampling result to obtain equalized data.
  2. The method according to claim 1, wherein converting the variable values contained in each data record in the table into numerical values comprises:
    filling in the missing variable values in each data record;
    encoding the variable values contained in each data record so as to convert them into numerical values.
  3. The method according to claim 1, wherein the target variable takes two values;
    dividing the data in the table into a majority class and a minority class according to the target variable value comprises:
    counting the number of data records having each target variable value, and comparing the two counts;
    classifying the data having the more frequent target variable value as the majority class;
    classifying the data having the less frequent target variable value as the minority class.
  4. The method according to claim 1, wherein undersampling the majority class data according to the data proportion of each cluster to obtain the undersampling result of the majority class comprises:
    determining the proportion of the majority class data that each cluster contains;
    determining an undersampling quantity according to a preset equalization ratio and the total amount of data in the majority class;
    extracting data from each cluster using the undersampling quantity and that cluster's data proportion;
    taking the data extracted from each cluster as the undersampling result of the majority class.
  5. The method according to claim 1, wherein oversampling the data in the minority class using a preset random perturbation strategy to obtain the oversampling result of the minority class comprises:
    extracting data from the minority class at a preset oversampling ratio;
    applying random perturbation to the extracted data using the preset random perturbation strategy;
    taking the perturbed data and the data in the minority class as the oversampling result of the minority class.
  6. The method according to claim 5, wherein applying random perturbation to the extracted data using the preset random perturbation strategy comprises:
    for each extracted data record, replacing some of the variable values in that record with a preset perturbation value according to a preset perturbation ratio.
  7. The method according to claim 1, wherein clustering the data in the majority class comprises:
    searching for the number of clusters for the data in the majority class using the elbow rule;
    clustering the data in the majority class according to the number of clusters found.
  8. A data equalization apparatus, wherein the apparatus comprises:
    a data processing module, configured to convert the variable values contained in each data record in a table into numerical values, each data record comprising independent variable values and a target variable value;
    a dividing module, configured to divide the data in the table into a majority class and a minority class according to the target variable value;
    an undersampling module, configured to cluster the data in the majority class, and to undersample the majority class data according to the data proportion of each cluster to obtain an undersampling result of the majority class;
    an oversampling module, configured to oversample the data in the minority class using a preset random perturbation strategy to obtain an oversampling result of the minority class;
    a merging module, configured to merge the undersampling result and the oversampling result to obtain equalized data.
  9. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the following steps:
    converting the variable values contained in each data record in a table into numerical values, each data record comprising independent variable values and a target variable value;
    dividing the data in the table into a majority class and a minority class according to the target variable value;
    clustering the data in the majority class, and undersampling the majority class data according to the data proportion of each cluster to obtain an undersampling result of the majority class;
    oversampling the data in the minority class using a preset random perturbation strategy to obtain an oversampling result of the minority class;
    merging the undersampling result and the oversampling result to obtain equalized data.
  10. The electronic device according to claim 9, wherein converting the variable values contained in each data record in the table into numerical values comprises:
    filling in the missing variable values in each data record;
    encoding the variable values contained in each data record so as to convert them into numerical values.
  11. The electronic device according to claim 9, wherein the target variable takes two values;
    dividing the data in the table into a majority class and a minority class according to the target variable value comprises:
    counting the number of data records having each target variable value, and comparing the two counts;
    classifying the data having the more frequent target variable value as the majority class;
    classifying the data having the less frequent target variable value as the minority class.
  12. The electronic device according to claim 9, wherein undersampling the majority class data according to the data proportion of each cluster to obtain the undersampling result of the majority class comprises:
    determining the proportion of the majority class data that each cluster contains;
    determining an undersampling quantity according to a preset equalization ratio and the total amount of data in the majority class;
    extracting data from each cluster using the undersampling quantity and that cluster's data proportion;
    taking the data extracted from each cluster as the undersampling result of the majority class.
  13. The electronic device according to claim 9, wherein oversampling the data in the minority class using a preset random perturbation strategy to obtain the oversampling result of the minority class comprises:
    extracting data from the minority class at a preset oversampling ratio;
    applying random perturbation to the extracted data using the preset random perturbation strategy;
    taking the perturbed data and the data in the minority class as the oversampling result of the minority class.
  14. The electronic device according to claim 13, wherein applying random perturbation to the extracted data using the preset random perturbation strategy comprises:
    for each extracted data record, replacing some of the variable values in that record with a preset perturbation value according to a preset perturbation ratio.
  15. A computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the following steps:
    converting the variable values contained in each data record in a table into numerical values, each data record comprising independent variable values and a target variable value;
    dividing the data in the table into a majority class and a minority class according to the target variable value;
    clustering the data in the majority class, and undersampling the majority class data according to the data proportion of each cluster to obtain an undersampling result of the majority class;
    oversampling the data in the minority class using a preset random perturbation strategy to obtain an oversampling result of the minority class;
    merging the undersampling result and the oversampling result to obtain equalized data.
  16. The computer-readable storage medium according to claim 15, wherein converting the variable values contained in each data record in the table into numerical values comprises:
    filling in the missing variable values in each data record;
    encoding the variable values contained in each data record so as to convert them into numerical values.
  17. The computer-readable storage medium according to claim 15, wherein the target variable takes two values;
    dividing the data in the table into a majority class and a minority class according to the target variable value comprises:
    counting the number of data records having each target variable value, and comparing the two counts;
    classifying the data having the more frequent target variable value as the majority class;
    classifying the data having the less frequent target variable value as the minority class.
  18. The computer-readable storage medium according to claim 15, wherein undersampling the majority class data according to the data proportion of each cluster to obtain the undersampling result of the majority class comprises:
    determining the proportion of the majority class data that each cluster contains;
    determining an undersampling quantity according to a preset equalization ratio and the total amount of data in the majority class;
    extracting data from each cluster using the undersampling quantity and that cluster's data proportion;
    taking the data extracted from each cluster as the undersampling result of the majority class.
  19. The computer-readable storage medium according to claim 15, wherein oversampling the data in the minority class using a preset random perturbation strategy to obtain the oversampling result of the minority class comprises:
    extracting data from the minority class at a preset oversampling ratio;
    applying random perturbation to the extracted data using the preset random perturbation strategy;
    taking the perturbed data and the data in the minority class as the oversampling result of the minority class.
  20. The computer-readable storage medium according to claim 19, wherein applying random perturbation to the extracted data using the preset random perturbation strategy comprises:
    for each extracted data record, replacing some of the variable values in that record with a preset perturbation value according to a preset perturbation ratio.
PCT/CN2022/090170 2022-03-16 2022-04-29 Data equalization method and apparatus, and electronic device and storage medium WO2023173548A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210258472.1 2022-03-16
CN202210258472.1A CN114661701A (en) 2022-03-16 2022-03-16 Data equalization method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2023173548A1 true WO2023173548A1 (en) 2023-09-21

Family

ID=82029966

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/090170 WO2023173548A1 (en) 2022-03-16 2022-04-29 Data equalization method and apparatus, and electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN114661701A (en)
WO (1) WO2023173548A1 (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108881196A (en) * 2018-06-07 2018-11-23 中国民航大学 The semi-supervised intrusion detection method of model is generated based on depth
CN110298451A (en) * 2019-06-10 2019-10-01 上海冰鉴信息科技有限公司 A kind of equalization method and device of the lack of balance data set based on Density Clustering
CN112541536A (en) * 2020-12-09 2021-03-23 长沙理工大学 Under-sampling classification integration method, device and storage medium for credit scoring
CN112700324A (en) * 2021-01-08 2021-04-23 北京工业大学 User loan default prediction method based on combination of Catboost and restricted Boltzmann machine
CN112990329A (en) * 2021-03-26 2021-06-18 清华大学 System abnormity diagnosis method and device
CN113098862A (en) * 2021-03-31 2021-07-09 昆明理工大学 Intrusion detection method based on combination of hybrid sampling and expansion convolution
CN113111054A (en) * 2021-04-13 2021-07-13 中国石油大学(华东) Industrial data balance processing algorithm based on combination of oversampling and undersampling
US20210262947A1 (en) * 2018-05-28 2021-08-26 Riken Method and device for acquiring tomographic image data by oversampling, and control program

Also Published As

Publication number Publication date
CN114661701A (en) 2022-06-24

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22931575

Country of ref document: EP

Kind code of ref document: A1