WO2023173548A1

WO2023173548A1 - Data equalization method and apparatus, and electronic device and storage medium

Info

Publication number: WO2023173548A1
Application number: PCT/CN2022/090170
Authority: WO
Inventors: 王彦; 谢淋; 马骏; 王少军
Original assignee: 平安科技（深圳）有限公司
Priority date: 2022-03-16
Filing date: 2022-04-29
Publication date: 2023-09-21
Also published as: CN114661701A

Abstract

Disclosed in the present application are a data equalization method and apparatus, an electronic device, and a storage medium. The method comprises: converting, into a numerical value, each variable value included in each piece of data in a table, wherein each piece of data comprises an independent variable value and a target variable value; dividing the data in the table into a majority class and a minority class according to the target variable value; clustering the data in the majority class, and performing undersampling extraction on the data in the majority class according to the data proportion of each cluster, so as to obtain an undersampling result of the majority class; performing oversampling extraction on the data in the minority class by using a preset random perturbation policy, so as to obtain an oversampling result of the minority class; and combining the undersampling result and the oversampling result, so as to obtain equalized data. Because undersampling extraction is performed on clustered majority-class data, the extracted data is strongly representative of the majority class. In addition, because oversampling extraction is performed with a random perturbation policy, the overfitting in subsequent model training that is caused by simple repetition of minority-class data can be avoided.

Description

A data equalization method, device, electronic equipment and storage medium

Priority statement

This application claims priority to Chinese patent application No. 202210258472.1, filed with the China Patent Office on March 16, 2022 and entitled "A data equalization method, device, electronic equipment and storage medium", the entire content of which is incorporated herein by reference.

Technical field

This application relates to the field of big data technology, specifically to a data equalization method, device, electronic equipment and storage medium.

Background art

Tabular data is a very common data format. A table generally contains multiple fields, and each field has a clear meaning. For example, the customer information table in a customer loan fraud prediction scenario can record a customer's "age", "gender", "education level", "loan amount" and other variables. These variables are called independent variables, while "whether there is a default" is the target variable, whose value is to be predicted from the values of the independent variables. When modeling tabular data, many models assume that the distribution of the target variable is balanced, but in practice tabular data is often unevenly distributed. For example, in customer loan fraud prediction, customers who commit loan fraud often account for less than 10% of the total, while non-fraudulent customers account for more than 90%. Therefore, the inventors realized that it is necessary to equalize tabular data.

Summary of the invention

The purpose of this application is to propose a data equalization method, device, electronic equipment and storage medium in view of the above-mentioned shortcomings of the prior art. This purpose is achieved through the following technical solutions.

The first aspect of this application proposes a data equalization method, which method includes:

Converting the variable values contained in each piece of data in the table into numerical values, where each piece of data includes an independent variable value and a target variable value;

Dividing the data in the table into a majority class and a minority class according to the target variable value;

Clustering the data in the majority class, and performing undersampling extraction on the majority class data according to the data proportion of each cluster to obtain the undersampling result of the majority class;

Oversampling the data in the minority class using a preset random perturbation strategy to obtain the oversampling result of the minority class;

Combining the undersampling result and the oversampling result to obtain equalized data.

The second aspect of this application proposes a data equalization device, which includes:

A data processing module, used to convert the variable values contained in each piece of data in the table into numerical values, where each piece of data includes an independent variable value and a target variable value;

The division module is used to divide the data in the table into majority classes and minority classes according to the value of the target variable;

An under-sampling module, used to cluster the data in the majority class, and perform under-sampling extraction of the majority class data according to the data proportion of each cluster to obtain the under-sampling result of the majority class;

An oversampling module, used to oversample the data in the minority class using a preset random perturbation strategy to obtain the oversampling result of the minority class;

A merging module is used to combine the undersampling results and the oversampling results to obtain equalized data.

The third aspect of this application proposes an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method described in the first aspect when executing the program.

The fourth aspect of this application proposes a computer-readable storage medium on which a computer program is stored, wherein the steps of the method described in the first aspect are implemented when the program is executed by a processor.

Based on the data equalization method and device described in the first and second aspects above, this application has at least the following beneficial effects or advantages:

When undersampling the majority class data, this solution samples according to the data proportion of each cluster after clustering, so that the extracted data is more representative of the majority class and gives better results than random sampling. When oversampling the minority class data, this solution uses a random perturbation strategy to randomly perturb the extracted data, which avoids the overfitting in subsequent model training that is caused by simple repetition of minority class data. At the same time, random perturbation executes efficiently: perturbing tens of thousands of pieces of data can be completed within seconds. Using this solution to equalize large-scale, severely imbalanced tabular data has a significant effect and can significantly improve the recall and precision for the minority class.

Description of the drawings

The drawings described here are used to provide a further understanding of the present application and constitute a part of the present application. The illustrative embodiments of the present application and their descriptions are used to explain the present application and do not constitute an improper limitation of the present application. In the drawings:

Figure 1 is an embodiment flow chart of a data equalization method shown in this application according to an exemplary embodiment;

Figure 2 is a schematic diagram of an elbow rule shown in this application according to an exemplary embodiment;

Figure 3 is a schematic structural diagram of a data equalization device according to an exemplary embodiment of the present application;

Figure 4 is a schematic diagram of the hardware structure of an electronic device according to an exemplary embodiment of the present application;

Figure 5 is a schematic structural diagram of a storage medium according to an exemplary embodiment of the present application.

Detailed description

Exemplary embodiments will be described in detail herein, examples of which are illustrated in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the appended claims.

The terminology used in this application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a," "an" and "the" are intended to include the plural forms as well, unless the context clearly dictates otherwise. It will also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, third, etc. may be used in this application to describe various information, the information should not be limited to these terms. These terms are only used to distinguish information of the same type from each other. For example, without departing from the scope of the present application, the first information may also be called second information, and similarly, the second information may also be called first information. Depending on the context, the word "if" as used herein may be interpreted as "when," "upon," or "in response to determining."

At present, the common balancing strategies used to solve the problem of table data imbalance include:

① Random oversampling: some minority class samples are randomly repeated.

② Random undersampling: some majority class samples are randomly deleted.

③ Generating pseudo data by interpolating between nearest-neighbor samples, as in the SMOTE algorithm.

④ Undersampling based on each sample's contribution to classification, as in One-Sided Selection, which holds that points on the classification boundary are often more important for building a classification model.

However, experiments have found that each of the above methods has shortcomings. Specifically: ① Random oversampling balances the samples by randomly repeating some minority class samples. Although it executes quickly, directly replicating samples increases the risk of model overfitting. ② Random undersampling, by randomly deleting some majority class samples, risks losing information about the majority class, and the model may underfit. ③ Generating pseudo data by interpolating between nearest-neighbor samples is inefficient when the data dimension is high or the sample size is huge. Moreover, in tabular data each field has a specific meaning, and the meaning of the generated pseudo values is difficult to interpret. ④ Undersampling based on the contribution of samples to classification can effectively eliminate redundant points and noise points in the majority class, making the classification boundary clear. However, because the algorithm is highly complex, it runs relatively slowly on large data sets.

Based on this, the data equalization method proposed in this patent achieves better modeling effects by simultaneously undersampling majority class samples and oversampling minority class samples.

The specific implementation process is as follows. First, the variable values contained in each piece of data in the table are converted into numerical values, where each piece of data includes an independent variable value and a target variable value. The data in the table is then divided into a majority class and a minority class according to the target variable value. Next, the data in the majority class is clustered, and the majority class data is undersampled according to the data proportion of each cluster to obtain the undersampling result of the majority class. A preset random perturbation strategy is used to oversample the data in the minority class to obtain the oversampling result of the minority class. Finally, the undersampling result and the oversampling result are combined to obtain the equalized data.

The technical effects that can be achieved based on the above description are:

In order to enable those skilled in the art to better understand the solution of the present application, the technical solution in the embodiment of the present application will be clearly and completely described below in conjunction with the drawings in the embodiment of the present application.

Example 1:

Figure 1 is a flow chart of an embodiment of a data equalization method according to an exemplary embodiment of this application. In this embodiment, the data in the table is used as model training data by way of example: each piece of data in the table is treated as a sample, and the data in the entire table is treated as the training set. The data equalization method includes the following steps:

Step 101: Convert the variable values contained in each piece of data in the table into numerical values. Each piece of data includes the value of the independent variable and the value of the target variable.

Each table contains multiple fields, and each field has a clear meaning. For example, the customer information table in a customer loan fraud prediction scenario can record a customer's "age", "gender", "education level", "loan amount" and other variables, which are called independent variables, while "whether there is a default" is the target variable. Both the independent variables and the target variable may be non-numeric. To support subsequent data processing, the values of non-numeric variables need to be converted into numerical values.

The specific conversion process is as follows:

First, fill in the missing variable values in each piece of data. Specifically, for continuous variables, the median of the variable is used to fill, and for discrete variables, the mode of the variable is used to fill.
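The filling rule above can be sketched as follows. This is a minimal illustration rather than the application's implementation: it assumes rows are Python dictionaries and that a missing value is represented by None, and the function name fill_missing and the sample rows are hypothetical.

```python
from statistics import median, mode

def fill_missing(rows, continuous, discrete):
    """Return copies of rows with None replaced by the column median or mode."""
    fill = {}
    for col in continuous:
        # Continuous variables: fill with the median of the observed values.
        fill[col] = median(r[col] for r in rows if r[col] is not None)
    for col in discrete:
        # Discrete variables: fill with the most frequent observed value.
        fill[col] = mode(r[col] for r in rows if r[col] is not None)
    return [
        {col: (fill[col] if val is None and col in fill else val)
         for col, val in r.items()}
        for r in rows
    ]

# Hypothetical rows; None marks a missing value (an assumption of this sketch).
rows = [
    {"age": 30, "gender": "male"},
    {"age": None, "gender": "female"},
    {"age": 50, "gender": None},
    {"age": 40, "gender": "female"},
]
filled = fill_missing(rows, continuous=["age"], discrete=["gender"])
```

Here the missing age is filled with the median of 30, 40 and 50, and the missing gender with the most frequent observed value. The input rows are left unchanged.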

Then, the variable values contained in each piece of data are encoded to convert them into numerical values. Specifically, one-hot encoding is applied to discrete variables whose values have no ordinal relationship. For example, the "gender" variable takes the values "male" and "female"; after one-hot encoding, two new variables are formed, "gender_male" and "gender_female", each with a value of 0 or 1. Discrete variables whose values have an ordinal relationship are converted into numbers. For example, the "education level" variable takes the values "Doctorate", "Master's", "Bachelor's degree", "College degree" and "High school and below"; after conversion, the correspondence is: High school and below: 0, College degree: 1, Bachelor's degree: 2, Master's: 3, Doctorate: 4.
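As a sketch of the two encodings just described, the snippet below one-hot encodes "gender" and maps "education" to its rank. The helper names one_hot, ordinal and encode_row, and the row layout, are assumptions made for illustration only.

```python
# Ordered from lowest to highest, so the list index is the encoded value.
EDUCATION_ORDER = [
    "High school and below",
    "College degree",
    "Bachelor's degree",
    "Master's",
    "Doctorate",
]

def one_hot(value, categories, prefix):
    """One new 0/1 variable per category, for values with no ordinal relationship."""
    return {f"{prefix}_{c}": int(value == c) for c in categories}

def ordinal(value, ordered_levels):
    """Rank of the value within an ordered list of levels."""
    return ordered_levels.index(value)

def encode_row(row):
    encoded = {"age": row["age"]}  # already numeric, kept as-is
    encoded.update(one_hot(row["gender"], ["male", "female"], "gender"))
    encoded["education"] = ordinal(row["education"], EDUCATION_ORDER)
    return encoded

encoded = encode_row({"age": 35, "gender": "female", "education": "Master's"})
print(encoded)
# {'age': 35, 'gender_male': 0, 'gender_female': 1, 'education': 3}
```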

Step 102: Divide the data in the table into majority classes and minority classes according to the value of the target variable.

In an optional embodiment, when the target variable takes two values, the division is performed as follows: count the number of pieces of data having each target variable value, compare the two counts, assign the data having the more frequent target variable value to the majority class, and assign the data having the less frequent target variable value to the minority class.

For example, the target variable is "whether there is a default", where the value 1 indicates a default and the value 0 indicates no default. Assume that 400 samples have the target variable value 1 and 10 samples have the target variable value 0. Then majority class : minority class = 40:1.
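The division can be sketched with a simple frequency count. The function name split_by_class and the target key "y" are hypothetical, and the sketch covers only the two-valued case described above.

```python
from collections import Counter

def split_by_class(rows, target="y"):
    """Split rows into (majority, minority) by the frequency of the target value."""
    counts = Counter(r[target] for r in rows)
    if len(counts) != 2:
        raise ValueError("this sketch assumes exactly two target values")
    # most_common sorts by count, largest first.
    (major_value, _), (minor_value, _) = counts.most_common(2)
    majority = [r for r in rows if r[target] == major_value]
    minority = [r for r in rows if r[target] == minor_value]
    return majority, minority

# The 400 : 10 example from the text.
rows = [{"y": 1}] * 400 + [{"y": 0}] * 10
majority, minority = split_by_class(rows)
print(len(majority), len(minority))  # 400 10
```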

Step 103: Cluster the data in the majority class, and perform under-sampling extraction on the majority class data according to the data proportion of each cluster to obtain the under-sampling result of the majority class.

In an optional embodiment, when clustering the data in the majority class, the elbow rule can be used to search for the number of clusters, and the data in the majority class is then clustered according to the number of clusters found.

Specifically, the K-Means clustering algorithm can be used, and the key is to select an appropriate value of K. Generally, the sum of squared errors (SSE), that is, the sum of squared distances from the sample points in the K clusters to their cluster centroids, is used as the measure of clustering quality: the smaller the SSE, the more compact each cluster. However, a smaller SSE is not always better. In the extreme case where every sample point is treated as its own cluster, the SSE is 0, but the clustering is meaningless. A balance therefore has to be found between the number of clusters K and the SSE.
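To make the SSE measure concrete, here is a small standard-library sketch of K-Means (plain Lloyd iterations) together with an SSE function. The initialization scheme, iteration cap and toy points are assumptions of this sketch; a production system would more likely rely on an existing clustering library.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    # Label each point with the index of its nearest centroid.
    return [min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points]

def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd iterations; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k sample points drawn without replacement
    for _ in range(iterations):
        labels = assign(points, centroids)
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # An empty cluster keeps its previous centroid.
            updated.append(tuple(sum(d) / len(members) for d in zip(*members))
                           if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return assign(points, centroids), centroids

def sse(points, labels, centroids):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))

# Toy data: two tight groups of two points each.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
sse_by_k = {}
for k in (1, 2, 4):
    labels, centroids = kmeans(points, k)
    sse_by_k[k] = sse(points, labels, centroids)
```

On this toy data the SSE drops sharply from K=1 to K=2 and reaches 0 at K=4, where every point is its own cluster. That last case is the degenerate extreme the text warns about.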

Therefore, the appropriate number of clusters K is selected through the elbow rule. First, specify the maximum possible number of clusters, N. Then increase the number of clusters from 1 to N, and calculate N SSEs. Experiments show that when the set number of clusters continues to approach the real number of clusters in the data, SSE shows a rapid decline. When the set number of clusters exceeds the real number of clusters, SSE will continue to decline, but the decline rate tends to be slow. Draw the SSE curve based on N SSE values and find the inflection point in the descending process, which is the more appropriate clustering K value.

As shown in Figure 2, the inflection point of SSE appears when K=3. Therefore, for this data set, when using K-Means clustering, the number of clusters is 3.
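The text locates the inflection point by inspecting the plotted curve. One simple way to automate that choice, offered here only as an assumed heuristic and not as the method of this application, is to pick the K at which the drop in SSE slows down the most (the largest second difference). The SSE values below are invented for illustration and are not read from Figure 2.

```python
def elbow_k(sse_values):
    """sse_values[i] is the SSE for K = i + 1; returns the K with the sharpest bend."""
    if len(sse_values) < 3:
        raise ValueError("need SSE for at least three values of K")
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(sse_values) - 1):
        drop_before = sse_values[i - 1] - sse_values[i]
        drop_after = sse_values[i] - sse_values[i + 1]
        bend = drop_before - drop_after  # large when a fast decline turns slow
        if bend > best_bend:
            best_k, best_bend = i + 1, bend
    return best_k

# Illustrative SSE values for K = 1..6 (made up for this sketch).
sse_curve = [1000.0, 520.0, 150.0, 120.0, 100.0, 90.0]
print(elbow_k(sse_curve))  # 3
```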

It should be noted that in order to improve the balancing efficiency, after the clustering number K is determined for the first time, the clustering number K can be directly used for subsequent clustering.

Furthermore, after the data in the majority class is clustered, each sample is assigned a cluster number. The number of samples in each cluster is then counted, and its share of the majority class is calculated, which gives the data proportion of each cluster.

For example, the number of majority class samples in the training set is 400, and the number of clusters K=3 is set. Assume that the number of samples in each cluster after clustering is 200, 100, and 100 respectively, then the proportion of each cluster is 0.5, 0.25, 0.25.
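The proportion calculation in this example can be sketched as follows, assuming the cluster numbers are available as a list of labels (the function name cluster_proportions is hypothetical).

```python
from collections import Counter

def cluster_proportions(labels):
    """Map each cluster number to its share of all labelled samples."""
    counts = Counter(labels)
    total = len(labels)
    return {cluster: n / total for cluster, n in sorted(counts.items())}

# 400 majority samples split 200 / 100 / 100 across K = 3 clusters.
labels = [0] * 200 + [1] * 100 + [2] * 100
proportions = cluster_proportions(labels)
print(proportions)  # {0: 0.5, 1: 0.25, 2: 0.25}
```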

In an optional embodiment, undersampling the majority class data according to the data proportion of each cluster proceeds as follows: determine the proportion of the majority class data that each cluster contains; determine the number of samples to extract by undersampling based on the preset equilibrium ratio and the total amount of data in the majority class; extract data from each cluster according to that number and the cluster's data proportion; and take the data extracted from all clusters as the undersampling result of the majority class.

Among them, the preset equilibrium ratio is the ideal ratio between the majority class and the minority class. Assume that the majority class in the training set has 400 samples, the number of minority class samples in the training set is 10, and the current number ratio between the majority class and the minority class is 40 : 1. If you want the ratio of the majority class to the minority class to be reduced to 10:1 after equalization, you need to extract 100 samples from the majority class samples in the training set, that is, the number of undersampled samples is 100.
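The arithmetic in this example can be written out as below. This is one reading of the text: the sketch derives the count from the minority class size and the target ratio, which reproduces the 400 : 10 example, and caps the result at the number of majority samples available. The function name and the capping behaviour are assumptions.

```python
def undersample_count(majority_size, minority_size, target_ratio):
    """Majority samples to keep so that majority : minority is about target_ratio : 1."""
    wanted = round(minority_size * target_ratio)
    # Never ask for more samples than the majority class actually holds.
    return min(wanted, majority_size)

print(undersample_count(400, 10, 10))  # 100
```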

Furthermore, the total number of majority class samples to be extracted is allocated to the clusters in proportion to their data proportions. For example, if the number of clusters is K=3 and the clusters contain 200, 100, and 100 samples respectively, the cluster proportions are 0.5, 0.25, and 0.25. If a total of 100 majority class samples need to be extracted from the training set, then after proportional allocation the numbers of samples to extract from the clusters are 0.5*100 = 50, 0.25*100 = 25, and 0.25*100 = 25.

Then, random sampling is used to extract from each cluster the number of samples allocated to it.
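Proportional allocation followed by random sampling within each cluster might look like the sketch below. The function name stratified_undersample, the fixed seed and the rounding rule are assumptions; with other cluster sizes, rounding can make the total differ slightly from the requested count.

```python
import random
from collections import defaultdict

def stratified_undersample(samples, labels, total_to_draw, seed=0):
    """Draw from each cluster in proportion to its share of the samples."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for sample, label in zip(samples, labels):
        clusters[label].append(sample)
    drawn = []
    for label, members in sorted(clusters.items()):
        share = len(members) / len(samples)
        # Rounded quota, capped at the cluster size.
        quota = min(len(members), round(share * total_to_draw))
        drawn.extend(rng.sample(members, quota))  # without replacement
    return drawn

# 400 majority samples in clusters of 200 / 100 / 100; draw 100 in total.
samples = list(range(400))
labels = [0] * 200 + [1] * 100 + [2] * 100
drawn = stratified_undersample(samples, labels, 100)
print(len(drawn))  # 100
```

In this example 50 samples come from the first cluster and 25 from each of the other two, matching the allocation worked out in the text.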

Step 104: Use a preset random perturbation strategy to oversample the data in the minority class to obtain an oversampling result of the minority class.

In a specific embodiment, for the oversampling extraction process of the minority class, data with a preset oversampling ratio is extracted from the minority class, and a preset random perturbation strategy is used to perform random perturbation processing on the extracted data. Then the perturbed data and the data in the minority class are determined as the oversampling results of the minority class.

For example, if there are 10 minority class samples in the training set and the oversampling ratio is set to 0.8, 8 minority class samples need to be randomly extracted from the training set.
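The extraction step in this example can be sketched as below. It assumes an oversampling ratio of at most 1 and sampling without replacement; neither detail is stated in the text, and the function name is hypothetical.

```python
import random

def draw_for_oversampling(minority, ratio, seed=0):
    """Randomly pick round(len(minority) * ratio) minority samples (ratio <= 1 assumed)."""
    rng = random.Random(seed)
    count = round(len(minority) * ratio)
    return rng.sample(minority, count)

minority = list(range(10))  # stand-ins for 10 minority class samples
picked = draw_for_oversampling(minority, 0.8)
print(len(picked))  # 8
```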

Furthermore, adding a certain proportion of noise to the samples prevents the model from overfitting on the minority class samples. Accordingly, when the extracted data is randomly perturbed, for each piece of extracted data, some of its variable values are replaced with a preset perturbation value according to a preset perturbation ratio.

For example, if a MASK ratio of 10% is set, then 10% of the variable values in each sample are randomly replaced with MASK. To facilitate subsequent model calculations, MASK is represented by a large number (such as 999).

As shown in Table 1, the tabular data has a total of 10 features (independent variables) and 1 target variable (Y). If the MASK ratio is set to 20%, 2 features are randomly selected and masked, that is, their original values are replaced with 999.

Table 1
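The masking step can be illustrated with a short sketch. The feature names x1 to x10 and their values are invented for illustration and do not reproduce Table 1; the function name mask_features is hypothetical. Only independent variables are eligible for masking, so the target variable Y is left untouched.

```python
import random

MASK = 999  # the large placeholder value mentioned in the text

def mask_features(sample, feature_names, mask_ratio, rng):
    """Return a copy of sample with a fraction of its features replaced by MASK."""
    n_masked = round(len(feature_names) * mask_ratio)
    perturbed = dict(sample)  # copy, so the original sample is preserved
    for name in rng.sample(feature_names, n_masked):
        perturbed[name] = MASK
    return perturbed

# Hypothetical sample with 10 features and a target Y.
features = [f"x{i}" for i in range(1, 11)]
sample = {name: i for i, name in enumerate(features)}
sample["Y"] = 1

rng = random.Random(0)
perturbed = mask_features(sample, features, 0.2, rng)
masked = [name for name in features if perturbed[name] == MASK]
print(len(masked))  # 2
```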

Step 105: Combine the undersampling results and the oversampling results to obtain equalized data.
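A minimal sketch of the combination step, using the running example of 100 undersampled majority samples, 10 original minority samples and 8 perturbed copies. The final shuffle is an assumption added so that the classes are interleaved for training; the text does not mention it, and the function name is hypothetical.

```python
import random

def merge_balanced(undersampled_majority, minority, perturbed_minority, seed=0):
    """Combine the undersampling result with the oversampling result."""
    # Oversampling result: the original minority data plus the perturbed copies.
    oversampled_minority = list(minority) + list(perturbed_minority)
    balanced = list(undersampled_majority) + oversampled_minority
    random.Random(seed).shuffle(balanced)  # assumption: shuffle before training
    return balanced

balanced = merge_balanced(["maj"] * 100, ["min"] * 10, ["min_masked"] * 8)
print(len(balanced))  # 118
```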

It should be added that the balanced data is used as a training data set to train the artificial intelligence model, and the generalization ability of the model is verified on the test set.

At this point, the data equalization process shown in Figure 1 is complete. When undersampling the majority class data, this solution samples according to the data proportion of each cluster after clustering, so that the extracted data is highly representative of the majority class and gives better results than random sampling. When oversampling the minority class data, this solution uses a random perturbation strategy to randomly perturb the extracted data, which avoids the overfitting in subsequent model training that is caused by simple repetition of minority class data. At the same time, random perturbation executes efficiently: perturbing tens of thousands of pieces of data can be completed within seconds. Using this solution to equalize large-scale, severely imbalanced tabular data has a significant effect and can significantly improve the recall and precision for the minority class.

Corresponding to the foregoing embodiments of the data equalization method, this application also provides an embodiment of a data equalization device.

Figure 3 is a schematic structural diagram of a data equalization device according to an exemplary embodiment of the present application. The device is used to perform the data equalization method provided in any of the above embodiments. As shown in Figure 3, the data equalization device includes:

The data processing module 310 is used to convert the variable values contained in each piece of data in the table into numerical values, where each piece of data includes an independent variable value and a target variable value;

The dividing module 320 is used to divide the data in the table into majority classes and minority classes according to the value of the target variable;

The under-sampling module 330 is used to cluster the data in the majority class, and perform under-sampling extraction on the majority class data according to the data proportion of each cluster to obtain the under-sampling result of the majority class;

The oversampling module 340 is configured to use a preset random perturbation strategy to oversample the data in the minority class to obtain the oversampling result of the minority class;

The merging module 350 is used to combine the undersampling results and the oversampling results to obtain equalized data.

In an optional implementation, the data processing module 310 is specifically used to fill in the missing variable values in each piece of data, and to encode the variable values contained in each piece of data to convert them into numerical values.

In an optional implementation, the target variable takes two values; the dividing module 320 is specifically used to count the number of pieces of data having each target variable value, compare the two counts, assign the data having the more frequent target variable value to the majority class, and assign the data having the less frequent target variable value to the minority class.

In an optional implementation, when undersampling the majority class data according to the data proportion of each cluster to obtain the undersampling result of the majority class, the under-sampling module 330 is specifically used to: determine the proportion of the majority class data that each cluster contains; determine the number of samples to extract by undersampling based on the preset equilibrium ratio and the total amount of data in the majority class; extract data from each cluster according to that number and the cluster's data proportion; and take the data extracted from all clusters as the undersampling result of the majority class.

In an optional implementation, the oversampling module 340 is specifically used to extract data from the minority class at a preset oversampling ratio; perform random perturbation processing on the extracted data using a preset random perturbation strategy; and determine the perturbed data together with the data in the minority class as the oversampling result of the minority class.

In an optional implementation, when performing random perturbation processing on the extracted data using the preset random perturbation strategy, the oversampling module 340 is specifically used to replace, for each piece of extracted data, some of its variable values with a preset perturbation value according to a preset perturbation ratio.

In an optional implementation, when clustering the data in the majority class, the under-sampling module 330 is specifically used to search for the number of clusters for the majority class data using the elbow rule, and to cluster the data in the majority class according to the number of clusters found.

For details on the implementation process of the functions and effects of each unit in the above device, please refer to the implementation process of the corresponding steps in the above method, and will not be described again here.

As for the device embodiment, since it basically corresponds to the method embodiment, please refer to the description of the method embodiment for relevant details. The device embodiments described above are only illustrative. The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this application. Persons of ordinary skill in the art can understand and implement it without creative effort.

An embodiment of the present application also provides an electronic device corresponding to the data equalization method provided in the previous embodiment to execute the above data equalization method.

Figure 4 is a hardware structure diagram of an electronic device according to an exemplary embodiment of the present application. The electronic device includes: a communication interface 601, a processor 602, a memory 603 and a bus 604, wherein the communication interface 601, the processor 602 and the memory 603 communicate with each other through the bus 604. The processor 602 can execute the above-described data equalization method by reading and executing machine-executable instructions in the memory 603 that correspond to the control logic of the data equalization method. For details of the method, refer to the above embodiments; they are not repeated here.

The memory 603 mentioned in this application can be any electronic, magnetic, optical or other physical storage device, and can contain stored information such as executable instructions and data. Specifically, the memory 603 can be RAM (Random Access Memory), flash memory, a storage drive (such as a hard drive), any type of storage disk (such as an optical disk or DVD), a similar storage medium, or a combination thereof. The communication connection between the system network element and at least one other network element is realized through at least one communication interface 601 (which can be wired or wireless), and the Internet, a wide area network, a local area network, a metropolitan area network, etc. can be used.

The bus 604 may be an ISA bus, a PCI bus, an EISA bus, etc. The bus can be divided into address bus, data bus, control bus, etc. The memory 603 is used to store a program, and the processor 602 executes the program after receiving the execution instruction.

The processor 602 may be an integrated circuit chip with signal processing capabilities. During implementation, each step of the above method can be completed by an integrated logic circuit of hardware in the processor 602 or by instructions in the form of software. The processor 602 can be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it can also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It can implement or execute each method, step and logical block diagram disclosed in the embodiments of this application. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in conjunction with the embodiments of the present application can be directly executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.

The electronic device provided by the embodiments of the present application and the data equalization method provided by the embodiments of the present application are based on the same inventive concept, and have the same beneficial effects as the methods adopted, run or implemented.

The embodiment of the present application also provides a computer-readable storage medium corresponding to the data equalization method provided by the previous embodiment. Please refer to FIG. 5. The computer-readable storage medium shown is an optical disk 30, on which is stored There is a computer program (ie, a program product). When the computer program is run by a processor, the computer program will execute the data equalization method provided by any of the foregoing embodiments. The computer-readable storage medium may be non-volatile or volatile.

It should be noted that examples of the computer-readable storage medium may also include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory, or other optical and magnetic storage media, which will not be described in detail here.

The computer-readable storage medium provided by the above embodiments of the present application is based on the same inventive concept as the data equalization method provided by the embodiments of the present application, and has the same beneficial effects as the method adopted, run or implemented by the application program stored therein.

Other embodiments of the present application will be readily apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of this application that follow the general principles of this application and include common knowledge or customary technical means in the technical field that are not disclosed in this application. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the application being indicated by the following claims.

It should also be noted that the terms "comprises," "includes" or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that includes a list of elements includes not only those elements but also other elements that are not expressly listed or that are inherent to the process, method, article, or apparatus. Without further limitation, an element defined by the statement "comprises a..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that includes the stated element.

The above are only preferred embodiments of the present application and are not intended to limit the present application. Any modifications, equivalent substitutions, improvements, etc. made within the spirit and principles of the present application shall be included within the scope of protection of the present application.

Claims

A data equalization method, wherein the method includes:

Convert the variable values contained in each piece of data in the table into numerical values, wherein each piece of data includes an independent variable value and a target variable value;

Divide the data in the table into majority and minority classes according to the value of the target variable;

Cluster the data in the majority class, and perform under-sampling extraction on the majority class data according to the data proportion of each cluster to obtain the under-sampling result of the majority class;

Use a preset random perturbation strategy to oversample the data in the minority class to obtain the oversampling result of the minority class;

The undersampling result and the oversampling result are combined to obtain equalized data.
The method according to claim 1, wherein converting the variable value contained in each piece of data in the table into a numerical value includes:

Fill in the missing variable values in each piece of data;

Data encoding is performed on the variable values contained in each piece of data to convert them into numerical values.
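As an illustrative, non-limiting sketch of the two steps of this claim, the following assumes that the table is a list of dictionaries, that missing values are marked with None, that numeric columns are filled with the column mean and other columns with the most frequent value, and that encoding is a simple per-column integer (label) encoding. The helper names `fill_missing` and `label_encode` and these filling and encoding choices are hypothetical; the claim does not fix any of them.

```python
from collections import Counter

def fill_missing(rows, columns):
    """Fill None entries (assumed rule: mean for numeric columns, mode otherwise)."""
    filled = [dict(r) for r in rows]
    for col in columns:
        present = [r[col] for r in rows if r[col] is not None]
        if not present:
            continue  # nothing to infer a fill value from
        if all(isinstance(v, (int, float)) for v in present):
            default = sum(present) / len(present)
        else:
            default = Counter(present).most_common(1)[0][0]
        for r in filled:
            if r[col] is None:
                r[col] = default
    return filled

def label_encode(rows, columns):
    """Replace each non-numeric value by an integer code (sorted order, per column)."""
    encoded = [dict(r) for r in rows]
    for col in columns:
        values = [r[col] for r in rows]
        if all(isinstance(v, (int, float)) for v in values):
            continue  # already numerical
        codes = {v: i for i, v in enumerate(sorted(set(values)))}
        for r in encoded:
            r[col] = codes[r[col]]
    return encoded

# Hypothetical example table: "age" and "city" are independent variables,
# "label" is the target variable.
raw = [
    {"age": 30, "city": "Shenzhen", "label": "yes"},
    {"age": None, "city": "Beijing", "label": "no"},
    {"age": 50, "city": None, "label": "no"},
    {"age": 40, "city": "Beijing", "label": "no"},
]
cols = ["age", "city", "label"]
table = label_encode(fill_missing(raw, cols), cols)
```

Filling is done before encoding so that the encoder never sees a missing marker and every variable value ends up numerical.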
The method according to claim 1, wherein the target variable has two values;

Dividing the data in the table into majority and minority classes according to the value of the target variable includes:

Count the number of pieces of data having each target variable value and compare the two counts;

Assign the data having the target variable value with the larger count to the majority class;

Assign the data having the target variable value with the smaller count to the minority class.
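A minimal sketch of this division, assuming the table is a list of dictionaries and the target variable is stored under a known key, could look as follows. The function name `split_by_target` is hypothetical, and the behaviour when both counts are equal (the first value encountered is treated as the majority) is an assumption of the sketch rather than something the claim specifies.

```python
from collections import Counter

def split_by_target(rows, target):
    """Split rows into (majority, minority) by counting the two target values."""
    counts = Counter(r[target] for r in rows)
    if len(counts) != 2:
        raise ValueError("this sketch expects exactly two target variable values")
    # most_common sorts by count, largest first
    (major_value, _), (minor_value, _) = counts.most_common(2)
    majority = [r for r in rows if r[target] == major_value]
    minority = [r for r in rows if r[target] == minor_value]
    return majority, minority

# Hypothetical imbalanced table: eight rows with y == 0, two rows with y == 1.
rows = [{"x": i, "y": 0} for i in range(8)] + [{"x": i, "y": 1} for i in range(2)]
majority, minority = split_by_target(rows, "y")
```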
The method according to claim 1, wherein the majority class data is under-sampled according to the data proportion of each cluster to obtain the under-sampling result of the majority class, including:

Determine the proportion of data contained in each cluster in the majority class;

Determine the under-sampling extraction quantity based on the preset equalization ratio and the total amount of data in the majority class;

Extract data from the corresponding cluster using the under-sampling extraction quantity and the data proportion of each cluster;

The data extracted from each cluster is determined to be the undersampled result of the majority class.
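The steps of this claim can be sketched as below, under stated assumptions: cluster labels for the majority class are taken as already computed, the extraction quantity is assumed to be the equalization ratio multiplied by the total amount of majority class data, and each cluster's quota is that quantity multiplied by the cluster's data proportion, rounded and capped at the cluster size. The function name `undersample_by_cluster` and these formulas are illustrative assumptions; the claim states only which inputs determine the quantities.

```python
import random
from collections import defaultdict

def undersample_by_cluster(majority, labels, equalization_ratio, seed=0):
    """Draw from each cluster in proportion to its share of the majority class."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for row, label in zip(majority, labels):
        clusters[label].append(row)

    total = len(majority)
    quantity = round(equalization_ratio * total)  # assumed formula

    sampled = []
    for members in clusters.values():
        proportion = len(members) / total
        quota = min(len(members), round(quantity * proportion))
        sampled.extend(rng.sample(members, quota))  # without replacement
    return sampled

# Hypothetical majority class of 100 items in three clusters of 60, 30 and 10.
majority = list(range(100))
labels = [0] * 60 + [1] * 30 + [2] * 10
sampled = undersample_by_cluster(majority, labels, equalization_ratio=0.5)
```

Because every cluster contributes according to its proportion, small clusters are still represented in the under-sampling result instead of being lost to purely random extraction.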
The method according to claim 1, wherein a preset random perturbation strategy is used to oversample the data in the minority class to obtain the oversampling result of the minority class, including:

Extract data from the minority class at a preset oversampling ratio;

Use a preset random perturbation strategy to randomly perturb the extracted data;

The perturbed data and the data in the minority class are determined as the oversampling results of the minority class.
The method according to claim 5, wherein a preset random perturbation strategy is used to perform random perturbation processing on the extracted data, including:

For each piece of data extracted, some variable values in the piece of data are replaced with preset perturbation values according to a preset perturbation ratio.
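As a hedged illustration of the oversampling steps and the perturbation strategy described in the two preceding claims, the sketch below assumes that extraction draws rows with replacement (so that an oversampling ratio above 1 is possible), that the perturbation ratio is the fraction of independent variables replaced in each extracted row, and that the preset perturbation values are supplied per column. The function name `oversample_with_perturbation` and its parameters are hypothetical.

```python
import random

def oversample_with_perturbation(minority, feature_cols, oversample_ratio,
                                 perturb_ratio, perturb_values, seed=0):
    """Return the original minority rows plus perturbed copies of extracted rows."""
    rng = random.Random(seed)
    n_extract = round(oversample_ratio * len(minority))  # assumed formula
    # Copy each extracted row so that the original data is left untouched.
    extracted = [dict(rng.choice(minority)) for _ in range(n_extract)]

    # Number of independent variables to replace in each extracted row.
    n_perturb = min(len(feature_cols), max(1, round(perturb_ratio * len(feature_cols))))
    for row in extracted:
        for col in rng.sample(feature_cols, n_perturb):
            row[col] = perturb_values[col]  # the target variable is never touched
    return minority + extracted

# Hypothetical minority class with four independent variables and target "y".
minority = [
    {"a": 1, "b": 2, "c": 3, "d": 4, "y": 1},
    {"a": 5, "b": 6, "c": 7, "d": 8, "y": 1},
    {"a": 9, "b": 10, "c": 11, "d": 12, "y": 1},
]
features = ["a", "b", "c", "d"]
balanced_minority = oversample_with_perturbation(
    minority, features, oversample_ratio=2.0, perturb_ratio=0.5,
    perturb_values={"a": -1, "b": -1, "c": -1, "d": -1},
)
```

Each added row differs from its source row in a few variable values, so the oversampling result is not a simple repetition of the minority class data.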
The method of claim 1, wherein clustering data in the majority class includes:

Perform a cluster number search on the data in the majority class using the elbow method;

The data in the majority class is clustered according to the number of clusters found by the search.
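One way to picture the cluster number search is the following self-contained sketch. It assumes k-means as the clustering algorithm, a deterministic farthest-first initialisation, and an elbow criterion that picks the cluster number at which the within-cluster sum of squared distances bends most sharply (largest second difference). All three are assumptions made for illustration; the claim names only the elbow method, and the function names `kmeans` and `elbow_k` are hypothetical.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=50):
    """Plain Lloyd iterations; returns (labels, within-cluster sum of squares)."""
    # Farthest-first initialisation (assumed here to keep the sketch deterministic).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))

    def assign():
        return [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]

    for _ in range(iters):
        labels = assign()
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    labels = assign()
    inertia = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    return labels, inertia

def elbow_k(points, k_max):
    """Pick k in 2..k_max-1 with the largest second difference of the inertia curve."""
    inertias = [kmeans(points, k)[1] for k in range(1, k_max + 1)]
    bends = [inertias[i] - 2 * inertias[i + 1] + inertias[i + 2]
             for i in range(len(inertias) - 2)]
    return bends.index(max(bends)) + 2  # bends[0] corresponds to k = 2

# Hypothetical majority class data: three well-separated groups of four points.
points = [(x + dx, y + dy)
          for (x, y) in [(0, 0), (10, 0), (0, 10)]
          for dx in (0, 1) for dy in (0, 1)]
best_k = elbow_k(points, k_max=5)
cluster_labels, _ = kmeans(points, best_k)
```

The labels obtained with the selected cluster number can then serve as the per-cluster grouping used for proportional under-sampling.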
A data equalization device, wherein the device includes:

A data processing module, used to convert the variable values contained in each piece of data in the table into numerical values, wherein each piece of data includes an independent variable value and a target variable value;

A division module, used to divide the data in the table into majority classes and minority classes according to the value of the target variable;

An under-sampling module, used to cluster the data in the majority class, and perform under-sampling extraction of the majority class data according to the data proportion of each cluster to obtain the under-sampling result of the majority class;

An oversampling module, used to oversample the data in the minority class using a preset random perturbation strategy to obtain the oversampling result of the minority class;

A merging module, used to combine the undersampling results and the oversampling results to obtain equalized data.
An electronic device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the program:

Convert the variable values contained in each piece of data in the table into numerical values, wherein each piece of data includes an independent variable value and a target variable value;

Divide the data in the table into majority and minority classes according to the value of the target variable;

Cluster the data in the majority class, and perform under-sampling extraction on the majority class data according to the data proportion of each cluster to obtain the under-sampling result of the majority class;

Use a preset random perturbation strategy to oversample the data in the minority class to obtain the oversampling result of the minority class;

The undersampling result and the oversampling result are combined to obtain equalized data.
The electronic device according to claim 9, wherein said converting the variable value contained in each piece of data in the table into a numerical value includes:

Fill in the missing variable values in each piece of data;

Data encoding is performed on the variable values contained in each piece of data to convert them into numerical values.
The electronic device according to claim 9, wherein the target variable has two values;

Dividing the data in the table into majority and minority classes according to the value of the target variable includes:

Count the number of pieces of data having each target variable value and compare the two counts;

Assign the data having the target variable value with the larger count to the majority class;

Assign the data having the target variable value with the smaller count to the minority class.
The electronic device according to claim 9, wherein the majority class data is under-sampled according to the data proportion of each cluster to obtain the under-sampling result of the majority class, including:

Determine the proportion of data contained in each cluster in the majority class;

Determine the under-sampling extraction quantity based on the preset equalization ratio and the total amount of data in the majority class;

Extract data from the corresponding cluster using the under-sampling extraction quantity and the data proportion of each cluster;

The data extracted from each cluster is determined to be the undersampled result of the majority class.
The electronic device according to claim 9, wherein a preset random perturbation strategy is used to oversample the data in the minority class to obtain an oversampling result of the minority class, including:

Extract data from the minority class at a preset oversampling ratio;

Use a preset random perturbation strategy to randomly perturb the extracted data;

The perturbed data and the data in the minority class are determined as the oversampling results of the minority class.
The electronic device according to claim 13, wherein a preset random perturbation strategy is used to perform random perturbation processing on the extracted data, including:

For each piece of data extracted, some variable values in the piece of data are replaced with preset perturbation values according to a preset perturbation ratio.
A computer-readable storage medium on which a computer program is stored, wherein the following steps are implemented when the program is executed by a processor:

Convert the variable values contained in each piece of data in the table into numerical values, wherein each piece of data includes an independent variable value and a target variable value;

Divide the data in the table into majority and minority classes according to the value of the target variable;

Cluster the data in the majority class, and perform under-sampling extraction on the majority class data according to the data proportion of each cluster to obtain the under-sampling result of the majority class;

Use a preset random perturbation strategy to oversample the data in the minority class to obtain the oversampling result of the minority class;

The undersampling result and the oversampling result are combined to obtain equalized data.
The computer-readable storage medium according to claim 15, wherein said converting variable values contained in each piece of data in the table into numerical values includes:

Fill in the missing variable values in each piece of data;

Data encoding is performed on the variable values contained in each piece of data to convert them into numerical values.
The computer-readable storage medium according to claim 15, wherein the target variable has two values;

Dividing the data in the table into majority and minority classes according to the value of the target variable includes:

Count the number of pieces of data having each target variable value and compare the two counts;

Assign the data having the target variable value with the larger count to the majority class;

Assign the data having the target variable value with the smaller count to the minority class.
The computer-readable storage medium according to claim 15, wherein the majority class data is under-sampled according to the data proportion of each cluster to obtain the under-sampling result of the majority class, including:

Determine the proportion of data contained in each cluster in the majority class;

Determine the under-sampling extraction quantity based on the preset equalization ratio and the total amount of data in the majority class;

Extract data from the corresponding cluster using the under-sampling extraction quantity and the data proportion of each cluster;

The data extracted from each cluster is determined to be the undersampled result of the majority class.
The computer-readable storage medium according to claim 15, wherein a preset random perturbation strategy is used to oversample the data in the minority class to obtain the oversampling result of the minority class, including:

Extract data from the minority class at a preset oversampling ratio;

Use a preset random perturbation strategy to randomly perturb the extracted data;

The perturbed data and the data in the minority class are determined as the oversampling results of the minority class.
The computer-readable storage medium according to claim 19, wherein a preset random perturbation strategy is used to perform random perturbation processing on the extracted data, including:

For each piece of data extracted, some variable values in the piece of data are replaced with preset perturbation values according to a preset perturbation ratio.