CN112579711A

CN112579711A - Method and device for classifying unbalanced data, storage medium and equipment

Info

Publication number: CN112579711A
Application number: CN202011584448.4A
Authority: CN
Inventors: 张显聪; 杨珏; 范旭娟; 陈雁; 何锦强; 廖永力; 朱登杰
Original assignee: China South Power Grid International Co ltd; Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Current assignee: China South Power Grid International Co ltd; Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority date: 2020-12-28
Filing date: 2020-12-28
Publication date: 2021-03-30
Anticipated expiration: 2040-12-28
Also published as: CN112579711B

Abstract

The invention relates to the technical field of machine learning and discloses a method, a device, a storage medium and equipment for classifying unbalanced data. The method comprises: obtaining an unbalanced data set; calculating a support vector set of the unbalanced data set through an SVM algorithm; calculating a first distance from each sample in the majority class set to each support vector in the support vector set; calculating a sample position statistic from the first distance; calculating a class bit statistic from the sample position statistic; and down-sampling the majority class set according to the class bit statistic to obtain the down-sampled majority class set. The method, device, storage medium and equipment measure the local density information of the data samples by using the distance between the data samples and the support vectors, consider the degree of imbalance of the data from the perspective of distribution, and improve the accuracy of classifying unbalanced data.

Description

Method and device for classifying unbalanced data, storage medium and equipment

Technical Field

The present invention relates to the field of machine learning technologies, and in particular, to a method, an apparatus, a storage medium, and a device for classifying unbalanced data.

Background

Solutions to the unbalanced data classification problem fall mainly into two levels: the data level and the algorithm level. Data-level methods comprise up-sampling and down-sampling, which reduce the degree of imbalance and improve the classification effect by changing the data distribution; algorithm-level methods improve classification accuracy by analyzing the defects of existing algorithms in processing unbalanced data or by proposing new algorithms, such as cost-sensitive learning and ensemble learning.

At present, much research on unbalanced data uses improved models based on the SMOTE (Synthetic Minority Oversampling Technique) algorithm. However, the SMOTE algorithm causes the generated minority class samples to overlap, because new samples are generated randomly from each minority class sample and the distribution characteristics of adjacent samples are ignored. Among the improved variants, SVMSMOTE sampling generates new samples based on the hyperplane of the SVM, and BorderlineSMOTE sampling generates samples near the minority class boundary points. These sampling methods reduce the degree of imbalance between classes by increasing the number of minority class samples.

Down-sampling reduces the number of majority class samples to reduce the degree of imbalance between classes, using some index to select from the majority class a number of samples comparable to the number of minority class samples. However, current methods cannot reliably ensure that only redundant samples and noise samples are removed, so the classification accuracy is low.

Therefore, when solving the problem of unbalanced classification at the data level, it is important to measure the degree of imbalance of data distribution between the minority class and the majority class, and how to increase effective minority class sample data and delete redundant majority class sample data has important research value.

Disclosure of Invention

The technical problem to be solved by the embodiments of the invention is to provide a method, a device, a storage medium and equipment for classifying unbalanced data that measure the local density information of the data samples by using the distance between the data samples and the support vectors, consider the degree of imbalance of the data from the perspective of distribution, and improve the accuracy of classifying unbalanced data.

In order to solve the above technical problem, in a first aspect, an embodiment of the present invention provides a method for classifying unbalanced data, where the method includes:

acquiring an unbalanced data set; wherein the unbalanced data set comprises a majority class set and a minority class set;

calculating a set of support vectors of the unbalanced data set through an SVM algorithm;

calculating a first distance from each sample in the majority class set to each support vector in the set of support vectors;

calculating a sample location statistic from the first distance;

calculating a class bit statistic from the sample position statistic;

and performing downsampling on the majority class set according to the class bit statistics to obtain the downsampled majority class set.

As a preferable scheme, the calculation formula of the first distance is as follows:

$$d(x_i, s_j) = \sqrt{\sum_{l=1}^{n} (x_{il} - s_{jl})^2}, \quad x_i \in c_{-1} \tag{1}$$

In formula (1), $c_{-1}$ is the majority class set, $x_i \in c_{-1}$ denotes that $x_i$ is an element of $c_{-1}$, $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ is the i-th sample in the majority class set, $s_j = (s_{j1}, s_{j2}, \ldots, s_{jn})$ is the j-th support vector in the set of support vectors, and $d(x_i, s_j)$ is the first distance between the i-th sample in the majority class set and the j-th support vector in the set of support vectors.

As a preferred solution, the calculation formula of the sample position statistic is as follows:

In formula (2), $Q_k(s)$ is the set of the k nearest support vectors of $x_i$, $s_j \in Q_k(s)$ denotes that $s_j$ is an element of $Q_k(s)$, and $\mu_k(x_i)$ is the sample position statistic of the i-th sample in the majority class set.
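The expression of formula (2) is not reproduced in this text. One reading that is consistent with the symbols defined above, offered here as an assumption rather than as the patent's exact expression, is the mean first distance from $x_i$ to its $k$ nearest support vectors:

```latex
\mu_k(x_i) = \frac{1}{k} \sum_{s_j \in Q_k(s)} d(x_i, s_j), \qquad x_i \in c_{-1}
```

Under this reading, a small $\mu_k(x_i)$ means that support vectors lie close to $x_i$, which matches the later statement that a higher density near $x_i$ gives a smaller $\mu_k(x_i)$.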

As a preferred scheme, the calculation formula of the class bit statistic is as follows:

In formula (3), $w_i$ is the class bit statistic of the i-th sample in the majority class set.
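The expression of formula (3) is likewise not reproduced in this text. Since $w_i$ is described elsewhere in the document as the share of $x_i$ in the total density of $c_{-1}$, one plausible form, stated here as an assumption, normalizes the sample position statistic over the majority class set:

```latex
w_i = \frac{\mu_k(x_i)}{\sum_{x_l \in c_{-1}} \mu_k(x_l)}, \qquad \sum_{x_i \in c_{-1}} w_i = 1
```

With this normalization, $w_i$ and $\mu_k(x_i)$ rise and fall together, so sorting by $w_i$ orders the majority samples by how sparse the support vectors around them are.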

As a preferred scheme, the downsampling the majority class set according to the class bit statistics specifically includes:

sorting the samples in the majority class set by the value of $w_i$ and dividing them into two parts at the median;

for the part with smaller $w_i$, down-sampling $(m_{-1} - m_1)\,m_1/m_{-1}$ samples, where $m_1$ is the number of samples in the minority class set and $m_{-1}$ is the number of samples in the majority class set;

for the part with larger $w_i$, down-sampling $(m_{-1} - m_1)(1 - m_1/m_{-1})$ samples.
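As a worked check of the two counts, assume that "down-sampling" a number of samples means removing that many samples from the corresponding part (the text does not say so explicitly). The two counts then always sum to $m_{-1} - m_1$, so exactly $m_1$ majority samples remain. The values $m_{-1} = 200$ and $m_1 = 80$ below are illustrative and are not taken from the patent's data sets.

```latex
\begin{align*}
(m_{-1}-m_1)\frac{m_1}{m_{-1}} + (m_{-1}-m_1)\Bigl(1-\frac{m_1}{m_{-1}}\Bigr) &= m_{-1}-m_1 \\
\text{smaller } w_i \text{ part: } (200-80)\cdot\tfrac{80}{200} &= 48 \\
\text{larger } w_i \text{ part: } (200-80)\cdot\bigl(1-\tfrac{80}{200}\bigr) &= 72 \\
\text{remaining majority samples: } 200 - 48 - 72 &= 80 = m_1
\end{align*}
```

Fewer samples are removed from the part with smaller $w_i$ whenever $m_1/m_{-1} < 1/2$, which agrees with the intent of preserving samples that lie in regions dense with support vectors.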

As a preferable aspect, the method further includes:

the classification results are evaluated by means of a confusion matrix.

In order to solve the above technical problem, in a second aspect, an embodiment of the present invention provides an apparatus for classifying unbalanced data, where the apparatus includes:

the data acquisition module is used for acquiring the unbalanced data set; wherein the unbalanced data set comprises a majority class set and a minority class set;

the support vector calculation module is used for calculating a support vector set of the unbalanced data set through an SVM algorithm;

a first distance calculation module, configured to calculate a first distance from each sample in the majority class set to each support vector in the set of support vectors;

a sample position statistic calculation module for calculating a sample position statistic from the first distance;

the class bit statistic calculation module is used for calculating class bit statistic according to the sample position statistic;

and the down-sampling module is used for carrying out down-sampling on the majority class set according to the class bit statistic to obtain the down-sampled majority class set.

As a preferable aspect, the apparatus further includes:

and the evaluation module is used for evaluating the classification result through the confusion matrix.

In order to solve the above technical problem, in a third aspect, an embodiment of the present invention provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed, implements the method for classifying imbalance data according to any one of the first aspect.

In order to solve the technical problem described above, in a fourth aspect, an embodiment of the present invention provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and configured to be executed by the processor, and when the computer program is executed by the processor, the terminal device implements the method for classifying unbalanced data according to any one of the first aspect.

Compared with the prior art, the method, device, storage medium and equipment for classifying unbalanced data provided by the embodiments of the invention have the following advantages. First, an unbalanced data set is acquired; next, a support vector set of the unbalanced data set is calculated through an SVM algorithm; then a first distance from each sample in the majority class set to each support vector in the support vector set is calculated; a sample position statistic is calculated from the first distance; a class bit statistic is calculated from the sample position statistic; and finally the majority class set is down-sampled according to the class bit statistic to obtain the down-sampled majority class set. The local density information of the data samples is measured by using the distance between the data samples and the support vectors, the degree of imbalance of the data is considered from the perspective of distribution, and support-vector-based down-sampling is applied to the unbalanced classification data, so that redundant majority class samples can be effectively deleted and the accuracy of unbalanced data classification is improved.

Drawings

In order to illustrate the technical features of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings based on them without inventive labor.

FIG. 1 is a flow chart of a preferred embodiment of a method for classifying imbalance data according to the present invention;

FIG. 2 is a schematic structural diagram of a preferred embodiment of an apparatus for classifying imbalance data according to the present invention;

FIG. 3 is a schematic structural diagram of a preferred embodiment of a terminal device provided in the present invention.

Detailed Description

In order to clearly understand the technical features, objects and effects of the present invention, the following detailed description of the embodiments of the present invention is provided with reference to the accompanying drawings and examples. The following examples are intended to illustrate the invention, but are not intended to limit the scope of the invention. Other embodiments, which can be derived by those skilled in the art from the embodiments of the present invention without inventive step, shall fall within the scope of the present invention.

In the description of the present invention, it should be understood that the numbers themselves, such as "first", "second", etc., are used only for distinguishing the described objects, do not have a sequential or technical meaning, and cannot be understood as defining or implying the importance of the described objects.

It should be noted that, with the rapid development of China's economy, society's demand for electric power keeps increasing. Correctly classifying common faults is of great significance for ensuring the normal operation of transmission lines. Common transmission line faults are collected using technologies such as the electric power Internet of Things and unmanned aerial vehicle information collection systems. The data show that, because transmission line faults occur by chance, defect data are limited and the various fault types are uncertain, so the amounts of data for the various faults are unbalanced. In view of the characteristics of transmission line faults and the defects of existing fault classification methods, research on more effective transmission line fault classification technology has important value.

Fig. 1 is a schematic flow chart of a classification method of unbalanced data according to a preferred embodiment of the present invention.

As shown in fig. 1, the method comprises the steps of:

s11: acquiring an unbalanced data set; wherein the unbalanced data set comprises a majority class set and a minority class set;

s12: calculating a set of support vectors of the unbalanced data set through an SVM algorithm;

s13: calculating a first distance from each sample in the majority class set to each support vector in the set of support vectors;

s14: calculating a sample location statistic from the first distance;

s15: calculating a class bit statistic from the sample position statistic;

s16: and performing downsampling on the majority class set according to the class bit statistics to obtain the downsampled majority class set.

Specifically, in the embodiment of the present invention, the classification method is applied to transmission line fault classification.

As an example, a bird nest data set and an insulator data set are selected. The positive class of the bird nest data set is data with bird nests, and the negative class is data without bird nests; the positive class of the insulator data set is insulator damage data, and the negative class is normal insulator data.

First, an unbalanced data set $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ in a feature space is obtained, where $x_i \in \mathbb{R}^n$ is the i-th sample, any sample in the unbalanced data set T is denoted $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$, and $y_i \in \{+1, -1\}$, $i = 1, 2, \ldots, N$; $y_i = +1$ denotes the positive class and $y_i = -1$ denotes the negative class. The data set T is divided by label into a minority class set $c_1$ and a majority class set $c_{-1}$, whose sample counts are denoted $m_1$ and $m_{-1}$ respectively, and the following is initialized: [initialization expression not reproduced]

The positive and negative classification tables of the data set are shown in table 1.

TABLE 1

Then, a support vector set S of the data set T is calculated through an SVM algorithm, where $|S|$ denotes the number of support vectors and any support vector is denoted $s_j = (s_{j1}, s_{j2}, \ldots, s_{jn})$.

Second, the first distance from each sample $x_i$ in the majority class set $c_{-1}$ to each support vector $s_j$ in the support vector set S is calculated, so that $|S|$ distances are calculated for each sample $x_i$. In this embodiment, the Euclidean distance is used as the first distance, that is, the first distance is calculated as:

$$d(x_i, s_j) = \sqrt{\sum_{l=1}^{n} (x_{il} - s_{jl})^2}, \quad x_i \in c_{-1} \tag{1}$$
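The distance step can be sketched as follows. This is a minimal illustration that assumes samples and support vectors are given as equal-length tuples of floats; the function names `euclidean` and `first_distances` and the example points are hypothetical and do not come from the patent.

```python
import math

def euclidean(x, s):
    """Euclidean distance between a sample x and a support vector s."""
    if len(x) != len(s):
        raise ValueError("x and s must have the same dimension n")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, s)))

def first_distances(majority, support_vectors):
    """Return one row per majority sample; row i holds d(x_i, s_j) for every s_j."""
    return [[euclidean(x, s) for s in support_vectors] for x in majority]

# Illustrative points only (not from the patent's data sets).
majority = [(0.0, 0.0), (3.0, 4.0)]
support_vectors = [(0.0, 0.0), (6.0, 8.0)]
dist = first_distances(majority, support_vectors)
print(dist)  # [[0.0, 10.0], [5.0, 5.0]]
```

Each row has $|S|$ entries, matching the statement that $|S|$ distances are calculated per sample.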

Next, a sample position statistic is calculated from the first distance. For the i-th sample $x_i$ of a given data set, the embodiment of the invention defines a support-vector-based sample position statistic, which is calculated as follows:

In formula (2), $Q_k(s)$ is the set of the k nearest support vectors of $x_i$, $s_j \in Q_k(s)$ denotes that $s_j$ is an element of $Q_k(s)$, and $\mu_k(x_i)$ is the sample position statistic of the i-th sample in the majority class set; $\mu_k(x_i)$ expresses the density information near the i-th sample $x_i$.
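Because the expression of formula (2) does not survive in this text, the sketch below assumes one plausible form: the mean of the first distances from $x_i$ to its k nearest support vectors. Both that form and the function name `sample_position_statistic` are assumptions, not the patent's stated definition.

```python
def sample_position_statistic(distances_to_svs, k):
    """Mean distance to the k nearest support vectors (assumed form of formula (2))."""
    if not 1 <= k <= len(distances_to_svs):
        raise ValueError("k must be between 1 and |S|")
    # The k smallest distances correspond to the neighbor set Q_k(s).
    nearest = sorted(distances_to_svs)[:k]
    return sum(nearest) / k

# One row of first distances for a single majority sample (illustrative values).
row = [5.0, 1.0, 3.0, 9.0]
mu = sample_position_statistic(row, k=2)
print(mu)  # 2.0
```

A smaller value means the nearest support vectors are close to the sample, which is how the text describes a higher nearby density.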

Then, a class bit statistic is calculated from the sample position statistic according to the following formula:

In formula (3), $w_i$ is the class bit statistic of the i-th sample in the majority class set; $w_i$ is the share of $x_i$ in the total density of $c_{-1}$ and reflects the within-class density of $x_i$. The higher the density near $x_i$, the smaller $w_i$ and $\mu_k(x_i)$.
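The expression of formula (3) is also missing from this text. The sketch below assumes that $w_i$ is $\mu_k(x_i)$ divided by the sum of $\mu_k$ over the majority class set, which fits the description of $w_i$ as a share of the total; this normalization and the name `class_bit_statistics` are assumptions.

```python
def class_bit_statistics(mu_values):
    """Normalize sample position statistics so they sum to 1 (assumed form of formula (3))."""
    total = sum(mu_values)
    if total <= 0:
        raise ValueError("the sum of the sample position statistics must be positive")
    return [mu / total for mu in mu_values]

# Illustrative mu_k values for three majority samples.
w = class_bit_statistics([2.0, 6.0, 2.0])
```

Here the second sample receives the largest $w_i$, marking it as the one with the sparsest support vectors around it.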

Finally, the majority class set is down-sampled according to the class bit statistic to obtain the down-sampled majority class set, which specifically comprises the following steps:

sorting the samples in the majority class set by the value of $w_i$ and dividing them into two parts at the median;

for the part with smaller $w_i$, down-sampling $(m_{-1} - m_1)\,m_1/m_{-1}$ samples;

for the part with larger $w_i$, down-sampling $(m_{-1} - m_1)(1 - m_1/m_{-1})$ samples.

Here, the part with smaller $w_i$ indicates that the support vectors around the sample are denser; excessive down-sampling in this region would reduce the diversity of the majority class samples and increase the risk of misclassifying them. Conversely, the part with larger $w_i$ indicates that the support vectors around the sample are sparse; appropriate down-sampling in this region effectively reduces the number of majority class samples and has little influence on the classification accuracy.
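The median split and the two counts can be sketched as below. Several details are assumptions because the text does not specify them: "down-sampling" a count is read as removing that many samples, counts are rounded to whole samples, removal within each part is random, and each count is clamped to the size of its part. The function names are hypothetical.

```python
import random

def removal_counts(m_major, m_minor):
    """Samples to remove from the smaller-w part and the larger-w part."""
    if not 0 < m_minor <= m_major:
        raise ValueError("expected 0 < m_minor <= m_major")
    excess = m_major - m_minor            # m_-1 - m_1
    ratio = m_minor / m_major             # m_1 / m_-1
    # Rounding to whole samples is an assumption; the source gives no rule.
    return round(excess * ratio), round(excess * (1 - ratio))

def downsample_majority(majority, w, m_minor, seed=0):
    """Sort by w_i, split at the median, and remove samples from each part."""
    if len(majority) != len(w):
        raise ValueError("one w_i is required per majority sample")
    order = sorted(range(len(majority)), key=lambda i: w[i])
    half = len(order) // 2
    low, high = order[:half], order[half:]   # smaller-w part, larger-w part
    n_low, n_high = removal_counts(len(majority), m_minor)
    rng = random.Random(seed)
    # Random choice within a part and clamping to its size are assumptions.
    drop = set(rng.sample(low, min(n_low, len(low))))
    drop.update(rng.sample(high, min(n_high, len(high))))
    return [x for i, x in enumerate(majority) if i not in drop]

# Illustrative data: 200 majority samples whose w_i grows with the index.
majority = list(range(200))
w = [float(i) for i in range(200)]
kept = downsample_majority(majority, w, m_minor=80)
print(len(kept))  # 80
```

In this example 48 samples are removed from the smaller-$w_i$ part and 72 from the larger-$w_i$ part, leaving 80 majority samples, equal to the assumed minority count.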

In a preferred embodiment, the method further comprises:

s17: the classification results are evaluated by means of a confusion matrix.

Specifically, the samples that are actually positive are divided into those predicted positive and those predicted negative, and the samples that are actually negative are divided into those predicted negative and those predicted positive. The corresponding sample counts are abbreviated as TP, FN, TN and FP respectively, and the confusion matrix is shown in Table 2.

TABLE 2

The classification effect was evaluated by the following indices:

$$\text{Accuracy} = \frac{TP + TN}{TP + FN + TN + FP}$$

represents the classifier's ability to classify the entire sample set;

$$\text{Precision} = \frac{TP}{TP + FP}$$

represents the precision of the classifier;

$$\text{Recall} = \frac{TP}{TP + FN}$$

represents the recall of the classifier, that is, the accuracy on the minority class samples;

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

considers both the precision and the recall of the minority class.

The evaluation indexes have different emphases, and the larger the value of each index, the better the classification effect. The unbalanced classification problem focuses on $F_1$: the larger the $F_1$ value, the better the classification performance on minority class samples.
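The four indexes can be computed from the confusion matrix counts as sketched below, using the standard definitions of accuracy, precision, recall and $F_1$. The function name `evaluate`, the zero-division fallbacks and the example counts are assumptions for illustration and are not results from the patent.

```python
def evaluate(tp, fn, tn, fp):
    """Accuracy, precision, recall and F1 from confusion matrix counts."""
    total = tp + fn + tn + fp
    if total == 0:
        raise ValueError("the confusion matrix is empty")
    accuracy = (tp + tn) / total
    # Returning 0.0 when a denominator is zero is a convention chosen here.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up counts with the positive (minority) class much rarer than the negative class.
scores = evaluate(tp=40, fn=10, tn=930, fp=20)
```

With these counts the accuracy is 0.97 while $F_1$ is about 0.73, which illustrates why the text emphasizes $F_1$ over accuracy for unbalanced data.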

The experimental data are analyzed and processed, and the performance of the invention in classifying unbalanced data is evaluated. The comparison of evaluation indexes across algorithms on the bird nest data set is shown in Table 3, and the comparison on the insulator data set is shown in Table 4.

TABLE 3

TABLE 4

As can be seen from the comparison of tables 3 and 4, the evaluation indexes of the classification of the method are higher than those of the original data and other algorithms, which shows that the method has better performance in the classification of unbalanced data.

It should be understood that all or part of the processes of the method for classifying unbalanced data according to the present invention may also be implemented by a computer program, which can be stored in a computer-readable storage medium and which implements the steps of the method when executed by a processor. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be appropriately extended or restricted as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunications signals.

Fig. 2 is a schematic structural diagram of a preferred embodiment of an imbalance data classification apparatus according to the present invention, which is capable of implementing all processes of the imbalance data classification method according to any of the above embodiments.

As shown in fig. 2, the apparatus includes:

a data obtaining module 21, configured to obtain an unbalanced data set; wherein the unbalanced data set comprises a majority class set and a minority class set;

a support vector calculation module 22, configured to calculate a set of support vectors of the unbalanced data set through an SVM algorithm;

a first distance calculating module 23, configured to calculate a first distance from each sample in the majority class set to each support vector in the set of support vectors;

a sample position statistic calculation module 24 for calculating a sample position statistic from the first distance;

a class bit statistic calculation module 25, configured to calculate a class bit statistic according to the sample position statistic;

and the downsampling module 26 is configured to downsample the majority class set according to the class bit statistics to obtain a downsampled majority class set.

Specifically, in the embodiment of the present invention, the classification device is applied to transmission line fault classification.

Preferably, the calculation formula of the first distance is as follows:

Preferably, the calculation formula of the sample position statistic is as follows:

In formula (2), $Q_k(s)$ is the set of the k nearest support vectors of $x_i$, $s_j \in Q_k(s)$ denotes that $s_j$ is an element of $Q_k(s)$, and $\mu_k(x_i)$ is the sample position statistic of the i-th sample in the majority class set; $\mu_k(x_i)$ expresses the density information near the i-th sample $x_i$.

Preferably, the calculation formula of the class bit statistic is as follows:

Preferably, the down-sampling module 26 is specifically configured to:

for the part with larger $w_i$, down-sample $(m_{-1} - m_1)(1 - m_1/m_{-1})$ samples.

Preferably, the apparatus further comprises:

and an evaluation module 27 for evaluating the classification result by means of a confusion matrix.

Fig. 3 is a schematic structural diagram of a preferred embodiment of a terminal device according to the present invention, where the terminal device is capable of implementing all processes of the imbalance data classification method according to any of the above embodiments.

As shown in fig. 3, the terminal device includes a memory 31 and a processor 32, wherein the memory 31 stores a computer program configured to be executed by the processor 32, and when executed by the processor 32, the computer program implements the method for classifying unbalanced data according to any one of the above embodiments.

Illustratively, the computer program may be divided into one or more modules/units, which are stored in the memory 31 and executed by the processor 32 to accomplish the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used for describing the execution process of the computer program in the terminal device.

The Processor 32 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.

The memory 31 may be used for storing the computer programs and/or modules, and the processor 32 implements various functions of the terminal device by running or executing the computer programs and/or modules stored in the memory 31 and calling data stored in the memory 31. The memory 31 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data (such as audio data, a phonebook, etc.) created according to the use of the terminal device, and the like. In addition, the memory 31 may include a high speed random access memory, and may also include a non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other volatile solid state storage device.

It should be noted that the terminal device includes, but is not limited to, the processor 32 and the memory 31, and those skilled in the art will understand that the structural diagram of fig. 3 is only an example of the terminal device, and does not constitute a limitation to the terminal device, and may include more components than those shown in the figure, or combine some components, or different components.

In the method, device, storage medium and equipment for classifying unbalanced data provided by the embodiments of the invention, an unbalanced data set is first obtained; a support vector set of the unbalanced data set is then calculated through an SVM algorithm; a first distance from each sample in the majority class set to each support vector in the support vector set is calculated; a sample position statistic is calculated from the first distance; a class bit statistic is calculated from the sample position statistic; and finally the majority class set is down-sampled according to the class bit statistic to obtain the down-sampled majority class set. The local density information of the data samples is measured by using the distance between the data samples and the support vectors, the degree of imbalance of the data is considered from the perspective of distribution, and support-vector-based down-sampling is applied to the unbalanced classification data, so that redundant majority class samples can be effectively deleted and the accuracy of unbalanced data classification is improved.

The above description is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and it should be noted that, for those skilled in the art, several equivalent obvious modifications and/or equivalent substitutions can be made without departing from the technical principle of the present invention, and these obvious modifications and/or equivalent substitutions should also be regarded as the scope of the present invention.

Claims

1. A method of classifying unbalanced data, the method comprising:

calculating a sample location statistic from the first distance;

calculating a class bit statistic from the sample position statistic;

2. The method of classifying imbalance data according to claim 1, wherein the first distance is calculated by the formula:

3. The method of classifying imbalance data of claim 2, wherein the sample location statistic is calculated as follows:

4. The method of claim 3, wherein the class bit statistic is calculated as follows:

5. The method of claim 4, wherein the down-sampling the majority class set according to the class bit statistics comprises:

for the part with larger $w_i$, down-sampling $(m_{-1} - m_1)(1 - m_1/m_{-1})$ samples.

6. A method of classifying imbalance data according to any one of claims 1 to 5, the method further comprising:

the classification results are evaluated by means of a confusion matrix.

7. An apparatus for classifying unbalanced data, the apparatus comprising:

8. The apparatus for classifying imbalance data of claim 7, further comprising:

9. A computer-readable storage medium, in which a computer program is stored, which, when executed, implements the method of classifying imbalance data according to any one of claims 1 to 6.

10. A terminal device, characterized in that the terminal device comprises a memory, a processor and a computer program stored in the memory and configured to be executed by the processor, the computer program, when executed by the processor, implementing the method of classification of imbalance data according to any one of claims 1 to 6.