CN112579711A - Method and device for classifying unbalanced data, storage medium and equipment - Google Patents

Method and device for classifying unbalanced data, storage medium and equipment

Info

Publication number
CN112579711A
Authority
CN
China
Prior art keywords
data
sample
class
statistic
majority
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011584448.4A
Other languages
Chinese (zh)
Other versions
CN112579711B (en)
Inventor
张显聪
杨珏
范旭娟
陈雁
何锦强
廖永力
朱登杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China South Power Grid International Co ltd
Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Original Assignee
China South Power Grid International Co ltd
Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Application filed by China South Power Grid International Co ltd and Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority to CN202011584448.4A
Publication of CN112579711A
Application granted
Publication of CN112579711B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06 Energy or water supply

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Economics (AREA)
  • General Physics & Mathematics (AREA)
  • Public Health (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Water Supply & Treatment (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of machine learning and discloses a method, a device, a storage medium and equipment for classifying unbalanced data. The method comprises: obtaining an unbalanced data set; calculating a support vector set of the unbalanced data set through an SVM algorithm; calculating a first distance from each sample in the majority class set to each support vector in the support vector set; calculating a sample position statistic according to the first distance; calculating a class bit statistic according to the sample position statistic; and down-sampling the majority class set according to the class bit statistic to obtain the down-sampled majority class set. The method, device, storage medium and equipment provided by the invention measure the local density information of the data samples using the distance between the data samples and the support vectors, consider the degree of imbalance of the data from the perspective of distribution, and improve the accuracy of classifying unbalanced data.

Description

Method and device for classifying unbalanced data, storage medium and equipment
Technical Field
The present invention relates to the field of machine learning technologies, and in particular, to a method, an apparatus, a storage medium, and a device for classifying unbalanced data.
Background
Solutions to the unbalanced data classification problem fall mainly into two levels: the data level and the algorithm level. Data-level methods include up-sampling and down-sampling, which reduce the degree of imbalance and improve classification by changing the data distribution. Algorithm-level methods improve classification accuracy by analyzing the shortcomings of existing algorithms in handling unbalanced data or by proposing new algorithms, such as cost-sensitive learning and ensemble learning.
Much current research on unbalanced data uses improved models based on the SMOTE (Synthetic Minority Over-sampling Technique) algorithm. However, SMOTE can cause the generated minority-class samples to overlap, because new samples are generated randomly from each minority-class sample and the distribution characteristics of adjacent samples are ignored. Among the improved variants, SVMSMOTE sampling generates new samples based on the hyperplane of the SVM, and BorderlineSMOTE sampling generates samples near the minority-class boundary points. All of these sampling methods reduce the degree of imbalance between classes by increasing the number of minority-class samples.
Down-sampling reduces the number of majority-class samples in order to reduce the degree of imbalance between classes: some index is used to select, from the majority class, a number of samples comparable to the number of minority-class samples. However, some current methods cannot reliably ensure that only redundant samples and noise samples are removed, so the classification accuracy is low.
Therefore, when the unbalanced classification problem is addressed at the data level, it is important to measure the degree of imbalance in the data distribution between the minority class and the majority class, and how to add effective minority-class samples and delete redundant majority-class samples has important research value.
Disclosure of Invention
The technical problem to be solved by the embodiments of the invention is to provide a method, a device, a storage medium and equipment for classifying unbalanced data that measure the local density information of the data samples using the distance between the data samples and the support vectors, consider the degree of imbalance of the data from the perspective of distribution, and improve the accuracy of classifying unbalanced data.
In order to solve the above technical problem, in a first aspect, an embodiment of the present invention provides a method for classifying unbalanced data, where the method includes:
acquiring an unbalanced data set; wherein the unbalanced data set comprises a majority class set and a minority class set;
calculating a set of support vectors of the unbalanced data set through an SVM algorithm;
calculating a first distance from each sample in the majority class set to each support vector in the set of support vectors;
calculating a sample location statistic from the first distance;
calculating a class bit statistic from the sample position statistic;
and performing downsampling on the majority class set according to the class bit statistics to obtain the downsampled majority class set.
As a preferable scheme, the calculation formula of the first distance is as follows:
$$d(x_i, s_j) = \sqrt{\sum_{l=1}^{n} (x_{il} - s_{jl})^2}, \quad x_i \in c_{-1} \qquad (1)$$
in formula (1), $c_{-1}$ is the majority class set, $x_i \in c_{-1}$ denotes that $x_i$ is an element of $c_{-1}$, $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ is the ith sample in the majority class set, $s_j = (s_{j1}, s_{j2}, \ldots, s_{jn})$ is the jth support vector in the set of support vectors, and $d(x_i, s_j)$ is the first distance between the ith sample in the majority class set and the jth support vector in the set of support vectors.
As a preferred solution, the calculation formula of the sample position statistic is as follows:
Figure BDA0002864185890000031
in formula (2), $Q_k(s)$ is the set of the k nearest support vectors of $x_i$, $s_j \in Q_k(s)$ denotes that $s_j$ is an element of $Q_k(s)$, and $\mu_k(x_i)$ is the sample position statistic of the ith sample in the majority class set.
As a preferred scheme, the calculation formula of the class bit statistic is as follows:
Figure BDA0002864185890000032
in formula (3), $w_i$ is the class bit statistic of the ith sample in the majority class set.
As a preferred scheme, the downsampling the majority class set according to the class bit statistics specifically includes:
sorting the samples in the majority class set by $w_i$ and dividing them into two parts at the median;
for the part with smaller $w_i$, down-sampling $(m_{-1} - m_1)\,m_1/m_{-1}$ samples, where $m_1$ is the number of samples in the minority class set and $m_{-1}$ is the number of samples in the majority class set;
for the part with larger $w_i$, down-sampling $(m_{-1} - m_1)(1 - m_1/m_{-1})$ samples.
As a preferable aspect, the method further includes:
the classification results are evaluated by means of a confusion matrix.
In order to solve the above technical problem, in a second aspect, an embodiment of the present invention provides an apparatus for classifying unbalanced data, where the apparatus includes:
the data acquisition module is used for acquiring the unbalanced data set; wherein the unbalanced data set comprises a majority class set and a minority class set;
the support vector calculation module is used for calculating a support vector set of the unbalanced data set through an SVM algorithm;
a first distance calculation module, configured to calculate a first distance from each sample in the majority class set to each support vector in the set of support vectors;
a sample position statistic calculation module for calculating a sample position statistic from the first distance;
the class bit statistic calculation module is used for calculating class bit statistic according to the sample position statistic;
and the down-sampling module is used for carrying out down-sampling on the majority class set according to the class bit statistic to obtain the down-sampled majority class set.
As a preferable aspect, the apparatus further includes:
and the evaluation module is used for evaluating the classification result through the confusion matrix.
In order to solve the above technical problem, in a third aspect, an embodiment of the present invention provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed, implements the method for classifying imbalance data according to any one of the first aspect.
In order to solve the technical problem described above, in a fourth aspect, an embodiment of the present invention provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and configured to be executed by the processor, and when the computer program is executed by the processor, the terminal device implements the method for classifying unbalanced data according to any one of the first aspect.
Compared with the prior art, the method, device, storage medium and equipment for classifying unbalanced data provided by the embodiments of the invention have the following advantages. First, an unbalanced data set is acquired; next, a support vector set of the unbalanced data set is calculated through an SVM algorithm; then, a first distance from each sample in the majority class set to each support vector in the support vector set is calculated; a sample position statistic is then calculated from the first distance, and a class bit statistic is calculated from the sample position statistic; finally, the majority class set is down-sampled according to the class bit statistic to obtain the down-sampled majority class set. The local density information of the data samples is measured using the distance between the data samples and the support vectors, the degree of imbalance of the data is considered from the perspective of distribution, and support-vector-based down-sampling is applied to the unbalanced classification data, so that redundant majority-class samples can be effectively deleted and the accuracy of classifying unbalanced data is improved.
Drawings
In order to more clearly illustrate the technical features of the embodiments of the present invention, the drawings needed to be used in the embodiments of the present invention will be briefly described below, and it is apparent that the drawings described below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained based on the drawings without inventive labor.
FIG. 1 is a flow chart of a preferred embodiment of a method for classifying imbalance data according to the present invention;
FIG. 2 is a schematic structural diagram of a preferred embodiment of an apparatus for classifying imbalance data according to the present invention;
FIG. 3 is a schematic structural diagram of a preferred embodiment of a terminal device provided in the present invention.
Detailed Description
In order to clearly understand the technical features, objects and effects of the present invention, the following detailed description of the embodiments of the present invention is provided with reference to the accompanying drawings and examples. The following examples are intended to illustrate the invention, but are not intended to limit the scope of the invention. Other embodiments, which can be derived by those skilled in the art from the embodiments of the present invention without inventive step, shall fall within the scope of the present invention.
In the description of the present invention, it should be understood that the numbers themselves, such as "first", "second", etc., are used only for distinguishing the described objects, do not have a sequential or technical meaning, and cannot be understood as defining or implying the importance of the described objects.
It should be noted that, with the rapid development of China's economy, society's demand for electric power is increasing. To ensure the normal operation of transmission lines, correctly classifying common faults is of great significance. Data on common transmission line faults are collected using technologies such as the power Internet of Things and unmanned aerial vehicle information collection systems. The collected data show that, because transmission line faults occur by chance, defect data are limited, and because the kinds of faults that occur are not fixed, the amounts of data for the various fault types are unbalanced. In view of the characteristics of transmission line faults and the shortcomings of existing fault classification methods, research on more effective transmission line fault classification technology has important research value.
Fig. 1 is a schematic flow chart of a classification method of unbalanced data according to a preferred embodiment of the present invention.
As shown in fig. 1, the method comprises the steps of:
S11: acquiring an unbalanced data set; wherein the unbalanced data set comprises a majority class set and a minority class set;
S12: calculating a set of support vectors of the unbalanced data set through an SVM algorithm;
S13: calculating a first distance from each sample in the majority class set to each support vector in the set of support vectors;
S14: calculating a sample position statistic from the first distance;
S15: calculating a class bit statistic from the sample position statistic;
S16: performing downsampling on the majority class set according to the class bit statistic to obtain the downsampled majority class set.
Specifically, in the embodiment of the present invention, the classification method is applied to transmission line fault classification.
As an example, a bird nest data set and an insulator data set are selected. The positive class of the bird nest data set is data with bird nests, and the negative class is data without bird nests; the positive class of the insulator data set is insulator damage data, and the negative class is normal insulator data.
First, an unbalanced data set $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ in a feature space is obtained, where $x_i \in R^n$ is the ith sample, any sample in the unbalanced data set T is denoted $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$, and $y_i \in \{+1, -1\}$, $i = 1, 2, \ldots, N$; $y_i = +1$ denotes the positive class and $y_i = -1$ denotes the negative class. The data set T is divided according to the labels into a minority class set $c_1$ and a majority class set $c_{-1}$, whose sample counts are denoted $m_1$ and $m_{-1}$ respectively, and initialization is performed as follows:
Figure BDA0002864185890000071
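The label-based split described above can be sketched as follows. This is only an illustration: the function name `split_by_label` and the toy samples are assumptions, and the positive class (+1) is taken to be the minority class, as in the data sets of this embodiment.

```python
def split_by_label(samples, labels):
    """Split a labelled data set T into (c_1, c_minus1) by label.

    Assumption: label +1 marks the minority (positive) class and
    label -1 marks the majority (negative) class.
    """
    c_1 = [x for x, y in zip(samples, labels) if y == 1]
    c_minus1 = [x for x, y in zip(samples, labels) if y == -1]
    return c_1, c_minus1


# Illustrative toy data, not taken from the patent.
samples = [(0.0, 0.1), (1.0, 1.2), (0.2, 0.0), (0.1, 0.3), (0.9, 1.1)]
labels = [-1, 1, -1, -1, 1]

c_1, c_minus1 = split_by_label(samples, labels)
m_1, m_minus1 = len(c_1), len(c_minus1)  # class sizes m_1 and m_{-1}
```
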
The positive and negative classification tables of the data set are shown in table 1.
TABLE 1
Figure BDA0002864185890000072
Then, a support vector set S of the data set T is calculated through an SVM algorithm, where $|S|$ denotes the number of support vectors and any support vector is denoted $s_i = (s_{i1}, s_{i2}, \ldots, s_{in})$.
Second, the distance from each sample $x_i$ in the majority class set $c_{-1}$ to each support vector $s_j$ in the support vector set S is calculated, so that $|S|$ distances are calculated for each sample $x_i$. In this embodiment, the Euclidean distance is used as the first distance, that is, the calculation formula of the first distance is:
$$d(x_i, s_j) = \sqrt{\sum_{l=1}^{n} (x_{il} - s_{jl})^2}, \quad x_i \in c_{-1} \qquad (1)$$
in formula (1), $c_{-1}$ is the majority class set, $x_i \in c_{-1}$ denotes that $x_i$ is an element of $c_{-1}$, $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ is the ith sample in the majority class set, $s_j = (s_{j1}, s_{j2}, \ldots, s_{jn})$ is the jth support vector in the set of support vectors, and $d(x_i, s_j)$ is the first distance between the ith sample in the majority class set and the jth support vector in the set of support vectors.
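As a minimal sketch of the first-distance step, assuming the Euclidean distance stated in this embodiment, the |S| distances per majority-class sample can be collected in a nested list. The function names and the toy vectors are illustrative; the support vectors are taken as given rather than computed by an SVM.

```python
import math


def euclidean(x, s):
    """Euclidean distance between a sample x and a support vector s."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, s)))


def first_distances(majority, support_vectors):
    """Return D where D[i][j] is the distance from sample i to support vector j."""
    return [[euclidean(x, s) for s in support_vectors] for x in majority]


# Illustrative toy data, not taken from the patent.
majority = [(0.0, 0.0), (3.0, 4.0)]
support_vectors = [(0.0, 0.0), (6.0, 8.0)]

D = first_distances(majority, support_vectors)
# Each row holds |S| = 2 distances for one majority-class sample.
```
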
Next, a sample position statistic is calculated from the first distance. For the ith sample $x_i$ of a given data set, the embodiment of the invention defines a support-vector-based sample position statistic, which is calculated as follows:
Figure BDA0002864185890000081
in formula (2), $Q_k(s)$ is the set of the k nearest support vectors of $x_i$, $s_j \in Q_k(s)$ denotes that $s_j$ is an element of $Q_k(s)$, and $\mu_k(x_i)$ is the sample position statistic of the ith sample in the majority class set; $\mu_k(x_i)$ expresses the density information near the ith sample $x_i$.
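Formula (2) itself is not reproduced in this text. The following is a minimal sketch under the assumption that the sample position statistic is the mean Euclidean distance from a sample to its k nearest support vectors, which agrees with the statement that a denser neighborhood gives a smaller value. The function name, the choice of the mean, and the toy values are assumptions.

```python
import math


def sample_position_statistic(x, support_vectors, k):
    """Assumed form of mu_k(x): mean distance to the k nearest support vectors."""
    if not support_vectors or k < 1:
        raise ValueError("need at least one support vector and k >= 1")
    # Distances to every support vector, smallest first.
    dists = sorted(math.dist(x, s) for s in support_vectors)
    nearest = dists[:k]  # distances to the members of Q_k(s)
    return sum(nearest) / len(nearest)


# Illustrative toy data, not taken from the patent.
support_vectors = [(3.0, 4.0), (0.0, 1.0), (6.0, 8.0)]
mu = sample_position_statistic((0.0, 0.0), support_vectors, k=2)
```

Because the class bit statistic is described as a ratio, using a sum instead of a mean over the k neighbors would change only the scale of the statistic, not the resulting ordering of the samples.
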
Then, calculating a class bit statistic according to the sample position statistic, wherein the formula of calculating the class bit statistic is as follows:
Figure BDA0002864185890000082
in formula (3), $w_i$ is the class bit statistic of the ith sample in the majority class set; $w_i$ is the proportion of $x_i$ in the total density of $c_{-1}$ and reflects the density of $x_i$ within the class: the higher the density near $x_i$, the smaller $w_i$ and $\mu_k(x_i)$.
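Formula (3) is likewise not reproduced in this text. A minimal sketch, assuming the class bit statistic is each sample's position statistic divided by the sum of the position statistics over the majority class (so that the values are proportions summing to 1), might look like this. The function name and the input values are illustrative.

```python
def class_bit_statistics(mu_values):
    """Assumed form of w_i: mu_k(x_i) divided by the sum of mu_k over c_{-1}."""
    total = sum(mu_values)
    if total == 0:
        raise ValueError("position statistics sum to zero; ratio undefined")
    return [mu / total for mu in mu_values]


# Illustrative position statistics for three majority-class samples.
w = class_bit_statistics([1.0, 3.0, 4.0])
```
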
Finally, downsampling the majority class set according to the class bit statistics to obtain the downsampled majority class set, which specifically comprises the following steps:
sorting the samples in the majority class set by $w_i$ and dividing them into two parts at the median;
for the part with smaller $w_i$, down-sampling $(m_{-1} - m_1)\,m_1/m_{-1}$ samples, where $m_1$ is the number of samples in the minority class set and $m_{-1}$ is the number of samples in the majority class set;
for the part with larger $w_i$, down-sampling $(m_{-1} - m_1)(1 - m_1/m_{-1})$ samples.
Here, a smaller $w_i$ indicates that the support vectors around the sample are denser; excessive down-sampling in such a region would reduce the diversity of the majority-class samples and increase the risk of misclassifying samples. Conversely, a larger $w_i$ indicates that the support vectors around the sample are sparse; appropriate down-sampling in such a region effectively reduces the number of majority-class samples with little effect on classification accuracy.
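The median split and the two down-sampling counts can be sketched as follows. Several points are assumptions rather than statements of the patent: "down-sampling N samples" is read as removing N samples from that part, the samples to remove are chosen uniformly at random, counts are rounded to integers, and each count is capped at the size of its part. The function name and the toy data are illustrative.

```python
import random


def downsample_majority(majority, w, m_1, seed=0):
    """Remove samples from the majority class, fewer from the small-w half."""
    m_neg = len(majority)
    if m_neg == 0:
        return []
    # Sort sample indices by w_i and split at the median.
    order = sorted(range(m_neg), key=lambda i: w[i])
    half = m_neg // 2
    low, high = order[:half], order[half:]  # smaller-w part, larger-w part

    to_remove = max(m_neg - m_1, 0)
    # (m_{-1} - m_1) * m_1 / m_{-1} from the smaller-w part ...
    n_low = min(round(to_remove * m_1 / m_neg), len(low))
    # ... and the remainder, (m_{-1} - m_1) * (1 - m_1 / m_{-1}), from the other part.
    n_high = min(to_remove - n_low, len(high))

    rng = random.Random(seed)
    dropped = set(rng.sample(low, n_low)) | set(rng.sample(high, n_high))
    return [x for i, x in enumerate(majority) if i not in dropped]


# Illustrative toy data: 10 majority samples, 4 minority samples.
majority = list(range(10))
w = [0.01 * (i + 1) for i in range(10)]
kept = downsample_majority(majority, w, m_1=4)
```

With these toy sizes, 2 samples are removed from the smaller-w half and 4 from the larger-w half, leaving as many majority samples as minority samples. Under this reading, when m_1/m_{-1} is below roughly 0.29 the larger-w half contains fewer samples than the count to remove, so the cap applies and the result stays somewhat larger than the minority class.
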
In a preferred embodiment, the method further comprises:
S17: the classification results are evaluated by means of a confusion matrix.
Specifically, the samples that are actually positive are divided into those predicted positive and those predicted negative, and the samples that are actually negative are divided into those predicted negative and those predicted positive; the numbers of samples in these four groups are denoted TP, FN, TN and FP respectively. The confusion matrix is shown in Table 2.
TABLE 2
Figure BDA0002864185890000091
The classification effect was evaluated by the following indices:
Figure BDA0002864185890000092
representing the classifier's ability to classify the entire sample;
Figure BDA0002864185890000093
representing the precision of the classifier;
Figure BDA0002864185890000094
representing the recall of the classifier, that is, the accuracy on the minority-class samples;
Figure BDA0002864185890000095
which considers both the precision and the recall for the minority class.
The evaluation indexes have different emphases, and the larger the value of each index, the better the classification effect. In unbalanced classification problems the focus is on $F_1$: the larger the $F_1$ value, the better the classification performance on minority-class samples.
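The four indexes can be computed from the confusion-matrix counts as sketched below, using the standard definitions of accuracy, precision, recall and F1. The patent's own formula images are not reproduced in this text, so the exact notation used there is an assumption, and the counts in the example are invented for illustration only.

```python
def classification_metrics(tp, fn, tn, fp):
    """Standard accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Invented counts: 10 positive (minority) and 90 negative (majority) samples.
metrics = classification_metrics(tp=8, fn=2, tn=85, fp=5)
```

In this invented example the accuracy is high because the majority class dominates, while F1 is noticeably lower, which illustrates why F1 is the index to watch for the minority class.
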
The experimental data were analyzed and processed to evaluate the performance of the classification method of the invention on unbalanced data. The comparison of the evaluation indexes across algorithms on the bird nest data set is shown in Table 3, and the comparison on the insulator data set is shown in Table 4.
TABLE 3
Figure BDA0002864185890000101
TABLE 4
Figure BDA0002864185890000102
As can be seen from Tables 3 and 4, the evaluation indexes obtained with the method are higher than those obtained on the original data and with the other algorithms, which shows that the method performs better in classifying unbalanced data.
It should be understood that all or part of the processes of the method for classifying unbalanced data according to the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium and which, when executed by a processor, implements the steps of the method. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include any entity or device capable of carrying the computer program code, such as a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, or a software distribution medium. It should be noted that the content of the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunications signals.
Fig. 2 is a schematic structural diagram of a preferred embodiment of an imbalance data classification apparatus according to the present invention, which is capable of implementing all processes of the imbalance data classification method according to any of the above embodiments.
As shown in fig. 2, the apparatus includes:
a data obtaining module 21, configured to obtain an unbalanced data set; wherein the unbalanced data set comprises a majority class set and a minority class set;
a support vector calculation module 22, configured to calculate a set of support vectors of the unbalanced data set through an SVM algorithm;
a first distance calculating module 23, configured to calculate a first distance from each sample in the majority class set to each support vector in the set of support vectors;
a sample position statistic calculation module 24 for calculating a sample position statistic from the first distance;
a class bit statistic calculation module 25, configured to calculate a class bit statistic according to the sample position statistic;
and the downsampling module 26 is configured to downsample the majority class set according to the class bit statistics to obtain a downsampled majority class set.
Specifically, in the embodiment of the present invention, the classification device is applied to transmission line fault classification.
Preferably, the calculation formula of the first distance is as follows:
$$d(x_i, s_j) = \sqrt{\sum_{l=1}^{n} (x_{il} - s_{jl})^2}, \quad x_i \in c_{-1} \qquad (1)$$
in formula (1), $c_{-1}$ is the majority class set, $x_i \in c_{-1}$ denotes that $x_i$ is an element of $c_{-1}$, $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ is the ith sample in the majority class set, $s_j = (s_{j1}, s_{j2}, \ldots, s_{jn})$ is the jth support vector in the set of support vectors, and $d(x_i, s_j)$ is the first distance between the ith sample in the majority class set and the jth support vector in the set of support vectors.
Preferably, the calculation formula of the sample position statistic is as follows:
Figure BDA0002864185890000122
in formula (2), $Q_k(s)$ is the set of the k nearest support vectors of $x_i$, $s_j \in Q_k(s)$ denotes that $s_j$ is an element of $Q_k(s)$, and $\mu_k(x_i)$ is the sample position statistic of the ith sample in the majority class set; $\mu_k(x_i)$ expresses the density information near the ith sample $x_i$.
Preferably, the calculation formula of the class bit statistic is as follows:
Figure BDA0002864185890000123
in formula (3), $w_i$ is the class bit statistic of the ith sample in the majority class set; $w_i$ is the proportion of $x_i$ in the total density of $c_{-1}$ and reflects the density of $x_i$ within the class: the higher the density near $x_i$, the smaller $w_i$ and $\mu_k(x_i)$.
Preferably, the down-sampling module 26 is specifically configured to:
sorting the samples in the majority class set by $w_i$ and dividing them into two parts at the median;
for the part with smaller $w_i$, down-sampling $(m_{-1} - m_1)\,m_1/m_{-1}$ samples, where $m_1$ is the number of samples in the minority class set and $m_{-1}$ is the number of samples in the majority class set;
for the part with larger $w_i$, down-sampling $(m_{-1} - m_1)(1 - m_1/m_{-1})$ samples.
Preferably, the apparatus further comprises:
and an evaluation module 27 for evaluating the classification result by means of a confusion matrix.
Fig. 3 is a schematic structural diagram of a preferred embodiment of a terminal device according to the present invention, where the terminal device is capable of implementing all processes of the imbalance data classification method according to any of the above embodiments.
As shown in fig. 3, the terminal device includes a memory 31 and a processor 32, wherein the memory 31 stores a computer program configured to be executed by the processor 32, and the computer program, when executed by the processor 32, implements the method for classifying unbalanced data according to any one of the above embodiments.
Illustratively, the computer program may be divided into one or more modules/units, which are stored in the memory 31 and executed by the processor 32 to accomplish the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used for describing the execution process of the computer program in the terminal device.
The Processor 32 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 31 may be used for storing the computer programs and/or modules, and the processor 32 implements various functions of the terminal device by running or executing the computer programs and/or modules stored in the memory 31 and calling data stored in the memory 31. The memory 31 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data (such as audio data, a phonebook, etc.) created according to the use of the terminal device, and the like. In addition, the memory 31 may include a high speed random access memory, and may also include a non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other volatile solid state storage device.
It should be noted that the terminal device includes, but is not limited to, the processor 32 and the memory 31, and those skilled in the art will understand that the structural diagram of fig. 3 is only an example of the terminal device, and does not constitute a limitation to the terminal device, and may include more components than those shown in the figure, or combine some components, or different components.
In the method, device, storage medium and equipment for classifying unbalanced data provided by the embodiments of the invention, an unbalanced data set is first acquired; a support vector set of the unbalanced data set is then calculated through an SVM algorithm; a first distance from each sample in the majority class set to each support vector in the support vector set is calculated; a sample position statistic is calculated from the first distance, and a class bit statistic is calculated from the sample position statistic; finally, the majority class set is down-sampled according to the class bit statistic to obtain the down-sampled majority class set. The local density information of the data samples is measured using the distance between the data samples and the support vectors, the degree of imbalance of the data is considered from the perspective of distribution, and support-vector-based down-sampling is applied to the unbalanced classification data, so that redundant majority-class samples can be effectively deleted and the accuracy of classifying unbalanced data is improved.
The above description is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and it should be noted that, for those skilled in the art, several equivalent obvious modifications and/or equivalent substitutions can be made without departing from the technical principle of the present invention, and these obvious modifications and/or equivalent substitutions should also be regarded as the scope of the present invention.

Claims (10)

1. A method of classifying unbalanced data, the method comprising:
acquiring an unbalanced data set; wherein the unbalanced data set comprises a majority class set and a minority class set;
calculating a set of support vectors of the unbalanced data set through an SVM algorithm;
calculating a first distance from each sample in the majority class set to each support vector in the set of support vectors;
calculating a sample position statistic from the first distance;
calculating a class position statistic from the sample position statistic;
and downsampling the majority class set according to the class position statistic to obtain the downsampled majority class set.
2. The method of classifying unbalanced data according to claim 1, wherein the first distance is calculated by the formula:
Figure FDA0002864185880000011
in formula (1), $c_{-1}$ is the majority class set; $x_i \in c_{-1}$ denotes that $x_i$ is an element of $c_{-1}$; $x_i = (x_{i1}, x_{i2}, \dots, x_{in})$ is the i-th sample in the majority class set; $s_j = (s_{j1}, s_{j2}, \dots, s_{jn})$ is the j-th support vector in the set of support vectors; and $d(x_i, s_j)$ is the first distance between the i-th sample in the majority class set and the j-th support vector in the set of support vectors.
3. The method of classifying unbalanced data of claim 2, wherein the sample position statistic is calculated as follows:
Figure FDA0002864185880000021
in formula (2), $Q_k(s)$ is the set of the k nearest-neighbor support vectors of $x_i$; $s_j \in Q_k(s)$ denotes that $s_j$ is an element of $Q_k(s)$; and $\mu_k(x_i)$ is the sample position statistic of the i-th sample in the majority class set.
4. The method of claim 3, wherein the class position statistic is calculated as follows:
Figure FDA0002864185880000022
in formula (3), $w_i$ is the class position statistic of the i-th sample in the majority class set.
5. The method of claim 4, wherein the downsampling of the majority class set according to the class position statistic comprises:
sorting the samples in the majority class set by the value of $w_i$ and dividing them into two parts at the median;
for the part with smaller $w_i$, down-sampling $(m_{-1} - m_1)\, m_1 / m_{-1}$ samples; wherein $m_1$ is the number of samples in the minority class set and $m_{-1}$ is the number of samples in the majority class set;
for the part with larger $w_i$, down-sampling $(m_{-1} - m_1)(1 - m_1 / m_{-1})$ samples.
6. The method of classifying unbalanced data according to any one of claims 1 to 5, the method further comprising:
evaluating the classification results by means of a confusion matrix.
7. An apparatus for classifying unbalanced data, the apparatus comprising:
a data acquisition module, configured to acquire an unbalanced data set; wherein the unbalanced data set comprises a majority class set and a minority class set;
a support vector calculation module, configured to calculate a set of support vectors of the unbalanced data set through an SVM algorithm;
a first distance calculation module, configured to calculate a first distance from each sample in the majority class set to each support vector in the set of support vectors;
a sample position statistic calculation module, configured to calculate a sample position statistic from the first distance;
a class position statistic calculation module, configured to calculate a class position statistic from the sample position statistic;
and a down-sampling module, configured to downsample the majority class set according to the class position statistic to obtain the downsampled majority class set.
8. The apparatus for classifying unbalanced data of claim 7, further comprising:
an evaluation module, configured to evaluate the classification result through a confusion matrix.
9. A computer-readable storage medium storing a computer program which, when executed, implements the method of classifying unbalanced data according to any one of claims 1 to 6.
10. A terminal device, characterized in that the terminal device comprises a memory, a processor and a computer program stored in the memory and configured to be executed by the processor, the computer program, when executed by the processor, implementing the method of classifying unbalanced data according to any one of claims 1 to 6.
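The down-sampling quotas recited in claim 5 can be checked with a short sketch. It only computes the two quotas and the median split; how samples are chosen within each part is not shown. The function names, the use of round() for non-integer quotas, and the choice to place the middle sample of an odd-sized set in the larger part are assumptions, since the claim text does not address them. Here m_major and m_minor stand for the majority and minority class sample counts.

```python
def downsample_quotas(m_major, m_minor):
    """Return (quota for the smaller-w part, quota for the larger-w part)."""
    total = m_major - m_minor          # the two quotas add up to this value
    ratio = m_minor / m_major
    low_quota = round(total * ratio)   # assumption: round to nearest integer
    high_quota = total - low_quota     # equals total * (1 - ratio) up to rounding
    return low_quota, high_quota


def split_by_median(w):
    """Sort sample indices by w and split them into two parts at the median.

    Assumption: with an odd number of samples, the middle one joins the
    larger-w part.
    """
    order = sorted(range(len(w)), key=lambda i: w[i])
    half = len(order) // 2
    return order[:half], order[half:]


# Hypothetical counts: 100 majority class samples, 40 minority class samples.
low_quota, high_quota = downsample_quotas(100, 40)
print(low_quota, high_quota, low_quota + high_quota)

# Hypothetical class position statistics for four samples.
smaller_part, larger_part = split_by_median([0.5, 0.1, 0.9, 0.3])
print(smaller_part, larger_part)
```

Because the two factors m_minor / m_major and 1 - m_minor / m_major sum to one, the two quotas always total m_major - m_minor, the difference between the class sizes.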
CN202011584448.4A 2020-12-28 2020-12-28 Unbalanced data classification method, device, storage medium and equipment Active CN112579711B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011584448.4A CN112579711B (en) 2020-12-28 2020-12-28 Unbalanced data classification method, device, storage medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011584448.4A CN112579711B (en) 2020-12-28 2020-12-28 Unbalanced data classification method, device, storage medium and equipment

Publications (2)

Publication Number Publication Date
CN112579711A true CN112579711A (en) 2021-03-30
CN112579711B CN112579711B (en) 2024-09-24

Family

ID=75140416

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011584448.4A Active CN112579711B (en) 2020-12-28 2020-12-28 Unbalanced data classification method, device, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN112579711B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107688831A (en) * 2017-09-04 2018-02-13 五邑大学 A kind of unbalanced data sorting technique based on cluster down-sampling
US20180210944A1 (en) * 2017-01-26 2018-07-26 Agt International Gmbh Data fusion and classification with imbalanced datasets
CN110163261A (en) * 2019-04-28 2019-08-23 平安科技(深圳)有限公司 Unbalanced data disaggregated model training method, device, equipment and storage medium
CN110390348A (en) * 2019-06-11 2019-10-29 仲恺农业工程学院 Method, system, device and storage medium for classifying unbalanced data sets
CN111598116A (en) * 2019-02-21 2020-08-28 杭州海康威视数字技术股份有限公司 Data classification method and device, electronic equipment and readable storage medium
CN112070125A (en) * 2020-08-19 2020-12-11 西安理工大学 Prediction method of unbalanced data set based on isolated forest learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
赵楠; 张小芳; 张利军: "Survey of Research on Imbalanced Data Classification" (不平衡数据分类研究综述), Computer Science (计算机科学), no. 1, 15 June 2018 (2018-06-15) *

Also Published As

Publication number Publication date
CN112579711B (en) 2024-09-24

Similar Documents

Publication Publication Date Title
CN109858613B (en) Compression method and system of deep neural network and terminal equipment
WO2019051941A1 (en) Method, apparatus and device for identifying vehicle type, and computer-readable storage medium
JP5880454B2 (en) Image identification apparatus and program
CN113591948A (en) Defect pattern recognition method and device, electronic equipment and storage medium
CN118376839B (en) Method, device and equipment for positioning newly-added peak point frequency based on DBSCAN algorithm
Taschwer et al. Automatic separation of compound figures in scientific articles
CN110672324B (en) Bearing fault diagnosis method and device based on supervised LLE algorithm
CN117523218A (en) Label generation, training of image classification model and image classification method and device
CN111144425A (en) Method and device for detecting screen shot picture, electronic equipment and storage medium
CN115223042A (en) Target identification method and device based on YOLOv5 network model
CN110889424B (en) Vector index establishing method and device and vector retrieving method and device
CN114418226A (en) Fault analysis method and device of power communication system
CN111414993B (en) Convolutional neural network clipping and convolutional calculation method and device
CN116109627B (en) Defect detection method, device and medium based on migration learning and small sample learning
CN112579711A (en) Method and device for classifying unbalanced data, storage medium and equipment
CN110955760A (en) Evaluation method of judgment result and related device
CN111626373B (en) Multi-scale widening residual error network, small target recognition and detection network and optimization method thereof
CN115204318A (en) Event automatic hierarchical classification method and electronic equipment
CN114418114A (en) Operator fusion method and device, terminal equipment and storage medium
CN114528906A (en) Fault diagnosis method, device, equipment and medium for rotary machine
CN113139617A (en) Power transmission line autonomous positioning method and device and terminal equipment
CN114339859B (en) Method and device for identifying WiFi potential users of full-house wireless network and electronic equipment
CN116823069B (en) Intelligent customer service quality inspection method based on text analysis and related equipment
CN115514621B (en) Fault monitoring method, electronic device and storage medium
CN117033220A (en) Test case processing method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant