CN112579711A - Method and device for classifying unbalanced data, storage medium and equipment - Google Patents
- Publication number
- CN112579711A (application CN202011584448.4A)
- Authority
- CN
- China
- Prior art keywords
- data
- sample
- class
- statistic
- majority
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/06—Energy or water supply
Abstract
The invention relates to the technical field of machine learning and discloses a method, a device, a storage medium and equipment for classifying unbalanced data. The method comprises: obtaining an unbalanced data set; calculating a support vector set of the unbalanced data set through an SVM algorithm; calculating a first distance from each sample in the majority class set to each support vector in the support vector set; calculating a sample position statistic according to the first distance; calculating a class bit statistic according to the sample position statistic; and down-sampling the majority class set according to the class bit statistic to obtain the down-sampled majority class set. The method, device, storage medium and equipment measure the local density information of a data sample by the distance between the data sample and the support vectors, consider the degree of imbalance of the data from the perspective of distribution, and improve the accuracy of classifying unbalanced data.
Description
Technical Field
The present invention relates to the field of machine learning technologies, and in particular, to a method, an apparatus, a storage medium, and a device for classifying unbalanced data.
Background
Solutions to the unbalanced data classification problem fall into two levels: the data level and the algorithm level. Data-level methods comprise up-sampling and down-sampling, which reduce the degree of imbalance and improve the classification effect by changing the data distribution; algorithm-level methods improve classification accuracy by analyzing the defects of existing algorithms in processing unbalanced data or by proposing new algorithms, such as cost-sensitive learning and ensemble learning.
Currently, much research on unbalanced data uses improved models based on the SMOTE (Synthetic Minority Oversampling Technique) algorithm. However, the SMOTE algorithm can cause the generated minority class samples to overlap, because new samples are generated randomly from each minority class sample and the distribution characteristics of adjacent samples are ignored. Among the improved models, SVMSMOTE sampling generates new samples based on the hyperplane of the SVM, and BorderlineSMOTE sampling generates samples near the minority class boundary points. These sampling methods reduce the degree of imbalance between classes by increasing the number of minority class samples.
Down-sampling reduces the degree of imbalance between classes by reducing the number of majority class samples: some index is used to select, from the majority class, a number of samples comparable to the number of minority class samples. However, some current methods cannot adequately ensure that only redundant samples and noise samples are removed, so the classification accuracy is low.
Therefore, when solving the problem of unbalanced classification at the data level, it is important to measure the degree of imbalance of data distribution between the minority class and the majority class, and how to increase effective minority class sample data and delete redundant majority class sample data has important research value.
Disclosure of Invention
The technical problem to be solved by the embodiments of the invention is to provide a method, a device, a storage medium and equipment for classifying unbalanced data that measure the local density information of data samples by the distance between the data samples and the support vectors, consider the degree of imbalance of the data from the perspective of distribution, and improve the accuracy of classifying unbalanced data.
In order to solve the above technical problem, in a first aspect, an embodiment of the present invention provides a method for classifying unbalanced data, where the method includes:
acquiring an unbalanced data set; wherein the unbalanced data set comprises a majority class set and a minority class set;
calculating a set of support vectors of the unbalanced data set through an SVM algorithm;
calculating a first distance from each sample in the majority class set to each support vector in the set of support vectors;
calculating a sample location statistic from the first distance;
calculating a class bit statistic from the sample position statistic;
and performing downsampling on the majority class set according to the class bit statistics to obtain the downsampled majority class set.
As a preferable scheme, the calculation formula of the first distance is as follows:
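$$d(x_i, s_j) = \sqrt{\sum_{l=1}^{n} (x_{il} - s_{jl})^2}, \quad x_i \in c_{-1} \tag{1}$$
in formula (1), $c_{-1}$ is the majority class set, $x_i \in c_{-1}$ denotes that $x_i$ is an element of $c_{-1}$, $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ is the $i$-th sample in the majority class set, $s_j = (s_{j1}, s_{j2}, \ldots, s_{jn})$ is the $j$-th support vector in the set of support vectors, and $d(x_i, s_j)$ is the first distance (the Euclidean distance in the embodiment) between the $i$-th sample in the majority class set and the $j$-th support vector in the set of support vectors.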
As a preferred solution, the calculation formula of the sample position statistic is as follows:
in formula (2), $Q_k(s)$ is the set of the $k$ nearest-neighbor support vectors of $x_i$, $s_j \in Q_k(s)$ denotes that $s_j$ is an element of $Q_k(s)$, and $\mu_k(x_i)$ is the sample position statistic of the $i$-th sample in the majority class set.
As a preferred scheme, the calculation formula of the class bit statistic is as follows:
in formula (3), $w_i$ is the class bit statistic of the $i$-th sample in the majority class set.
As a preferred scheme, the downsampling the majority class set according to the class bit statistics specifically includes:
sorting the samples in the majority class set by the size of $w_i$, and dividing them into two parts at the median;
for the part with smaller $w_i$, down-sampling $(m_{-1} - m_1)\,m_1/m_{-1}$ samples; wherein $m_1$ is the number of samples in the minority class set and $m_{-1}$ is the number of samples in the majority class set;
for the part with larger $w_i$, down-sampling $(m_{-1} - m_1)(1 - m_1/m_{-1})$ samples.
As a preferable aspect, the method further includes:
the classification results are evaluated by means of a confusion matrix.
In order to solve the above technical problem, in a second aspect, an embodiment of the present invention provides an apparatus for classifying unbalanced data, where the apparatus includes:
the data acquisition module is used for acquiring the unbalanced data set; wherein the unbalanced data set comprises a majority class set and a minority class set;
the support vector calculation module is used for calculating a support vector set of the unbalanced data set through an SVM algorithm;
a first distance calculation module, configured to calculate a first distance from each sample in the majority class set to each support vector in the set of support vectors;
a sample position statistic calculation module for calculating a sample position statistic from the first distance;
the class bit statistic calculation module is used for calculating class bit statistic according to the sample position statistic;
and the down-sampling module is used for carrying out down-sampling on the majority class set according to the class bit statistic to obtain the down-sampled majority class set.
As a preferable aspect, the apparatus further includes:
and the evaluation module is used for evaluating the classification result through the confusion matrix.
In order to solve the above technical problem, in a third aspect, an embodiment of the present invention provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed, implements the method for classifying imbalance data according to any one of the first aspect.
In order to solve the technical problem described above, in a fourth aspect, an embodiment of the present invention provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and configured to be executed by the processor, and when the computer program is executed by the processor, the terminal device implements the method for classifying unbalanced data according to any one of the first aspect.
Compared with the prior art, the method, the device, the storage medium and the equipment for classifying unbalanced data have the following advantages: an unbalanced data set is first acquired; a support vector set of the unbalanced data set is then calculated through an SVM algorithm; a first distance from each sample in the majority class set to each support vector in the support vector set is calculated; a sample position statistic is calculated according to the first distance; a class bit statistic is calculated according to the sample position statistic; and finally the majority class set is down-sampled according to the class bit statistic to obtain the down-sampled majority class set. The local density information of the data samples is measured by the distance between the data samples and the support vectors, the degree of imbalance of the data is considered from the perspective of distribution, and support-vector-based down-sampling is applied to the unbalanced classification data, so that redundant majority class samples can be effectively deleted and the accuracy of unbalanced data classification is improved.
Drawings
In order to more clearly illustrate the technical features of the embodiments of the present invention, the drawings needed to be used in the embodiments of the present invention will be briefly described below, and it is apparent that the drawings described below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained based on the drawings without inventive labor.
FIG. 1 is a flow chart of a preferred embodiment of a method for classifying imbalance data according to the present invention;
FIG. 2 is a schematic structural diagram of a preferred embodiment of an apparatus for classifying imbalance data according to the present invention;
FIG. 3 is a schematic structural diagram of a preferred embodiment of a terminal device provided in the present invention.
Detailed Description
In order to clearly understand the technical features, objects and effects of the present invention, the following detailed description of the embodiments of the present invention is provided with reference to the accompanying drawings and examples. The following examples are intended to illustrate the invention, but are not intended to limit the scope of the invention. Other embodiments, which can be derived by those skilled in the art from the embodiments of the present invention without inventive step, shall fall within the scope of the present invention.
In the description of the present invention, it should be understood that the numbers themselves, such as "first", "second", etc., are used only for distinguishing the described objects, do not have a sequential or technical meaning, and cannot be understood as defining or implying the importance of the described objects.
It should be noted that, with the rapid development of China's economy, society's demand for electric power is increasing. Correctly classifying common faults is important for ensuring the normal operation of transmission lines. Common transmission line faults are collected using technologies such as the electric power Internet of Things and unmanned aerial vehicle information collection systems. The data show that, because transmission line faults occur by chance, defect data are limited and the faults of each type are not fixed, so the amount of data for the various fault types is unbalanced. In view of the characteristics of transmission line faults and the defects of existing fault classification methods, research into more effective transmission line fault classification technology has important value.
Fig. 1 is a schematic flow chart of a classification method of unbalanced data according to a preferred embodiment of the present invention.
As shown in fig. 1, the method comprises the steps of:
S11: acquiring an unbalanced data set; wherein the unbalanced data set comprises a majority class set and a minority class set;
S12: calculating a set of support vectors of the unbalanced data set through an SVM algorithm;
S13: calculating a first distance from each sample in the majority class set to each support vector in the set of support vectors;
S14: calculating a sample position statistic from the first distance;
S15: calculating a class bit statistic from the sample position statistic;
S16: performing down-sampling on the majority class set according to the class bit statistic to obtain the down-sampled majority class set.
Specifically, in the embodiment of the present invention, the classification method is applied to transmission line fault classification.
As an example, a bird nest data set and an insulator data set are selected. The positive class of the bird nest data set is data with bird nests, and the negative class is data without bird nests; the positive class of the insulator data set is insulator damage data, and the negative class is normal insulator data.
First, an unbalanced data set $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ in the feature space is obtained, where $x_i \in R^n$ is the $i$-th sample, any sample in the unbalanced data set $T$ is denoted $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$, and $y_i \in \{+1, -1\}$, $i = 1, 2, \ldots, N$; $y_i = +1$ denotes the positive class and $y_i = -1$ denotes the negative class. The data set $T$ is divided according to the labels into the minority class set $c_1$ and the majority class set $c_{-1}$, whose numbers of samples are denoted $m_1$ and $m_{-1}$ respectively. The positive and negative class table of the initialized data set is shown in Table 1.
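The label-based split described above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the function name `split_by_label` and the toy data are hypothetical, and it assumes, as the text does, that the positive class (+1) is the minority class.

```python
from typing import List, Sequence, Tuple

Sample = Sequence[float]


def split_by_label(
    data: List[Tuple[Sample, int]],
) -> Tuple[List[Sample], List[Sample]]:
    """Split T = {(x_i, y_i)} into (c_1, c_{-1}) using the labels +1 / -1."""
    c_pos = [x for x, y in data if y == 1]    # c_1: positive (minority) class
    c_neg = [x for x, y in data if y == -1]   # c_{-1}: negative (majority) class
    return c_pos, c_neg


# Hypothetical toy data set: one positive sample, three negative samples.
T = [((0.1, 0.2), 1), ((1.0, 1.1), -1), ((0.9, 1.3), -1), ((1.2, 0.8), -1)]
c_1, c_neg1 = split_by_label(T)
m_1, m_neg1 = len(c_1), len(c_neg1)
```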
TABLE 1
Then, the support vector set $S$ of the data set $T$ is calculated through the SVM algorithm, where $|S|$ denotes the number of support vectors, and any support vector is denoted $s_j = (s_{j1}, s_{j2}, \ldots, s_{jn})$.
Next, the first distance from each sample $x_i$ in the majority class set $c_{-1}$ to each support vector $s_j$ in the set of support vectors $S$ is calculated, so that $|S|$ distances are calculated for each sample $x_i$. In this embodiment, the Euclidean distance is used to calculate the first distance, that is, the calculation formula of the first distance is:
$$d(x_i, s_j) = \sqrt{\sum_{l=1}^{n} (x_{il} - s_{jl})^2}, \quad x_i \in c_{-1} \tag{1}$$
in formula (1), $c_{-1}$ is the majority class set, $x_i \in c_{-1}$ denotes that $x_i$ is an element of $c_{-1}$, $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ is the $i$-th sample in the majority class set, $s_j = (s_{j1}, s_{j2}, \ldots, s_{jn})$ is the $j$-th support vector in the set of support vectors, and $d(x_i, s_j)$ is the first distance between the $i$-th sample in the majority class set and the $j$-th support vector in the set of support vectors.
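The first-distance step can be sketched as a distance table with one row per majority class sample and one column per support vector. The function name `first_distances` and the toy inputs are hypothetical; the support vectors are taken as given rather than computed by an SVM here.

```python
import math
from typing import List, Sequence


def first_distances(
    majority: Sequence[Sequence[float]],
    support_vectors: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Return D with D[i][j] = Euclidean distance d(x_i, s_j)."""
    # math.dist computes the Euclidean distance between two points (Python 3.8+).
    return [[math.dist(x, s) for s in support_vectors] for x in majority]


# Hypothetical inputs: two majority samples and two support vectors in R^2.
D = first_distances([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (3.0, 0.0)])
```

Each row of `D` holds the $|S|$ distances of one sample, which is the input the later statistics need.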
Then, a sample position statistic is calculated from the first distance. For the $i$-th sample $x_i$ of a given data set, the embodiment of the invention defines a support-vector-based sample position statistic, which is calculated as follows:
in formula (2), $Q_k(s)$ is the set of the $k$ nearest-neighbor support vectors of $x_i$, $s_j \in Q_k(s)$ denotes that $s_j$ is an element of $Q_k(s)$, and $\mu_k(x_i)$ is the sample position statistic of the $i$-th sample in the majority class set; $\mu_k(x_i)$ expresses the density information near the $i$-th sample $x_i$.
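A minimal sketch of the sample position statistic, under the assumption that formula (2) averages the first distances over the $k$ nearest support vectors $Q_k(s)$. That averaging form is an assumption, not confirmed by the text; the function name `sample_position_statistic` is hypothetical.

```python
from typing import Sequence


def sample_position_statistic(distances_i: Sequence[float], k: int) -> float:
    """Assumed form of mu_k(x_i): mean distance from x_i to its k nearest support vectors."""
    if k < 1 or not distances_i:
        raise ValueError("k must be >= 1 and at least one distance is required")
    # Q_k(s): the k support vectors with the smallest first distance to x_i.
    nearest = sorted(distances_i)[:k]
    return sum(nearest) / len(nearest)


# Hypothetical distances from one sample to four support vectors.
mu = sample_position_statistic([4.0, 1.0, 3.0, 2.0], k=2)
```

Under this assumption a small value means the sample sits close to many support vectors, which matches the statement that denser neighborhoods give smaller $\mu_k(x_i)$.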
Then, calculating a class bit statistic according to the sample position statistic, wherein the formula of calculating the class bit statistic is as follows:
in formula (3), $w_i$ is the class bit statistic of the $i$-th sample in the majority class set; $w_i$ is the proportion of $x_i$ in the total density of $c_{-1}$ and reflects the inner density of $x_i$: the higher the density near $x_i$, the smaller $w_i$ and $\mu_k(x_i)$.
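A minimal sketch of the class bit statistic, assuming formula (3) normalizes each $\mu_k(x_i)$ by the sum of $\mu_k$ over the majority class set, which is one reading of "the proportion of $x_i$ in the total density of $c_{-1}$". The exact formula is an assumption, and the function name `class_bit_statistics` is hypothetical.

```python
from typing import List, Sequence


def class_bit_statistics(mu: Sequence[float]) -> List[float]:
    """Assumed form of w_i: mu_k(x_i) divided by the sum of mu_k over c_{-1}."""
    total = sum(mu)
    if total <= 0:
        raise ValueError("sum of sample position statistics must be positive")
    return [m / total for m in mu]


# Hypothetical sample position statistics for two majority samples.
w = class_bit_statistics([1.0, 3.0])
```

With this normalization the values $w_i$ sum to 1, and their ordering is the same as the ordering of $\mu_k(x_i)$.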
Finally, downsampling the majority class set according to the class bit statistics to obtain the downsampled majority class set, which specifically comprises the following steps:
sorting the samples in the majority class set by the size of $w_i$, and dividing them into two parts at the median;
for the part with smaller $w_i$, down-sampling $(m_{-1} - m_1)\,m_1/m_{-1}$ samples; wherein $m_1$ is the number of samples in the minority class set and $m_{-1}$ is the number of samples in the majority class set;
for the part with larger $w_i$, down-sampling $(m_{-1} - m_1)(1 - m_1/m_{-1})$ samples.
Here, a smaller $w_i$ indicates that the support vectors around the sample are denser; excessive down-sampling in this region would reduce the diversity of the majority class samples and increase the risk of misclassifying them. Conversely, a larger $w_i$ indicates that the support vectors around the sample are sparse; appropriate down-sampling in this region can effectively reduce the number of majority class samples with less impact on the classification accuracy.
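The median split and the two quotas can be sketched as follows. Several points are assumptions rather than statements of the text: the quotas are read as numbers of samples to remove (they sum to $m_{-1} - m_1$, leaving $m_1$ majority samples), the samples removed within each part are chosen uniformly at random, the first quota is rounded to an integer, and each quota is clipped to the size of its part. The function name `downsample_majority` and the `seed` parameter are hypothetical.

```python
import random
from typing import List, Sequence


def downsample_majority(w: Sequence[float], m_minority: int, seed: int = 0) -> List[int]:
    """Return the indices of the majority samples kept after down-sampling."""
    m_majority = len(w)
    if not 0 < m_minority <= m_majority:
        raise ValueError("need 0 < m_1 <= m_{-1}")
    # Sort sample indices by w_i and split them at the median position.
    order = sorted(range(m_majority), key=lambda i: w[i])
    half = m_majority // 2
    small_part, large_part = order[:half], order[half:]
    # Quotas from the text: (m_-1 - m_1) * m_1 / m_-1 for the smaller-w part,
    # and the remainder, (m_-1 - m_1) * (1 - m_1 / m_-1), for the larger-w part.
    to_remove = m_majority - m_minority
    n_small = round(to_remove * m_minority / m_majority)
    n_large = to_remove - n_small
    # Assumption: clip each quota so it never exceeds the size of its part.
    n_small = min(n_small, len(small_part))
    n_large = min(n_large, len(large_part))
    rng = random.Random(seed)
    removed = set(rng.sample(small_part, n_small)) | set(rng.sample(large_part, n_large))
    return [i for i in range(m_majority) if i not in removed]


# Hypothetical example: 10 majority samples, 4 minority samples.
kept = downsample_majority([0.01 * i for i in range(10)], m_minority=4)
```

In this example 6 samples are removed, 2 from the denser (smaller $w_i$) half and 4 from the sparser half. When the imbalance is severe, the larger-part quota can exceed half of the majority class; the clipping above then leaves more than $m_1$ samples, and a real implementation would need its own rule for that case.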
In a preferred embodiment, the method further comprises:
S17: the classification results are evaluated by means of a confusion matrix.
Specifically, the samples actually in the positive class are divided into two parts, those predicted positive and those predicted negative, and the samples actually in the negative class are divided into two parts, those predicted negative and those predicted positive; the numbers of samples in these four parts are abbreviated as TP, FN, TN and FP, respectively. The confusion matrix is shown in Table 2.
TABLE 2
The classification effect was evaluated by the following indices:
where one index represents the recall of the classifier and another represents the accuracy on the minority class samples;
The evaluation indices have different emphases, and the larger the value of each index, the better the classification effect. In the unbalanced classification problem, the focus is on $F_1$: the larger the $F_1$ value, the better the classification performance on minority class samples.
The experimental data are analyzed and processed to evaluate the performance of the classification method of the invention on unbalanced data. The comparison of evaluation indices across algorithms on the bird nest data set is shown in Table 3, and that on the insulator data set is shown in Table 4.
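As an illustration of evaluating a classifier from the confusion matrix counts, the following sketch uses the standard textbook definitions of precision, recall, $F_1$ and accuracy. Whether these match the patent's own index formulas exactly is an assumption; the function name `classification_metrics` and the example counts are hypothetical.

```python
from typing import Dict


def classification_metrics(tp: int, fn: int, fp: int, tn: int) -> Dict[str, float]:
    """Standard metrics from confusion matrix counts (positive = minority class)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts: 10 actual positives, 90 actual negatives.
metrics = classification_metrics(tp=8, fn=2, fp=4, tn=86)
```

In this example the accuracy is high because of the large negative class, while $F_1$ is noticeably lower, which illustrates why $F_1$ is the index of interest for unbalanced data.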
TABLE 3
TABLE 4
As can be seen from the comparison of tables 3 and 4, the evaluation indexes of the classification of the method are higher than those of the original data and other algorithms, which shows that the method has better performance in the classification of unbalanced data.
It should be understood that all or part of the processes of the classification method for imbalance data according to the present invention may also be implemented by a computer program, which can be stored in a computer-readable storage medium and can implement the steps of the classification method for imbalance data when the computer program is executed by a processor. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain other components which may be suitably increased or decreased as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, in accordance with legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunications signals.
Fig. 2 is a schematic structural diagram of a preferred embodiment of an imbalance data classification apparatus according to the present invention, which is capable of implementing all processes of the imbalance data classification method according to any of the above embodiments.
As shown in fig. 2, the apparatus includes:
a data obtaining module 21, configured to obtain an unbalanced data set; wherein the unbalanced data set comprises a majority class set and a minority class set;
a support vector calculation module 22, configured to calculate a set of support vectors of the unbalanced data set through an SVM algorithm;
a first distance calculating module 23, configured to calculate a first distance from each sample in the majority class set to each support vector in the set of support vectors;
a sample position statistic calculation module 24 for calculating a sample position statistic from the first distance;
a class bit statistic calculation module 25, configured to calculate a class bit statistic according to the sample position statistic;
and the downsampling module 26 is configured to downsample the majority class set according to the class bit statistics to obtain a downsampled majority class set.
Specifically, in the embodiment of the present invention, the classification device is applied to transmission line fault classification.
Preferably, the calculation formula of the first distance is as follows:
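$$d(x_i, s_j) = \sqrt{\sum_{l=1}^{n} (x_{il} - s_{jl})^2}, \quad x_i \in c_{-1} \tag{1}$$
in formula (1), $c_{-1}$ is the majority class set, $x_i \in c_{-1}$ denotes that $x_i$ is an element of $c_{-1}$, $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ is the $i$-th sample in the majority class set, $s_j = (s_{j1}, s_{j2}, \ldots, s_{jn})$ is the $j$-th support vector in the set of support vectors, and $d(x_i, s_j)$ is the first distance (the Euclidean distance in the embodiment) between the $i$-th sample in the majority class set and the $j$-th support vector in the set of support vectors.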
Preferably, the calculation formula of the sample position statistic is as follows:
in formula (2), $Q_k(s)$ is the set of the $k$ nearest-neighbor support vectors of $x_i$, $s_j \in Q_k(s)$ denotes that $s_j$ is an element of $Q_k(s)$, and $\mu_k(x_i)$ is the sample position statistic of the $i$-th sample in the majority class set; $\mu_k(x_i)$ expresses the density information near the $i$-th sample $x_i$.
Preferably, the calculation formula of the class bit statistic is as follows:
in formula (3), $w_i$ is the class bit statistic of the $i$-th sample in the majority class set; $w_i$ is the proportion of $x_i$ in the total density of $c_{-1}$ and reflects the inner density of $x_i$: the higher the density near $x_i$, the smaller $w_i$ and $\mu_k(x_i)$.
Preferably, the down-sampling module 26 is specifically configured to:
sorting the samples in the majority class set by the size of $w_i$, and dividing them into two parts at the median;
for the part with smaller $w_i$, down-sampling $(m_{-1} - m_1)\,m_1/m_{-1}$ samples; wherein $m_1$ is the number of samples in the minority class set and $m_{-1}$ is the number of samples in the majority class set;
for the part with larger $w_i$, down-sampling $(m_{-1} - m_1)(1 - m_1/m_{-1})$ samples.
Preferably, the apparatus further comprises:
and an evaluation module 27 for evaluating the classification result by means of a confusion matrix.
Fig. 3 is a schematic structural diagram of a preferred embodiment of a terminal device according to the present invention, where the terminal device is capable of implementing all processes of the imbalance data classification method according to any of the above embodiments.
As shown in fig. 3, the terminal device includes a memory 31 and a processor 32; wherein the memory 31 stores a computer program configured to be executed by the processor 32, and when executed by the processor 32, the computer program implements the method for classifying unbalanced data according to any one of the above embodiments.
Illustratively, the computer program may be divided into one or more modules/units, which are stored in the memory 31 and executed by the processor 32 to accomplish the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used for describing the execution process of the computer program in the terminal device.
The Processor 32 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 31 may be used for storing the computer programs and/or modules, and the processor 32 implements various functions of the terminal device by running or executing the computer programs and/or modules stored in the memory 31 and calling data stored in the memory 31. The memory 31 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. In addition, the memory 31 may include a high speed random access memory, and may also include a non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other volatile solid state storage device.
It should be noted that the terminal device includes, but is not limited to, the processor 32 and the memory 31, and those skilled in the art will understand that the structural diagram of fig. 3 is only an example of the terminal device, and does not constitute a limitation to the terminal device, and may include more components than those shown in the figure, or combine some components, or different components.
According to the classification method, the device, the storage medium and the equipment for unbalanced data provided by the embodiments of the invention, an unbalanced data set is first obtained; a support vector set of the unbalanced data set is then calculated through an SVM algorithm; a first distance from each sample in the majority class set to each support vector in the support vector set is calculated; a sample position statistic is calculated according to the first distance; a class bit statistic is calculated according to the sample position statistic; and finally the majority class set is down-sampled according to the class bit statistic to obtain the down-sampled majority class set. The local density information of the data samples is measured by the distance between the data samples and the support vectors, the degree of imbalance of the data is considered from the perspective of distribution, and support-vector-based down-sampling is applied to the unbalanced classification data, so that redundant majority class samples can be effectively deleted and the accuracy of unbalanced data classification is improved.
The above description is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and it should be noted that, for those skilled in the art, several equivalent obvious modifications and/or equivalent substitutions can be made without departing from the technical principle of the present invention, and these obvious modifications and/or equivalent substitutions should also be regarded as the scope of the present invention.
Claims (10)
1. A method of classifying unbalanced data, the method comprising:
acquiring an unbalanced data set; wherein the unbalanced data set comprises a majority class set and a minority class set;
calculating a set of support vectors of the unbalanced data set through an SVM algorithm;
calculating a first distance from each sample in the majority class set to each support vector in the set of support vectors;
calculating a sample position statistic from the first distance;
calculating a class bit statistic from the sample position statistic;
and down-sampling the majority class set according to the class bit statistic to obtain the down-sampled majority class set.
2. The method of classifying unbalanced data according to claim 1, wherein the first distance is calculated by the formula:
In formula (1), $c_{-1}$ is the majority class set; $x_i \in c_{-1}$ denotes that $x_i$ is an element of $c_{-1}$; $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ is the i-th sample in the majority class set; $s_j = (s_{j1}, s_{j2}, \ldots, s_{jn})$ is the j-th support vector in the support vector set; and $d(x_i, s_j)$ is the first distance between the i-th sample in the majority class set and the j-th support vector in the support vector set.
3. The method of classifying unbalanced data of claim 2, wherein the sample position statistic is calculated as follows:
In formula (2), $Q_k(s)$ is the set of the k nearest-neighbor support vectors of $x_i$; $s_j \in Q_k(s)$ denotes that $s_j$ is an element of $Q_k(s)$; and $\mu_k(x_i)$ is the sample position statistic of the i-th sample in the majority class set.
5. The method of claim 4, wherein down-sampling the majority class set according to the class bit statistic comprises:
sorting the samples in the majority class set by the magnitude of $w_i$, and dividing them into two parts at the median;
for the part with smaller $w_i$, down-sampling $(m_{-1} - m_1)\, m_1 / m_{-1}$ samples; wherein $m_1$ is the number of samples in the minority class set and $m_{-1}$ is the number of samples in the majority class set;
for the part with larger $w_i$, down-sampling $(m_{-1} - m_1)(1 - m_1 / m_{-1})$ samples.
6. The method of classifying unbalanced data according to any one of claims 1 to 5, the method further comprising:
evaluating the classification results by means of a confusion matrix.
7. An apparatus for classifying unbalanced data, the apparatus comprising:
a data acquisition module, configured to acquire an unbalanced data set; wherein the unbalanced data set comprises a majority class set and a minority class set;
a support vector calculation module, configured to calculate a support vector set of the unbalanced data set through an SVM algorithm;
a first distance calculation module, configured to calculate a first distance from each sample in the majority class set to each support vector in the support vector set;
a sample position statistic calculation module, configured to calculate a sample position statistic from the first distance;
a class bit statistic calculation module, configured to calculate a class bit statistic from the sample position statistic; and
a down-sampling module, configured to down-sample the majority class set according to the class bit statistic to obtain the down-sampled majority class set.
8. The apparatus for classifying unbalanced data of claim 7, further comprising:
an evaluation module, configured to evaluate the classification result through a confusion matrix.
9. A computer-readable storage medium, in which a computer program is stored, which, when executed, implements the method of classifying unbalanced data according to any one of claims 1 to 6.
10. A terminal device, characterized in that the terminal device comprises a memory, a processor and a computer program stored in the memory and configured to be executed by the processor, the computer program, when executed by the processor, implementing the method of classifying unbalanced data according to any one of claims 1 to 6.
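The two down-sampling counts recited in claim 5 can be checked with a short calculation. The sketch below only evaluates the two expressions for given class sizes; the function and parameter names are hypothetical, the example sizes are made up, and rounding to whole samples is left out because the claim does not say how it is handled. Whether each count refers to samples removed or samples retained is likewise not settled by the claim text.

```python
def downsampling_counts(m_minority, m_majority):
    """Evaluate the two counts of claim 5 for the smaller-w and larger-w parts.

    m_minority corresponds to m_1 and m_majority to m_{-1}.
    Returns unrounded values; the rounding rule is an open assumption.
    """
    if not 0 < m_minority <= m_majority:
        raise ValueError("expected 0 < m_minority <= m_majority")
    excess = m_majority - m_minority   # m_{-1} - m_1
    ratio = m_minority / m_majority    # m_1 / m_{-1}
    smaller_w_count = excess * ratio
    larger_w_count = excess * (1 - ratio)
    return smaller_w_count, larger_w_count

# Hypothetical class sizes chosen so the arithmetic is exact.
small, large = downsampling_counts(m_minority=32, m_majority=128)
```

Because the two weights $m_1 / m_{-1}$ and $1 - m_1 / m_{-1}$ add up to 1, the two counts always sum to $m_{-1} - m_1$, the difference between the class sizes.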
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011584448.4A CN112579711B (en) | 2020-12-28 | 2020-12-28 | Unbalanced data classification method, device, storage medium and equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112579711A | 2021-03-30 |
CN112579711B | 2024-09-24 |
Family
ID=75140416
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN202011584448.4A | Active | CN112579711B (en) | 2020-12-28 | 2020-12-28 | Unbalanced data classification method, device, storage medium and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112579711B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107688831A (en) * | 2017-09-04 | 2018-02-13 | 五邑大学 | A kind of unbalanced data sorting technique based on cluster down-sampling |
US20180210944A1 (en) * | 2017-01-26 | 2018-07-26 | Agt International Gmbh | Data fusion and classification with imbalanced datasets |
CN110163261A (en) * | 2019-04-28 | 2019-08-23 | 平安科技(深圳)有限公司 | Unbalanced data disaggregated model training method, device, equipment and storage medium |
CN110390348A (en) * | 2019-06-11 | 2019-10-29 | 仲恺农业工程学院 | Method, system, device and storage medium for classifying unbalanced data sets |
CN111598116A (en) * | 2019-02-21 | 2020-08-28 | 杭州海康威视数字技术股份有限公司 | Data classification method and device, electronic equipment and readable storage medium |
CN112070125A (en) * | 2020-08-19 | 2020-12-11 | 西安理工大学 | Prediction method of unbalanced data set based on isolated forest learning |
Non-Patent Citations (1)
Title |
---|
Zhao Nan; Zhang Xiaofang; Zhang Lijun: "A Survey of Research on Imbalanced Data Classification", Computer Science, no. 1, 15 June 2018 (2018-06-15) * |
Also Published As
Publication number | Publication date |
---|---|
CN112579711B (en) | 2024-09-24 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||