CN112396124B - Small sample data expansion method and system for unbalanced data - Google Patents

Small sample data expansion method and system for unbalanced data

Info

Publication number
CN112396124B
Authority
CN
China
Prior art keywords
data set
positive
samples
sample
positive sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011384923.3A
Other languages
Chinese (zh)
Other versions
CN112396124A (en)
Inventor
柴森春
王昭洋
周泰民
崔灵果
李慧芳
姚分喜
张百海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202011384923.3A priority Critical patent/CN112396124B/en
Publication of CN112396124A publication Critical patent/CN112396124A/en
Application granted granted Critical
Publication of CN112396124B publication Critical patent/CN112396124B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147Distances to closest patterns, e.g. nearest neighbour classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30Computing systems specially adapted for manufacturing

Abstract

The invention relates to a small sample data expansion method and system for unbalanced data. Unbalanced MES system data are extracted from the upper-layer platform of an MES interconnection and intercommunication system, and the imbalance level of the data is determined from the difference between the number of positive samples in the positive sample data set and the number of negative samples in the negative sample data set. Different expansion methods are applied to different imbalance levels. When the positive sample data set is a first-level unbalanced data set, the Borderline-SMOTE algorithm is used to expand the positive sample points at the boundary, which effectively avoids the boundary blurring caused by traditional oversampling and improves data quality to a certain extent. When the positive sample data set is a second-level unbalanced data set, a density-based SMOTE algorithm is used for expansion, which reduces boundary blurring compared with the traditional SMOTE algorithm and ensures the quality of the small sample data expansion.

Description

Small sample data expansion method and system for unbalanced data
Technical Field
The invention relates to the technical field of data expansion, in particular to a small sample data expansion method and system for unbalanced data.
Background
With the continuous development of artificial intelligence, machine learning technology is now widely applied in industrial production. However, when machine learning is applied to classification, regression and similar problems in the production flow, data imbalance is often encountered. For example, fault data in industrial fault diagnosis are far scarcer than normal data, confirmed cases in medical diagnosis are few, and credit card fraud prediction in finance and network intrusion prediction in network security are likewise made difficult by data imbalance. In an unbalanced problem, the number of majority-class (negative) samples is far greater than the number of minority-class (positive) samples. In pursuing maximum overall classification accuracy, most basic models tend to update their parameters in favor of the majority class and neglect the correct classification of the minority class, so the minority class is hard for the classifier to learn, even though the classification accuracy of the minority class is often the greater concern. Therefore, a technique is needed to perform reasonable data expansion on the minority-class samples in the MES system.
Currently, under-sampling and over-sampling methods and ensemble learning methods are widely used to alleviate the training problems of unbalanced data sets, at the data sampling level and the algorithm optimization level respectively. Common oversampling techniques balance the number of samples across fault classes by amplifying the minority fault classes. However, simply copying samples easily causes overfitting on the severely unbalanced data in the MES system, and newly generated samples easily overlap between fault classes, so an oversampling algorithm cannot fully guarantee the quality of the amplified data.
Disclosure of Invention
The invention aims to provide a small sample data expansion method and system for unbalanced data, so as to overcome the defects of overfitting of small sample data expansion and data overlapping between newly generated samples in the prior art and ensure the quality of small sample data expansion.
In order to achieve the purpose, the invention provides the following scheme:
a small sample data expansion method oriented to unbalanced data, the method comprising:
extracting MES system unbalanced data from an upper platform of an MES interconnection and interworking system, and forming a sample data set by all the MES system unbalanced data; the sample data set comprises a positive sample data set and a negative sample data set;
obtaining a difference value between the number of positive samples in the positive sample data set and the number of negative samples in the negative sample data set, and judging whether the difference value is smaller than a difference value threshold value to obtain a first judgment result;
if the first judgment result is yes, determining that the positive sample data set is a first-level unbalanced data set;
adopting a Borderline-SMOTE algorithm to expand the first-level unbalanced data set to obtain an expanded positive sample data set;
if the first judgment result is no, determining that the positive sample data set is a second-level unbalanced data set;
expanding the second-level unbalanced data set by adopting a density-based SMOTE algorithm to obtain an expanded positive sample data set;
and the expanded positive sample data set and the negative sample data set form an MES system balanced data set.
Optionally, the expanding the first-level unbalanced data set by using a Borderline-SMOTE algorithm to obtain an expanded positive sample data set specifically includes:
for each positive sample in the first-level unbalanced data set, acquiring the K samples in the sample data set that are nearest to that positive sample;
respectively counting the number of neighbor positive samples and the number of neighbor negative samples in the K nearest neighbor samples of each positive sample;
determining each positive sample for which the number of neighbor positive samples is less than the number of neighbor negative samples and greater than 0 as a boundary positive sample at the boundary between positive and negative samples;
according to each boundary positive sample and the neighbor positive samples among its K nearest neighbor samples, obtaining a new positive sample for each boundary positive sample in the first-level unbalanced data set by using the formula
[formula image GDA0003964669830000031];
new positive samples of all boundary positive samples in the first-level unbalanced data set constitute a first new positive sample set;
merging the first new positive sample set and the first-level unbalanced data set to obtain an expanded first-level unbalanced data set;
judging whether the number of positive samples in the expanded first-level unbalanced data set is larger than the number of negative samples in the negative sample data set, and obtaining a second judgment result;
if the second judgment result is yes, randomly deleting new positive samples in the expanded first-level unbalanced data set so that the number of positive samples in the expanded first-level unbalanced data set after deletion equals the number of negative samples in the negative sample data set, and outputting the expanded first-level unbalanced data set after deletion;
if the second judgment result is no, judging whether the number of positive samples in the expanded first-level unbalanced data set is equal to the number of negative samples in the negative sample data set, and obtaining a third judgment result;
if the third judgment result is yes, outputting the expanded first-level unbalanced data set;
if the third judgment result is no, updating the first-level unbalanced data set to the expanded first-level unbalanced data set, and returning to the step of acquiring the K nearest neighbor samples of each positive sample in the first-level unbalanced data set;
wherein $p_i$ is the $i$-th boundary positive sample; [symbol image GDA0003964669830000032] is the $d$-th nearest positive sample among the K nearest neighbor samples of the boundary positive sample $p_i$; K is the number of nearest neighbor samples of the boundary positive sample $p_i$; $m_i$ is the number of positive samples among the K nearest neighbor samples of the boundary positive sample $p_i$; $p_{inew,d}$ is a new positive sample of the positive sample $p_i$; rand() is a random function, and rand(0, 1) is a random number generated within (0, 1).
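The Borderline-SMOTE steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation. Because the formula image is not reproduced in this text, the sketch assumes the standard SMOTE interpolation, new = p + rand(0, 1) * (q - p), between a boundary positive sample p and one of its neighbor positive samples q. The function names, the parameters `k`, `seed` and `max_rounds`, and the early exit when no boundary sample is found are hypothetical additions.

```python
import math
import random


def borderline_smote_round(positives, negatives, k=5, rng=None):
    """One pass: find boundary positives and interpolate towards their positive neighbours."""
    rng = rng or random.Random()
    # Positives come first, so index i in `positives` is index i in `labelled`.
    labelled = [(p, 1) for p in positives] + [(n, 0) for n in negatives]
    new_samples = []
    for i, p in enumerate(positives):
        others = labelled[:i] + labelled[i + 1:]  # exclude the sample itself
        neighbours = sorted(others, key=lambda s: math.dist(p, s[0]))[:k]
        pos_nb = [x for x, label in neighbours if label == 1]
        neg_nb = [x for x, label in neighbours if label == 0]
        # Boundary condition from the text: 0 < positive neighbours < negative neighbours.
        if 0 < len(pos_nb) < len(neg_nb):
            for q in pos_nb:
                gap = rng.random()  # in [0, 1); the text specifies (0, 1)
                # ASSUMED interpolation rule (standard SMOTE).
                new_samples.append(tuple(a + gap * (b - a) for a, b in zip(p, q)))
    return new_samples


def expand_first_level(positives, negatives, k=5, seed=0, max_rounds=100):
    """Repeat rounds until the positives reach the negative count, then trim any surplus."""
    rng = random.Random(seed)
    original = list(positives)
    expanded = list(positives)
    for _ in range(max_rounds):
        if len(expanded) >= len(negatives):
            break
        new = borderline_smote_round(expanded, negatives, k, rng)
        if not new:
            break  # safeguard not in the text: no boundary samples, so stop
        expanded.extend(new)
    surplus = len(expanded) - len(negatives)
    if surplus > 0:
        # Randomly delete only newly generated samples, as the text requires.
        synthetic = expanded[len(original):]
        rng.shuffle(synthetic)
        expanded = original + synthetic[surplus:]
    return expanded
```

Calling `expand_first_level(positives, negatives)` with sequences of equal-length numeric tuples returns the original positive samples followed by the synthetic ones. If no boundary samples can be found, the result may still contain fewer positive samples than there are negative samples.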
Optionally, the expanding the second-level unbalanced data set by using a density-based SMOTE algorithm to obtain an expanded positive sample data set specifically includes:
determining the density of each positive sample in the second-level unbalanced data set by using the formula
[formula image GDA0003964669830000041];
normalizing the density of each positive sample in the second-level unbalanced data set to obtain the normalized density of each positive sample in the second-level unbalanced data set;
arranging the normalized densities of all positive samples in the second-level unbalanced data set from large to small to form a density set;
initializing a second new positive sample set as an empty set;
letting $l = 1$;
determining the positive sample corresponding to the $l$-th density in the density set;
using the formula $m_l = \kappa \times \rho(p_l)$, determining the number $m_l$ of samples for the positive sample $p_l$ corresponding to the $l$-th density in the density set;
obtaining the $m_l$ neighbor positive samples nearest to the positive sample $p_l$ in the second-level unbalanced data set;
according to the positive sample $p_l$ and the $m_l$ neighbor positive samples, obtaining $m_l$ new samples by using the formula
[formula image GDA0003964669830000042];
adding the $m_l$ new samples to the second new positive sample set to obtain an updated second new positive sample set;
merging the updated second new positive sample set and the second-level unbalanced data set to obtain an expanded second-level unbalanced data set;
judging whether the number of positive samples in the expanded second-level unbalanced data set is greater than the number of negative samples in the negative sample data set, and obtaining a fourth judgment result;
if the fourth judgment result is yes, randomly deleting new positive samples in the expanded second-level unbalanced data set so that the number of positive samples in the expanded second-level unbalanced data set after deletion equals the number of negative samples in the negative sample data set, and outputting the expanded second-level unbalanced data set after deletion;
if the fourth judgment result is no, judging whether the number of positive samples in the expanded second-level unbalanced data set is equal to the number of negative samples in the negative sample data set, and obtaining a fifth judgment result;
if the fifth judgment result is yes, outputting the expanded second-level unbalanced data set;
if the fifth judgment result is no, increasing $l$ by 1, and returning to the step of determining the positive sample corresponding to the $l$-th density in the density set;
wherein $1 \le l \le \text{min date}$; $p_j$ is the $j$-th positive sample in the second-level unbalanced data set; $p_g$ is the $g$-th positive sample in the second-level unbalanced data set; $\rho(p_j)$ is the density of the positive sample $p_j$ in the second-level unbalanced data set; min date is the number of positive samples in the second-level unbalanced data set; $dis(p_j, p_g)$ is the Euclidean distance between the positive samples $p_j$ and $p_g$ in the second-level unbalanced data set,
$dis(p_j, p_g) = \sqrt{\sum_{k=1}^{n} (p_{jk} - p_{gk})^2}$,
$n$ is the dimension of the positive samples, $p_{jk}$ is the $k$-th dimension of the positive sample $p_j$, $p_{gk}$ is the $k$-th dimension of the positive sample $p_g$, $\kappa$ is an adjustable coefficient,
[symbol image GDA0003964669830000052]
is the $b$-th neighbor positive sample among the $m_l$ neighbor positive samples; $p_{inew,b}$ is the new sample generated from the positive sample $p_l$ and the $b$-th neighbor positive sample among the $m_l$ neighbor positive samples; rand() is a random function, and rand(0, 1) is a random number generated within (0, 1).
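The density-based expansion above can be sketched in Python as follows. This is only an illustration, and several parts are assumptions because the density and interpolation formula images are not reproduced in this text: the density of a positive sample is assumed to be the inverse of its mean Euclidean distance to the other positive samples; normalization is assumed to divide by the largest density; $m_l$ is rounded to an integer and capped at the number of other positive samples; new samples use the standard SMOTE interpolation. All function and parameter names are hypothetical.

```python
import math
import random


def normalised_densities(positives):
    """ASSUMED density: inverse mean Euclidean distance to the other positives, scaled to max 1."""
    if len(positives) < 2:
        raise ValueError("need at least two positive samples")
    raw = []
    for j, p in enumerate(positives):
        dists = [math.dist(p, g) for i, g in enumerate(positives) if i != j]
        mean = sum(dists) / len(dists)
        raw.append(1.0 / (mean + 1e-12))  # small constant avoids division by zero
    top = max(raw)
    return [r / top for r in raw]


def density_smote(positives, negatives, kappa=3.0, seed=0):
    """Generate m_l = kappa * rho(p_l) samples per positive, densest first, until balanced."""
    rng = random.Random(seed)
    rho = normalised_densities(positives)
    # Density set: sample indices ordered from largest to smallest normalized density.
    order = sorted(range(len(positives)), key=lambda j: rho[j], reverse=True)
    target = len(negatives)
    new_samples = []
    for l in order:
        if len(positives) + len(new_samples) >= target:
            break
        p_l = positives[l]
        # ASSUMED: round m_l and cap it at the number of available neighbours.
        m_l = min(round(kappa * rho[l]), len(positives) - 1)
        others = [g for i, g in enumerate(positives) if i != l]
        neighbours = sorted(others, key=lambda g: math.dist(p_l, g))[:m_l]
        for q in neighbours:
            gap = rng.random()  # in [0, 1); the text specifies (0, 1)
            # ASSUMED interpolation rule (standard SMOTE).
            new_samples.append(tuple(a + gap * (b - a) for a, b in zip(p_l, q)))
    surplus = len(positives) + len(new_samples) - target
    if surplus > 0:
        # Randomly delete newly generated samples until the classes are equal in size.
        rng.shuffle(new_samples)
        new_samples = new_samples[surplus:]
    return list(positives) + new_samples
```

As in the text, the loop visits each positive sample at most once, so with a small `kappa` the output can still contain fewer positive samples than there are negative samples. Under the assumed density, dense regions of the minority class receive more synthetic samples than isolated points.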
Optionally, after the expanded positive sample data set and the negative sample data set form the MES system balanced data set, the method further includes:
visualizing and storing the MES system balanced data set.
An unbalanced data oriented small sample data augmentation system, the system comprising:
a sample data set forming module, configured to extract MES system unbalanced data from an upper-layer platform of an MES interconnection and intercommunication system and to form a sample data set from all the MES system unbalanced data; the sample data set comprises a positive sample data set and a negative sample data set;
a first judgment result obtaining module, configured to obtain a difference between the number of positive samples in the positive sample data set and the number of negative samples in the negative sample data set, judge whether the difference is smaller than a difference threshold, and obtain a first judgment result;
a first-level unbalanced data set determination module, configured to determine that the positive sample data set is a first-level unbalanced data set if the first judgment result is yes;
a first extended positive sample data set obtaining module, configured to extend the first-level unbalanced data set by using a Borderline-SMOTE algorithm to obtain an extended positive sample data set;
a second-level unbalanced data set determination module, configured to determine that the positive sample data set is a second-level unbalanced data set if the first determination result indicates no;
a second extended positive sample data set obtaining module, configured to extend the second level unbalanced data set by using a density-based SMOTE algorithm to obtain an extended positive sample data set;
and the MES system balanced data set forming module is used for forming an MES system balanced data set by the expanded positive sample data set and the negative sample data set.
Optionally, the first extended positive sample data set obtaining module specifically includes:
a neighbor sample obtaining submodule, configured to acquire, for each positive sample in the first-level unbalanced data set, the K samples in the sample data set that are nearest to that positive sample;
a quantity counting submodule, configured to respectively count the number of neighbor positive samples and the number of neighbor negative samples in the K nearest neighbor samples of each positive sample;
a boundary positive sample determining submodule, configured to determine each positive sample for which the number of neighbor positive samples is less than the number of neighbor negative samples and greater than 0 as a boundary positive sample at the boundary between positive and negative samples;
a new positive sample obtaining submodule, configured to obtain, according to each boundary positive sample and the neighbor positive samples among its K nearest neighbor samples, a new positive sample for each boundary positive sample in the first-level unbalanced data set by using the formula
[formula image GDA0003964669830000061];
a first new positive sample set forming submodule, configured to form a first new positive sample set from the new positive samples of all boundary positive samples in the first-level unbalanced data set;
an expanded first-level unbalanced data set obtaining submodule, configured to merge the first new positive sample set and the first-level unbalanced data set to obtain an expanded first-level unbalanced data set;
a second judgment result obtaining submodule, configured to judge whether the number of positive samples in the expanded first-level unbalanced data set is greater than the number of negative samples in the negative sample data set, and obtain a second judgment result;
a deleted expanded first-level unbalanced data set output submodule, configured to, if the second judgment result is yes, randomly delete new positive samples in the expanded first-level unbalanced data set so that the number of positive samples in the expanded first-level unbalanced data set after deletion equals the number of negative samples in the negative sample data set, and output the expanded first-level unbalanced data set after deletion;
a third judgment result obtaining submodule, configured to, if the second judgment result is no, judge whether the number of positive samples in the expanded first-level unbalanced data set is equal to the number of negative samples in the negative sample data set, and obtain a third judgment result;
an expanded positive sample data set output submodule, configured to output the expanded first-level unbalanced data set if the third judgment result is yes;
an updating submodule, configured to, if the third judgment result is no, update the first-level unbalanced data set to the expanded first-level unbalanced data set and return to the step of acquiring the K nearest neighbor samples of each positive sample in the first-level unbalanced data set;
wherein $p_i$ is the $i$-th boundary positive sample; [symbol image GDA0003964669830000071] is the $d$-th nearest positive sample among the K nearest neighbor samples of the boundary positive sample $p_i$; K is the number of nearest neighbor samples of the boundary positive sample $p_i$; $m_i$ is the number of positive samples among the K nearest neighbor samples of the boundary positive sample $p_i$; $p_{inew,d}$ is a new positive sample of the positive sample $p_i$; rand() is a random function, and rand(0, 1) is a random number generated within (0, 1).
Optionally, the module for obtaining the second extended positive sample data set specifically includes:
a density determination submodule, configured to determine the density of each positive sample in the second-level unbalanced data set by using the formula
[formula image GDA0003964669830000072];
a normalized density obtaining submodule, configured to normalize the density of each positive sample in the second-level unbalanced data set to obtain the normalized density of each positive sample in the second-level unbalanced data set;
a density set forming submodule, configured to arrange the normalized densities of all positive samples in the second-level unbalanced data set from large to small to form a density set;
a second new positive sample set initialization submodule, configured to initialize the second new positive sample set to an empty set;
an initial value setting submodule, configured to let $l = 1$;
a density-corresponding positive sample determining submodule, configured to determine the positive sample corresponding to the $l$-th density in the density set;
a sample number determining submodule, configured to determine, by using the formula $m_l = \kappa \times \rho(p_l)$, the number $m_l$ of samples for the positive sample $p_l$ corresponding to the $l$-th density in the density set;
a neighbor positive sample obtaining submodule, configured to obtain the $m_l$ neighbor positive samples nearest to the positive sample $p_l$ in the second-level unbalanced data set;
a new sample obtaining submodule, configured to obtain $m_l$ new samples according to the positive sample $p_l$ and the $m_l$ neighbor positive samples by using the formula
[formula image GDA0003964669830000081];
an updated second new positive sample set obtaining submodule, configured to add the $m_l$ new samples to the second new positive sample set to obtain an updated second new positive sample set;
an expanded second-level unbalanced data set obtaining submodule, configured to merge the updated second new positive sample set and the second-level unbalanced data set to obtain an expanded second-level unbalanced data set;
a fourth judgment result obtaining submodule, configured to judge whether the number of positive samples in the expanded second-level unbalanced data set is greater than the number of negative samples in the negative sample data set, and obtain a fourth judgment result;
a deleted expanded second-level unbalanced data set output submodule, configured to, if the fourth judgment result is yes, randomly delete new positive samples in the expanded second-level unbalanced data set so that the number of positive samples in the expanded second-level unbalanced data set after deletion equals the number of negative samples in the negative sample data set, and output the expanded second-level unbalanced data set after deletion;
a fifth judgment result obtaining submodule, configured to, if the fourth judgment result is no, judge whether the number of positive samples in the expanded second-level unbalanced data set is equal to the number of negative samples in the negative sample data set, and obtain a fifth judgment result;
an expanded second positive sample data set output submodule, configured to output the expanded second-level unbalanced data set if the fifth judgment result is yes;
a returning submodule, configured to, if the fifth judgment result is no, increase $l$ by 1 and return to the step of determining the positive sample corresponding to the $l$-th density in the density set;
wherein $1 \le l \le \text{min date}$; $p_j$ is the $j$-th positive sample in the second-level unbalanced data set; $p_g$ is the $g$-th positive sample in the second-level unbalanced data set; $\rho(p_j)$ is the density of the positive sample $p_j$ in the second-level unbalanced data set; min date is the number of positive samples in the second-level unbalanced data set; $dis(p_j, p_g)$ is the Euclidean distance between the positive samples $p_j$ and $p_g$ in the second-level unbalanced data set,
$dis(p_j, p_g) = \sqrt{\sum_{k=1}^{n} (p_{jk} - p_{gk})^2}$,
$n$ is the dimension of the positive samples, $p_{jk}$ is the $k$-th dimension of the positive sample $p_j$, $p_{gk}$ is the $k$-th dimension of the positive sample $p_g$, $\kappa$ is an adjustable coefficient,
[symbol image GDA0003964669830000092]
is the $b$-th neighbor positive sample among the $m_l$ neighbor positive samples; $p_{inew,b}$ is the new sample generated from the positive sample $p_l$ and the $b$-th neighbor positive sample among the $m_l$ neighbor positive samples; rand() is a random function, and rand(0, 1) is a random number generated within (0, 1).
Optionally, the system further includes:
and the visualization and storage module is used for visualizing and storing the MES system balance data set.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
The invention provides a small sample data expansion method and system for unbalanced data, in which different expansion methods are applied to different imbalance levels of the MES system unbalanced data. When the positive sample data set is a first-level unbalanced data set, the Borderline-SMOTE algorithm is used to expand the positive sample points at the boundary, which effectively avoids the boundary blurring caused by traditional oversampling and improves data quality to a certain extent. When the positive sample data set is a second-level unbalanced data set, a density-based SMOTE algorithm is used for expansion, which reduces boundary blurring compared with the traditional SMOTE algorithm and ensures the quality of the small sample data expansion.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flowchart of a small sample data expansion method for unbalanced data according to the present invention;
FIG. 2 is a schematic diagram of the Borderline-SMOTE algorithm provided by the present invention;
FIG. 3 is a schematic diagram of the SMOTE algorithm provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a small sample data expansion method and system for unbalanced data, so as to overcome the defects of overfitting of small sample data expansion and data overlapping between newly generated samples in the prior art and ensure the quality of small sample data expansion.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
The invention provides a small sample data expansion method facing unbalanced data, as shown in fig. 1, the method comprises:
S101, extracting MES system unbalanced data from an upper-layer platform of an MES interconnection and intercommunication system, and forming a sample data set from all the MES system unbalanced data; the sample data set comprises a positive sample data set and a negative sample data set;
S102, obtaining a difference value between the number of positive samples in the positive sample data set and the number of negative samples in the negative sample data set, judging whether the difference value is smaller than a difference value threshold value, and obtaining a first judgment result;
S103, if the first judgment result is yes, determining that the positive sample data set is a first-level unbalanced data set;
S104, expanding the first-level unbalanced data set by adopting a Borderline-SMOTE algorithm to obtain an expanded positive sample data set;
S105, if the first judgment result is no, determining that the positive sample data set is a second-level unbalanced data set;
S106, expanding the second-level unbalanced data set by adopting a density-based SMOTE algorithm to obtain an expanded positive sample data set;
S107, the expanded positive sample data set and the negative sample data set form an MES system balanced data set.
The specific process is as follows:
Step S101: the MES upper-layer data management layer is accessed to extract data. By default, the extracted data are taken to have already undergone data preprocessing, and the MES data are taken to contain two classes, namely a majority class (negative samples) and a minority class (positive samples). The positive sample set is denoted as P and the negative sample set is denoted as N.
In step S102, the number of positive samples (denoted as min date) and the number of negative samples (denoted as max date) in the extracted MES data are counted. If the difference between min date and max date is less than 20%, the data set is determined to be a slightly unbalanced data set, i.e. a first-level unbalanced data set (step S103). If the difference between min date and max date is not less than 20%, the data set is determined to be a severely unbalanced data set, i.e. a second-level unbalanced data set (step S105). The subsequent data expansion module expands the MES data according to this evaluation result.
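The level decision of steps S102, S103 and S105 can be sketched as follows. This is only an illustration: the text does not state what the 20% difference is measured against, so the sketch assumes it is relative to the number of negative samples, and the function name `imbalance_level` is hypothetical.

```python
def imbalance_level(min_date: int, max_date: int, threshold: float = 0.20) -> int:
    """Return 1 for a first-level (slightly unbalanced) data set, 2 for a second-level one.

    min_date: number of positive (minority) samples.
    max_date: number of negative (majority) samples.
    """
    if min_date <= 0 or max_date <= 0:
        raise ValueError("both classes must contain at least one sample")
    # Assumption: the difference is expressed relative to the majority-class count.
    relative_gap = abs(max_date - min_date) / max_date
    return 1 if relative_gap < threshold else 2
```

Under this assumption, 90 positive and 100 negative samples give a first-level data set, while 80 positive and 100 negative samples already count as second-level because the gap is not less than 20%.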
Step S104: for slightly unbalanced data, data balance can be achieved with only a small amount of expansion of the positive samples. The first-level unbalanced data set is expanded by using the Borderline-SMOTE algorithm to obtain an expanded positive sample data set, as shown in fig. 2, which specifically includes:
for each positive sample in the first-level unbalanced data set, acquiring the K nearest neighbor samples of that positive sample in the sample data set;
respectively counting the number of neighbor positive samples and the number of neighbor negative samples among the K nearest neighbor samples of each positive sample;
determining each positive sample for which the number of neighbor positive samples is less than the number of neighbor negative samples and the number of neighbor positive samples is greater than 0 as a boundary positive sample lying on the boundary between positive and negative samples;
according to each boundary positive sample and the neighbor positive samples among the K nearest neighbor samples of that boundary positive sample, using the formula
Figure GDA0003964669830000111
Obtaining a new positive sample of each boundary positive sample in the first level unbalanced data set;
new positive samples of all boundary positive samples in the first level unbalanced data set form a first new positive sample set;
merging the first new positive sample set and the first-level unbalanced data set to obtain an expanded first-level unbalanced data set;
judging whether the number of positive samples in the expanded first-level unbalanced data set is greater than the number of negative samples in the negative sample data set or not, and obtaining a second judgment result;
if the second judgment result is yes, randomly deleting new positive samples in the expanded first-level unbalanced data set so that the number of positive samples in the expanded first-level unbalanced data set after deletion is equal to the number of negative samples in the negative sample data set, and outputting the expanded first-level unbalanced data set after deletion;
if the second judgment result is no, judging whether the number of positive samples in the expanded first-level unbalanced data set is equal to the number of negative samples in the negative sample data set, and obtaining a third judgment result;
if the third judgment result is yes, outputting the expanded first-level unbalanced data set;
if the third judgment result is no, updating the first-level unbalanced data set to the expanded first-level unbalanced data set, and returning to the step of "acquiring, for each positive sample in the first-level unbalanced data set, the K nearest neighbor samples of that positive sample in the sample data set";
wherein p_i is the i-th boundary positive sample,
Figure GDA0003964669830000121
is the d-th neighbor positive sample among the K nearest neighbor samples of the boundary positive sample p_i, K is the number of nearest neighbor samples of the boundary positive sample p_i, m_i is the number of neighbor positive samples among the K nearest neighbor samples of the boundary positive sample p_i, p_{inew,d} is the new positive sample of the positive sample p_i, rand() is a random function, rand(0, 1) is a random number generated within (0, 1), and n_i in fig. 2 is the number of neighbor negative samples among the K nearest neighbor samples of the boundary positive sample p_i;
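The loop of step S104 can be sketched in Python as below. This is a minimal sketch, not the patented implementation: the generation formula is only available as a figure, so the sketch assumes the usual SMOTE interpolation, new = p_i + rand(0, 1) × (neighbor − p_i). The function names, the `max_rounds` guard and the early exit when no boundary sample exists are additions made for illustration, and `rng.random()` draws from [0, 1) rather than the open interval (0, 1).

```python
import numpy as np

def borderline_smote_round(P, N, K=5, rng=None):
    """One pass over P: new samples generated from boundary positive samples."""
    rng = np.random.default_rng() if rng is None else rng
    P = np.asarray(P, dtype=float)
    N = np.asarray(N, dtype=float)
    data = np.vstack([P, N])                      # sample data set (positives first)
    is_pos = np.arange(len(data)) < len(P)
    k_eff = min(K, len(data) - 1)                 # cannot have more neighbors than other samples
    new_samples = []
    for i, p_i in enumerate(P):
        dist = np.linalg.norm(data - p_i, axis=1)
        dist[i] = np.inf                          # exclude the sample itself
        nn = np.argsort(dist)[:k_eff]             # K nearest neighbor samples
        pos_nn = nn[is_pos[nn]]                   # neighbor positive samples
        m_i = len(pos_nn)
        n_i = len(nn) - m_i                       # neighbor negative samples
        if 0 < m_i < n_i:                         # boundary positive sample
            for d in pos_nn:
                # Assumed interpolation between p_i and its d-th neighbor positive sample.
                new_samples.append(p_i + rng.random() * (data[d] - p_i))
    return np.array(new_samples, dtype=float).reshape(-1, P.shape[1])

def expand_first_level(P, N, K=5, rng=None, max_rounds=100):
    """Repeat rounds until the positive count reaches the negative count."""
    rng = np.random.default_rng() if rng is None else rng
    P = np.asarray(P, dtype=float)
    N = np.asarray(N, dtype=float)
    n_orig = len(P)
    for _ in range(max_rounds):                   # guard added for this sketch
        new = borderline_smote_round(P, N, K, rng)
        if len(new) == 0:
            break                                 # no boundary samples, nothing to generate
        P = np.vstack([P, new])
        if len(P) > len(N):
            # Randomly delete generated samples only; original samples are kept.
            keep = rng.choice(np.arange(n_orig, len(P)),
                              size=len(N) - n_orig, replace=False)
            P = np.vstack([P[:n_orig], P[np.sort(keep)]])
            break
        if len(P) == len(N):
            break
    return P
```

The original positive samples always occupy the first rows of the result, so only generated samples are ever removed when the positive count overshoots the negative count.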
Step S106: the second-level unbalanced data set is expanded by using a density-based SMOTE (Synthetic Minority Oversampling Technique) algorithm to obtain an expanded positive sample data set, as shown in fig. 3, which specifically includes:
using the formula
Figure GDA0003964669830000122
determining the density of each positive sample in the second-level unbalanced data set. ρ(p_j) measures the local density at sample point p_j: the larger ρ(p_j) is, the greater the density near sample point p_j and the more concentrated the sample points; the smaller ρ(p_j) is, the lower the density near sample point p_j.
Normalizing the density of each positive sample in the second-level unbalanced data set to obtain the normalized density of each positive sample in the second-level unbalanced data set; the normalization formula used is as follows:
Figure GDA0003964669830000123
where ρ_min is the minimum value in the density set, ρ_max is the maximum value in the density set, and ρ′(p_j) is the normalized density.
Arranging the normalized densities of all positive samples in the second-level unbalanced data set from large to small to form a density set; the density set is H = {ρ′(p_1), ρ′(p_2), …, ρ′(p_{min date})}, where ρ′(p_1), ρ′(p_2) and ρ′(p_{min date}) are the 1st, 2nd and (min date)-th densities in the density set H, and ρ′(p_1) > ρ′(p_2) > … > ρ′(p_{min date}).
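The normalization and ordering steps can be illustrated with a short sketch. The normalization formula itself is only given as a figure; the sketch assumes ordinary min-max scaling, which is consistent with the definitions of ρ_min and ρ_max above but is not confirmed by the text. The function name `normalize_and_rank` and the handling of the degenerate case where all densities are equal are hypothetical, and the densities are taken as already computed.

```python
import numpy as np

def normalize_and_rank(rho):
    """Return (normalized densities, indices ordered from largest to smallest density)."""
    rho = np.asarray(rho, dtype=float)
    rho_min, rho_max = rho.min(), rho.max()
    if rho_max == rho_min:
        # Assumption: identical densities are all mapped to 0 to avoid division by zero.
        rho_norm = np.zeros_like(rho)
    else:
        # Assumed min-max form: (rho - rho_min) / (rho_max - rho_min).
        rho_norm = (rho - rho_min) / (rho_max - rho_min)
    order = np.argsort(-rho_norm, kind="stable")   # density set H, large to small
    return rho_norm, order
```

For densities 2, 6 and 4, this gives normalized values 0, 1 and 0.5, so the traversal order visits the second sample first, then the third, then the first.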
Initializing a second new positive sample set as an empty set;
let l =1;
determining the positive sample corresponding to the l-th density in the density set, the traversal proceeding sequentially from the sample point corresponding to the first density value in the density set H;
using the formula m_l = κ × ρ(p_l), determining the number of samples m_l for the positive sample p_l corresponding to the l-th density in the density set;
obtaining the m_l neighbor positive samples nearest to the positive sample p_l in the second-level unbalanced data set;
according to the positive sample p_l and the m_l neighbor positive samples, using the formula
Figure GDA0003964669830000131
obtaining m_l new samples;
adding the m_l new samples to the second new positive sample set to obtain an updated second new positive sample set;
merging the updated second new positive sample set and the second-level unbalanced data set to obtain an expanded second-level unbalanced data set;
judging whether the number of positive samples in the expanded second-level unbalanced data set is greater than the number of negative samples in the negative sample data set or not, and obtaining a fourth judgment result;
if the fourth judgment result is yes, randomly deleting new positive samples in the expanded second-level unbalanced data set so that the number of positive samples in the expanded second-level unbalanced data set after deletion is equal to the number of negative samples in the negative sample data set, and outputting the expanded second-level unbalanced data set after deletion;
if the fourth judgment result is no, judging whether the number of positive samples in the expanded second-level unbalanced data set is equal to the number of negative samples in the negative sample data set, and obtaining a fifth judgment result;
if the fifth judgment result is yes, outputting the expanded second-level unbalanced data set;
if the fifth judgment result is no, increasing the value of l by 1, and returning to the step of "determining the positive sample corresponding to the l-th density in the density set";
wherein 1 ≤ l ≤ min date, p_j is the j-th positive sample in the second-level unbalanced data set, p_g is the g-th positive sample in the second-level unbalanced data set, ρ(p_j) is the density of the positive sample p_j in the second-level unbalanced data set, min date is the number of positive samples in the second-level unbalanced data set, and dis(p_j, p_g) is the Euclidean distance between the positive sample p_j and the positive sample p_g in the second-level unbalanced data set,
Figure GDA0003964669830000141
n is the dimension of the positive samples, p_jk is the k-th dimension data of the positive sample p_j, p_gk is the k-th dimension data of the positive sample p_g, κ is an adjustable coefficient,
Figure GDA0003964669830000142
is the b-th neighbor positive sample among the m_l neighbor positive samples, p_{inew,b} is the new sample generated from the positive sample p_l and the b-th neighbor positive sample among the m_l neighbor positive samples, rand() is a random function, and rand(0, 1) is a random number generated within (0, 1).
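The traversal of step S106 can be sketched as follows, under several stated assumptions. The density formula and the generation formula are only available as figures, so the sketch takes the densities `rho` as an input computed elsewhere and assumes the usual SMOTE interpolation p_l + rand(0, 1) × (neighbor − p_l). Rounding m_l = κ × ρ(p_l) to an integer and capping it at the number of other positive samples are also assumptions, and the function name `density_smote` is hypothetical.

```python
import numpy as np

def density_smote(P, n_neg, rho, kappa=1.0, rng=None):
    """Expand positive samples P until their count reaches n_neg (if possible)."""
    rng = np.random.default_rng() if rng is None else rng
    P = np.asarray(P, dtype=float)
    rho = np.asarray(rho, dtype=float)
    n_orig = len(P)
    # Min-max normalization is monotone, so sorting raw densities gives the same order as H.
    order = np.argsort(-rho, kind="stable")
    new_samples = []                               # second new positive sample set
    for l in order:
        p_l = P[l]
        # Assumption: m_l is rounded and limited to the available neighbors.
        m_l = max(0, min(int(round(kappa * rho[l])), n_orig - 1))
        dist = np.linalg.norm(P - p_l, axis=1)     # Euclidean distances to all positives
        dist[l] = np.inf                           # exclude p_l itself
        for b in np.argsort(dist)[:m_l]:           # m_l nearest neighbor positive samples
            new_samples.append(p_l + rng.random() * (P[b] - p_l))
        if n_orig + len(new_samples) >= n_neg:     # fourth and fifth judgments combined
            break
    new = np.array(new_samples, dtype=float).reshape(-1, P.shape[1])
    n_needed = max(0, n_neg - n_orig)
    if len(new) > n_needed:
        # Randomly delete surplus generated samples; original samples are kept.
        keep = rng.choice(len(new), size=n_needed, replace=False)
        new = new[np.sort(keep)]
    return np.vstack([P, new])
```

If the traversal reaches the end of the density set before the counts match, this sketch simply returns what it has generated; the text does not describe that case.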
The processed positive sample data P and negative sample data N are obtained through the steps.
Step S107 is followed by:
First, the MES system balanced data set is stored; then a visualization tool is used to visualize the data so that the MES data can be evaluated.
The invention produces the following advantages:
1. The invention adopts different data expansion methods according to the degree of data imbalance in the MES interconnection and interworking system. When the data are slightly unbalanced, data balance can be achieved by expanding only a small part of the data, so the existing Borderline-SMOTE algorithm is adopted to expand the positive sample points at the boundary. This effectively avoids the boundary ambiguity caused by traditional oversampling, improves data quality to a certain extent, and helps to improve classification accuracy.
2. When severely unbalanced data are encountered, the invention adopts the density-based SMOTE algorithm for data oversampling. The density formula represents well how densely samples of the same class are distributed near a sample point, and in the classification process the density reflects, to a certain degree, the importance of a sample point among the samples. The density is therefore used as a coefficient for data expansion and, in combination with the traditional SMOTE algorithm, different numbers of neighbor samples are selected for different sample points when expanding. Where the expanded positive sample point set exceeds the negative sample point set in size, newly generated sample points are randomly deleted, so that the information in the original data is retained.
The invention also provides a small sample data expansion system facing unbalanced data, which comprises:
the sample data set forming module is used for extracting the MES system unbalanced data from an upper-layer platform of the MES interconnection and interworking system and forming the MES system unbalanced data into a sample data set; the sample data set comprises a positive sample data set and a negative sample data set;
the first judgment result obtaining module is used for obtaining the difference value between the number of positive samples in the positive sample data set and the number of negative samples in the negative sample data set, judging whether the difference value is smaller than a difference value threshold value or not and obtaining a first judgment result;
the first-level unbalanced data set judging module is used for determining that the positive sample data set is a first-level unbalanced data set if the first judgment result is yes;
the first expanded positive sample data set obtaining module is used for expanding the first-level unbalanced data set by adopting a Borderline-SMOTE algorithm to obtain an expanded positive sample data set;
the second-level unbalanced data set judging module is used for determining that the positive sample data set is a second-level unbalanced data set if the first judgment result is no;
a second expanded positive sample data set obtaining module, configured to expand the second-level unbalanced data set by using a density-based SMOTE algorithm to obtain an expanded positive sample data set;
and the MES system balanced data set forming module is used for forming an MES system balanced data set from the expanded positive sample data set and the negative sample data set.
The first augmented positive sample data set obtaining module specifically includes:
the neighbor sample obtaining submodule is used for obtaining K neighbor samples, which are closest to each positive sample in the sample data set, of each positive sample in the first-level unbalanced data set;
the quantity counting submodule is used for respectively counting the quantity of the neighbor positive samples and the quantity of the neighbor negative samples in the K nearest neighbor samples of each positive sample;
the boundary positive sample determining submodule is used for determining each positive sample for which the number of neighbor positive samples is less than the number of neighbor negative samples and the number of neighbor positive samples is greater than 0 as a boundary positive sample lying on the boundary between positive and negative samples;
a new positive sample obtaining submodule, configured to use, according to each boundary positive sample and the neighbor positive samples among the K nearest neighbor samples of that boundary positive sample, the formula
Figure GDA0003964669830000161
Obtaining a new positive sample of each boundary positive sample in the first level unbalanced data set;
the first new positive sample set forming submodule is used for forming a first new positive sample set by new positive samples of all boundary positive samples in the first-level unbalanced data set;
the expanded first-level unbalanced data set obtaining submodule is used for merging the first new positive sample set and the first-level unbalanced data set to obtain an expanded first-level unbalanced data set;
the second judgment result obtaining submodule is used for judging whether the number of positive samples in the expanded first-level unbalanced data set is larger than the number of negative samples in the negative sample data set or not and obtaining a second judgment result;
the deleted expanded first-level unbalanced data set output submodule is used for, if the second judgment result is yes, randomly deleting new positive samples in the expanded first-level unbalanced data set so that the number of positive samples in the expanded first-level unbalanced data set after deletion is equal to the number of negative samples in the negative sample data set, and outputting the expanded first-level unbalanced data set after deletion;
a third judgment result obtaining submodule, configured to, if the second judgment result is no, judge whether the number of positive samples in the expanded first-level unbalanced data set is equal to the number of negative samples in the negative sample data set, and obtain a third judgment result;
the expanded positive sample data set output submodule is used for outputting the expanded first-level unbalanced data set if the third judgment result is yes;
an updating submodule, configured to, if the third judgment result is no, update the first-level unbalanced data set to the expanded first-level unbalanced data set, and return to the step of "obtaining, for each positive sample in the first-level unbalanced data set, the K nearest neighbor samples of that positive sample in the sample data set";
wherein p_i is the i-th boundary positive sample,
Figure GDA0003964669830000162
is the d-th neighbor positive sample among the K nearest neighbor samples of the boundary positive sample p_i, K is the number of nearest neighbor samples of the boundary positive sample p_i, m_i is the number of neighbor positive samples among the K nearest neighbor samples of the boundary positive sample p_i, p_{inew,d} is the new positive sample of the positive sample p_i, rand() is a random function, and rand(0, 1) is a random number generated within (0, 1).
The second extended positive sample data set obtaining module specifically includes:
a density determination submodule for utilizing a formula
Figure GDA0003964669830000163
Determining a density of each positive sample in the second level imbalance dataset;
the normalized density obtaining submodule is used for normalizing the density of each positive sample in the second-level unbalanced data set to obtain the normalized density of each positive sample in the second-level unbalanced data set;
the density set forming submodule is used for arranging the normalized densities of all the positive samples in the second-level unbalanced data set from large to small to form a density set;
a second new positive sample set initialization submodule, configured to initialize the second new positive sample set to an empty set;
an initial value setting submodule for letting l =1;
the density-corresponding positive sample determining submodule is used for determining the positive sample corresponding to the l-th density in the density set;
a sample number determining submodule, configured to use the formula m_l = κ × ρ(p_l) to determine the number of samples m_l for the positive sample p_l corresponding to the l-th density in the density set;
a neighbor positive sample obtaining submodule, configured to obtain the m_l neighbor positive samples nearest to the positive sample p_l in the second-level unbalanced data set;
a new sample obtaining submodule, configured to use, according to the positive sample p_l and the m_l neighbor positive samples, the formula
Figure GDA0003964669830000171
to obtain m_l new samples;
an updated second new positive sample set obtaining submodule, configured to add the m_l new samples to the second new positive sample set to obtain an updated second new positive sample set;
the expanded second-level unbalanced data set obtaining submodule is used for merging the updated second new positive sample set and the second-level unbalanced data set to obtain an expanded second-level unbalanced data set;
a fourth judgment result obtaining submodule, configured to judge whether the number of positive samples in the expanded second-level unbalanced data set is greater than the number of negative samples in the negative sample data set, and obtain a fourth judgment result;
the deleted expanded second-level unbalanced data set output submodule is used for, if the fourth judgment result is yes, randomly deleting new positive samples in the expanded second-level unbalanced data set so that the number of positive samples in the expanded second-level unbalanced data set after deletion is equal to the number of negative samples in the negative sample data set, and outputting the expanded second-level unbalanced data set after deletion;
a fifth judgment result obtaining submodule, configured to, if the fourth judgment result is no, judge whether the number of positive samples in the expanded second-level unbalanced data set is equal to the number of negative samples in the negative sample data set, and obtain a fifth judgment result;
the expanded second positive sample data set output submodule is used for outputting the expanded second-level unbalanced data set if the fifth judgment result is yes;
a returning step submodule, configured to, if the fifth judgment result is no, increase l by 1 and return to the step "determining the positive sample corresponding to the l-th density in the density set";
wherein 1 ≤ l ≤ min date, p_j is the j-th positive sample in the second-level unbalanced data set, p_g is the g-th positive sample in the second-level unbalanced data set, ρ(p_j) is the density of the positive sample p_j in the second-level unbalanced data set, min date is the number of positive samples in the second-level unbalanced data set, and dis(p_j, p_g) is the Euclidean distance between the positive sample p_j and the positive sample p_g in the second-level unbalanced data set,
Figure GDA0003964669830000181
n is the dimension of the positive samples, p_jk is the k-th dimension data of the positive sample p_j, p_gk is the k-th dimension data of the positive sample p_g, κ is an adjustable coefficient,
Figure GDA0003964669830000182
is the b-th neighbor positive sample among the m_l neighbor positive samples, p_{inew,b} is the new sample generated from the positive sample p_l and the b-th neighbor positive sample among the m_l neighbor positive samples, rand() is a random function, and rand(0, 1) is a random number generated within (0, 1).
The system further comprises: and the visualization and storage module is used for visualizing and storing the MES system balance data set.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the description of the method part.
The principle and the embodiment of the present invention are explained by applying specific examples, and the above description of the embodiments is only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (6)

1. An unbalanced data-oriented small sample data expansion method, characterized in that the method comprises:
extracting MES system unbalanced data from an upper platform of an MES interconnection and intercommunication system, and forming a sample data set by all the MES system unbalanced data; the sample data set comprises a positive sample data set and a negative sample data set;
obtaining a difference value between the number of positive samples in the positive sample data set and the number of negative samples in the negative sample data set, and judging whether the difference value is smaller than a difference value threshold value to obtain a first judgment result;
if the first judgment result is yes, determining that the positive sample data set is a first-level unbalanced data set;
expanding the first-level unbalanced data set by adopting a Borderline-SMOTE algorithm to obtain an expanded positive sample data set;
if the first judgment result is no, determining that the positive sample data set is a second-level unbalanced data set;
expanding the second-level unbalanced data set by adopting a density-based SMOTE algorithm to obtain an expanded positive sample data set;
the expanded positive sample data set and the negative sample data set form an MES system balance data set;
storing the MES system balanced data set, and then visualizing the data by utilizing a visualization tool to evaluate the MES data;
wherein expanding the first-level unbalanced data set by adopting the Borderline-SMOTE algorithm to obtain the expanded positive sample data set specifically comprises:
for each positive sample in the first-level unbalanced data set, acquiring the K nearest neighbor samples of that positive sample in the sample data set;
respectively counting the number of neighbor positive samples and the number of neighbor negative samples among the K nearest neighbor samples of each positive sample;
determining each positive sample for which the number of neighbor positive samples is less than the number of neighbor negative samples and the number of neighbor positive samples is greater than 0 as a boundary positive sample lying on the boundary between positive and negative samples;
according to each boundary positive sample and the neighbor positive samples among the K nearest neighbor samples of that boundary positive sample, using the formula
Figure FDA0003964669820000021
Obtaining a new positive sample for each boundary positive sample in the first level imbalance dataset;
new positive samples of all boundary positive samples in the first level unbalanced data set form a first new positive sample set;
merging the first new positive sample set and the first level unbalanced data set to obtain an expanded first level unbalanced data set;
judging whether the number of positive samples in the expanded first-level unbalanced data set is greater than the number of negative samples in the negative sample data set or not, and obtaining a second judgment result;
if the second judgment result is yes, randomly deleting new positive samples in the expanded first-level unbalanced data set so that the number of positive samples in the expanded first-level unbalanced data set after deletion is equal to the number of negative samples in the negative sample data set, and outputting the expanded first-level unbalanced data set after deletion;
if the second judgment result is no, judging whether the number of positive samples in the expanded first-level unbalanced data set is equal to the number of negative samples in the negative sample data set, and obtaining a third judgment result;
if the third judgment result is yes, outputting the expanded first-level unbalanced data set;
if the third judgment result is no, updating the first-level unbalanced data set to the expanded first-level unbalanced data set, and returning to the step of "acquiring, for each positive sample in the first-level unbalanced data set, the K nearest neighbor samples of that positive sample in the sample data set";
wherein p_i is the i-th boundary positive sample,
Figure FDA0003964669820000022
is the d-th neighbor positive sample among the K nearest neighbor samples of the boundary positive sample p_i, K is the number of nearest neighbor samples of the boundary positive sample p_i, m_i is the number of neighbor positive samples among the K nearest neighbor samples of the boundary positive sample p_i, p_{inew,d} is the new positive sample of the positive sample p_i, rand() is a random function, and rand(0, 1) is a random number generated within (0, 1).
2. The imbalance-data-oriented small sample data expansion method according to claim 1, wherein the expanding the second-level imbalance data set by using a density-based SMOTE algorithm to obtain an expanded positive sample data set specifically comprises:
using formulas
Figure FDA0003964669820000031
Determining a density of each positive sample in the second level imbalance dataset;
normalizing the density of each positive sample in the second-level unbalanced data set to obtain the normalized density of each positive sample in the second-level unbalanced data set;
arranging the normalized densities of all positive samples in the second-level unbalanced data set from large to small to form a density set;
initializing a second new positive sample set as an empty set;
let l =1;
determining the positive sample corresponding to the l-th density in the density set;
using the formula m_l = κ × ρ(p_l), determining the number of samples m_l for the positive sample p_l corresponding to the l-th density in the density set;
obtaining the m_l neighbor positive samples nearest to the positive sample p_l in the second-level unbalanced data set;
according to the positive sample p_l and the m_l neighbor positive samples, using the formula
Figure FDA0003964669820000032
obtaining m_l new samples;
adding the m_l new samples to the second new positive sample set to obtain an updated second new positive sample set;
merging the updated second new positive sample set and the second-level unbalanced data set to obtain an expanded second-level unbalanced data set;
judging whether the number of positive samples in the expanded second-level unbalanced data set is larger than the number of negative samples in the negative sample data set or not, and obtaining a fourth judgment result;
if the fourth judgment result is yes, randomly deleting new positive samples in the expanded second-level unbalanced data set so that the number of positive samples in the expanded second-level unbalanced data set after deletion is equal to the number of negative samples in the negative sample data set, and outputting the expanded second-level unbalanced data set after deletion;
if the fourth judgment result is no, judging whether the number of positive samples in the expanded second-level unbalanced data set is equal to the number of negative samples in the negative sample data set, and obtaining a fifth judgment result;
if the fifth judgment result is yes, outputting the expanded second-level unbalanced data set;
if the fifth judgment result is no, increasing the value of l by 1, and returning to the step of "determining the positive sample corresponding to the l-th density in the density set";
wherein 1 ≤ l ≤ mindate, p_j is the j-th positive sample in the second-level unbalanced data set, p_g is the g-th positive sample in the second-level unbalanced data set, ρ(p_j) is the density of the positive sample p_j in the second-level unbalanced data set, mindate is the number of positive samples in the second-level unbalanced data set, and dis(p_j, p_g) is the Euclidean distance between the positive sample p_j and the positive sample p_g in the second-level unbalanced data set,
Figure FDA0003964669820000041
n is the dimension of the positive samples, p_jk is the k-th dimension data of the positive sample p_j, p_gk is the k-th dimension data of the positive sample p_g, κ is an adjustable coefficient,
Figure FDA0003964669820000042
is the b-th neighbor positive sample among the m_l neighbor positive samples, p_{inew,b} is the new sample generated from the positive sample p_l and the b-th neighbor positive sample among the m_l neighbor positive samples, rand() is a random function, and rand(0, 1) is a random number generated within (0, 1).
3. The unbalanced-data-oriented small sample data expansion method of claim 1, wherein after the expanded positive sample data set and the negative sample data set form the MES system balanced data set, the method further comprises:
visualizing and storing the MES system balanced data set.
4. An unbalanced data oriented small sample data augmentation system, the system comprising:
the sample data set forming module is used for extracting the MES system unbalanced data from an upper-layer platform of the MES interconnection and interworking system and forming the MES system unbalanced data into a sample data set; the sample data set comprises a positive sample data set and a negative sample data set;
a first judgment result obtaining module, configured to obtain a difference between the number of positive samples in the positive sample data set and the number of negative samples in the negative sample data set, judge whether the difference is smaller than a difference threshold, and obtain a first judgment result;
the first-level unbalanced data set judging module is used for determining that the positive sample data set is a first-level unbalanced data set if the first judgment result is yes;
a first expanded positive sample data set obtaining module, configured to expand the first-level unbalanced data set by using a Borderline-SMOTE algorithm to obtain an expanded positive sample data set;
the second-level unbalanced data set judging module is used for determining that the positive sample data set is a second-level unbalanced data set if the first judgment result is no;
a second extended positive sample data set obtaining module, configured to extend the second level unbalanced data set by using a density-based SMOTE algorithm to obtain an extended positive sample data set;
an MES system balance data set forming module, which is used for forming an MES system balance data set by the extended positive sample data set and the negative sample data set;
wherein the first expanded positive sample data set obtaining module specifically comprises:
a neighbor sample obtaining submodule, configured to obtain, for each positive sample in the first-level unbalanced data set, the K samples in the sample data set nearest to that positive sample;
a quantity counting submodule, configured to count, for each positive sample, the number of neighbor positive samples and the number of neighbor negative samples among its K nearest neighbor samples;
a boundary positive sample determining submodule, configured to determine, as boundary positive samples on the boundary between positive and negative samples, the positive samples whose number of neighbor positive samples is less than their number of neighbor negative samples and greater than 0;
a new positive sample obtaining submodule, configured to obtain a new positive sample for each boundary positive sample in the first-level unbalanced data set from the boundary positive sample and the neighbor positive samples among its K nearest neighbor samples, using the formula
Figure FDA0003964669820000051
;
a first new positive sample set forming submodule, configured to form a first new positive sample set from the new positive samples of all boundary positive samples in the first-level unbalanced data set;
an expanded first-level unbalanced data set obtaining submodule, configured to merge the first new positive sample set and the first-level unbalanced data set to obtain an expanded first-level unbalanced data set;
a second judgment result obtaining submodule, configured to judge whether the number of positive samples in the expanded first-level unbalanced data set is greater than the number of negative samples in the negative sample data set, and obtain a second judgment result;
a deleted expanded first-level unbalanced data set output submodule, configured to, if the second judgment result is yes, randomly delete new positive samples from the expanded first-level unbalanced data set so that the number of positive samples remaining equals the number of negative samples in the negative sample data set, and output the resulting data set;
a third judgment result obtaining submodule, configured to, if the second judgment result is no, judge whether the number of positive samples in the expanded first-level unbalanced data set is equal to the number of negative samples in the negative sample data set, and obtain a third judgment result;
an expanded positive sample data set output submodule, configured to output the expanded first-level unbalanced data set if the third judgment result is yes;
an updating submodule, configured to, if the third judgment result is no, update the first-level unbalanced data set to the expanded first-level unbalanced data set and return to the step of obtaining, for each positive sample in the first-level unbalanced data set, the K samples in the sample data set nearest to that positive sample;
wherein p_i is the i-th boundary positive sample;
Figure FDA0003964669820000062
is the d-th nearest neighbor positive sample among the K nearest neighbor samples of boundary positive sample p_i; K is the number of nearest neighbor samples of boundary positive sample p_i; m_i is the number of neighbor positive samples among the K nearest neighbor samples of boundary positive sample p_i; p_{inew,d} is the new positive sample generated from positive sample p_i and its d-th neighbor positive sample; rand() is a random function; and rand(0, 1) is a random number generated within (0, 1).
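The level selection and the boundary positive sample test described in this claim can be sketched in Python as follows. This is an illustration under stated assumptions rather than the patented implementation: the names `choose_level` and `boundary_positives` are hypothetical, the difference is taken as an absolute value, and each positive sample is excluded from its own neighbor list, none of which the claim text spells out.

```python
import math


def choose_level(n_pos, n_neg, diff_threshold):
    """Return 1 (Borderline-SMOTE branch) if the class-size difference is
    below the threshold, otherwise 2 (density-based SMOTE branch)."""
    # Assumption: the difference is the absolute difference of the counts.
    return 1 if abs(n_neg - n_pos) < diff_threshold else 2


def boundary_positives(positives, negatives, k):
    """Positive samples whose K nearest neighbors contain at least one
    positive sample but fewer positives than negatives."""
    labeled = [(p, 1) for p in positives] + [(q, 0) for q in negatives]
    result = []
    for p in positives:
        # Assumption: a sample is not counted as its own neighbor.
        others = [(s, y) for s, y in labeled if s is not p]
        others.sort(key=lambda sy: math.dist(p, sy[0]))
        nearest = others[:k]
        n_pos = sum(y for _, y in nearest)
        n_neg = len(nearest) - n_pos
        if 0 < n_pos < n_neg:
            result.append(p)
    return result


if __name__ == "__main__":
    pos = [(0.0, 0.0), (0.0, 1.0), (4.0, 4.0), (4.05, 4.0)]
    neg = [(4.0, 4.1), (4.1, 4.1), (4.1, 3.9), (8.0, 8.0)]
    print(choose_level(len(pos), len(neg), diff_threshold=2))
    print(boundary_positives(pos, neg, k=3))
```

The condition `0 < n_pos < n_neg` mirrors the claim: positives with no positive neighbors at all are skipped, and positives surrounded mostly by their own class are treated as safe and are not used to generate new samples.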
5. The unbalanced-data-oriented small sample data expansion system of claim 4, wherein the second expanded positive sample data set obtaining module specifically comprises:
a density determination submodule, configured to determine the density of each positive sample in the second-level unbalanced data set using the formula
Figure FDA0003964669820000061
;
a normalized density obtaining submodule, configured to normalize the density of each positive sample in the second-level unbalanced data set to obtain the normalized density of each positive sample in the second-level unbalanced data set;
a density set forming submodule, configured to arrange the normalized densities of all positive samples in the second-level unbalanced data set in descending order to form a density set;
a second new positive sample set initialization submodule, configured to initialize the second new positive sample set to an empty set;
an initial value setting submodule, configured to set l = 1;
a density-corresponding positive sample determining submodule, configured to determine the positive sample corresponding to the l-th density in the density set;
a sample number determining submodule, configured to determine, using the formula m_l = κ × ρ(p_l), the sample number m_l of the positive sample p_l corresponding to the l-th density in the density set;
a neighbor positive sample obtaining submodule, configured to obtain the m_l neighbor positive samples nearest to positive sample p_l in the second-level unbalanced data set;
a new sample obtaining submodule, configured to obtain m_l new samples from positive sample p_l and the m_l neighbor positive samples, using the formula
Figure FDA0003964669820000071
;
an updated second new positive sample set obtaining submodule, configured to add the m_l new samples to the second new positive sample set to obtain an updated second new positive sample set;
an expanded second-level unbalanced data set obtaining submodule, configured to merge the updated second new positive sample set and the second-level unbalanced data set to obtain an expanded second-level unbalanced data set;
a fourth judgment result obtaining submodule, configured to judge whether the number of positive samples in the expanded second-level unbalanced data set is greater than the number of negative samples in the negative sample data set, and obtain a fourth judgment result;
a deleted expanded second-level unbalanced data set output submodule, configured to, if the fourth judgment result is yes, randomly delete new positive samples from the expanded second-level unbalanced data set so that the number of positive samples remaining equals the number of negative samples in the negative sample data set, and output the resulting data set;
a fifth judgment result obtaining submodule, configured to, if the fourth judgment result is no, judge whether the number of positive samples in the expanded second-level unbalanced data set is equal to the number of negative samples in the negative sample data set, and obtain a fifth judgment result;
an expanded second positive sample data set output submodule, configured to output the expanded second-level unbalanced data set if the fifth judgment result is yes;
a returning submodule, configured to, if the fifth judgment result is no, increase l by 1 and return to the step of determining the positive sample corresponding to the l-th density in the density set;
wherein 1 ≤ l ≤ mindate; p_j is the j-th positive sample in the second-level unbalanced data set; p_g is the g-th positive sample in the second-level unbalanced data set; ρ(p_j) is the density of positive sample p_j in the second-level unbalanced data set; mindate is the number of positive samples in the second-level unbalanced data set; dis(p_j, p_g) is the Euclidean distance between positive sample p_j and positive sample p_g in the second-level unbalanced data set,
$$\mathrm{dis}(p_j, p_g) = \sqrt{\sum_{k=1}^{n} \left(p_{jk} - p_{gk}\right)^2},$$
n is the dimension of the positive samples; p_{jk} is the k-th dimension of positive sample p_j; p_{gk} is the k-th dimension of positive sample p_g; κ is an adjustable coefficient;
Figure FDA0003964669820000082
is the b-th neighbor positive sample among the m_l neighbor positive samples; p_{inew,b} is the new sample generated from positive sample p_l and the b-th neighbor positive sample among the m_l neighbor positive samples; rand() is a random function; and rand(0, 1) is a random number generated within (0, 1).
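The density-ordered generation loop of this claim can be sketched in Python as follows. Several points are assumptions and not taken from the patent: the function name `density_smote` is hypothetical, the normalized densities are passed in as an argument because the density formula is an image not reproduced in this text, m_l is rounded to the nearest integer, and new samples use the standard SMOTE interpolation.

```python
import math
import random


def density_smote(positives, densities, n_neg, kappa, rng=random):
    """Generate new positive samples, visiting positives by descending density.

    densities[j] is the normalized density of positives[j], computed elsewhere.
    """
    if len(positives) >= n_neg:
        raise ValueError("positive class must be smaller than negative class")
    # Density set: indices of positives sorted from highest to lowest density.
    order = sorted(range(len(positives)), key=lambda j: densities[j], reverse=True)
    new_samples = []  # the second new positive sample set, initially empty
    for l in order:
        p_l = positives[l]
        # m_l = kappa * rho(p_l); rounding to an integer is an assumption.
        m_l = max(0, int(round(kappa * densities[l])))
        others = [q for j, q in enumerate(positives) if j != l]
        others.sort(key=lambda q: math.dist(p_l, q))
        for q in others[:m_l]:
            r = rng.random()  # assumed SMOTE-style interpolation weight
            new_samples.append([a + r * (b - a) for a, b in zip(p_l, q)])
        surplus = len(positives) + len(new_samples) - n_neg
        if surplus >= 0:
            # Randomly delete surplus new samples so both classes match in size.
            rng.shuffle(new_samples)
            del new_samples[len(new_samples) - surplus:]
            break
    return new_samples


if __name__ == "__main__":
    pos = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    rho = [1.0, 0.8, 0.6, 0.4]
    print(density_smote(pos, rho, n_neg=7, kappa=2, rng=random.Random(0)))
```

Because l is bounded by the number of positive samples, this sketch can return fewer new samples than are needed to reach balance when κ is small; the claim text does not say what happens in that case.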
6. The unbalanced-data-oriented small sample data expansion system of claim 4, further comprising:
a visualization and storage module, configured to visualize and store the MES system balanced data set.
CN202011384923.3A 2020-12-01 2020-12-01 Small sample data expansion method and system for unbalanced data Active CN112396124B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011384923.3A CN112396124B (en) 2020-12-01 2020-12-01 Small sample data expansion method and system for unbalanced data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011384923.3A CN112396124B (en) 2020-12-01 2020-12-01 Small sample data expansion method and system for unbalanced data

Publications (2)

Publication Number Publication Date
CN112396124A CN112396124A (en) 2021-02-23
CN112396124B true CN112396124B (en) 2023-01-24

Family

ID=74604068

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011384923.3A Active CN112396124B (en) 2020-12-01 2020-12-01 Small sample data expansion method and system for unbalanced data

Country Status (1)

Country Link
CN (1) CN112396124B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107563435A (en) * 2017-08-30 2018-01-09 Harbin Institute of Technology Shenzhen Graduate School SVM-based classification method for high-dimensional unbalanced data

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11392846B2 (en) * 2019-05-24 2022-07-19 Canon U.S.A., Inc. Local-adapted minority oversampling strategy for highly imbalanced highly noisy dataset

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
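An unbalanced data classification algorithm combining ADASYN and SMOTE; 蒋华 et al.; 《计算机仿真》; 20200315 (No. 03); full text *
Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning; Hui Han et al.; 《Advances in Intelligent Computing》; 20050823; full text *
MEMS Gyroscope Noise Analysis and Calibration Using Allan Variance and Improved Hann Filter; Zhanyi Yan et al.; 《Proceedings of 2020 Chinese Intelligent Systems Conference》; 20200930; full text *
SMOTE: Synthetic Minority Over-sampling Technique; N. V. Chawla et al.; 《arXiv:1106.1813》; 20110609; full text *
Solutions to the unbalanced data classification problem; 季晨雨; 《电子技术与软件工程》; 20180810 (No. 15); full text *
Intrusion detection for industrial control systems with unbalanced data based on improved Border-SMOTE; 张晓宇 et al.; 《信息网络安全》; 20200710 (No. 07); full text *
An unbalanced data classification algorithm based on class-center interpolation; 齐利泉; 《通信技术》; 20190310 (No. 03); full text *
A refined Borderline-SMOTE method for unbalanced data sets; 杨毅 et al.; 《复旦学报(自然科学版)》; 20171015 (No. 05); full text *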

Also Published As

Publication number Publication date
CN112396124A (en) 2021-02-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant