CN112396124B

CN112396124B - Small sample data expansion method and system for unbalanced data

Info

Publication number: CN112396124B
Application number: CN202011384923.3A
Authority: CN
Inventors: 柴森春; 王昭洋; 周泰民; 崔灵果; 李慧芳; 姚分喜; 张百海
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2020-12-01
Filing date: 2020-12-01
Publication date: 2023-01-24
Anticipated expiration: 2040-12-01
Also published as: CN112396124A

Abstract

The invention relates to a small sample data expansion method and system for unbalanced data. Unbalanced MES system data is extracted from the upper-layer platform of an MES interconnection and intercommunication system, and the imbalance level of the data is determined from the difference between the number of positive samples in the positive sample data set and the number of negative samples in the negative sample data set. Different expansion methods are applied to different imbalance levels. When the positive sample data set is a first-level unbalanced data set, a Borderline-SMOTE algorithm is used to expand the positive sample points at the boundary, which effectively avoids the boundary blurring caused by traditional oversampling and improves data quality to a certain extent. When the positive sample data set is a second-level unbalanced data set, a density-based SMOTE algorithm is used for expansion, which reduces boundary blurring compared with the traditional SMOTE algorithm and ensures the quality of small sample data expansion.

Description

Small sample data expansion method and system for unbalanced data

Technical Field

The invention relates to the technical field of data expansion, in particular to a small sample data expansion method and system for unbalanced data.

Background

With the continuous development of artificial intelligence, machine learning technology is now deeply applied to industrial production. However, when machine learning is applied to classification, regression and similar problems in the production flow, data imbalance is often encountered. For example, in industrial fault diagnosis the amount of fault data is far smaller than the amount of normal data, and in medical diagnosis the number of cases is small; credit card transaction fraud prediction in the financial field and network intrusion prediction in the network security field face similar difficulties caused by data imbalance. In such problems the number of majority-class (negative) samples is far greater than the number of minority-class (positive) samples. In pursuing maximum overall classification accuracy, most basic models tend to update their parameters in favor of the majority class and neglect the correct classification of the minority class, so the minority class is difficult for the classifier to learn, even though the classification accuracy of the minority class is often the greater concern. Therefore, a technique is needed to perform reasonable data expansion on the minority-class samples in the MES system.

Currently, under-sampling and over-sampling methods and ensemble learning methods are widely used to alleviate the training problem of unbalanced data sets, at the data sampling level and the algorithm optimization level respectively. Common oversampling techniques balance the number of samples of each fault class by amplifying the minority fault classes. However, simply copying samples easily causes overfitting on the severely unbalanced data in the MES system, and newly generated samples easily overlap across fault sample classes, so an oversampling algorithm cannot fully guarantee the quality of the amplified data.

Disclosure of Invention

The invention aims to provide a small sample data expansion method and system for unbalanced data, so as to overcome the defects of overfitting of small sample data expansion and data overlapping between newly generated samples in the prior art and ensure the quality of small sample data expansion.

In order to achieve the purpose, the invention provides the following scheme:

a small sample data expansion method oriented to unbalanced data, the method comprising:

extracting MES system unbalanced data from an upper platform of an MES interconnection and interworking system, and forming a sample data set by all the MES system unbalanced data; the sample data set comprises a positive sample data set and a negative sample data set;

obtaining a difference value between the number of positive samples in the positive sample data set and the number of negative samples in the negative sample data set, and judging whether the difference value is smaller than a difference value threshold value to obtain a first judgment result;

if the first judgment result is yes, judging that the positive sample data set is a first-level unbalanced data set;

adopting a Borderline-SMOTE algorithm to expand the first-level unbalanced data set to obtain an expanded positive sample data set;

if the first judgment result is no, judging that the positive sample data set is a second-level unbalanced data set;

expanding the second-level unbalanced data set by adopting a density-based SMOTE algorithm to obtain an expanded positive sample data set;

and the expanded positive sample data set and the negative sample data set form an MES system balanced data set.

Optionally, the expanding the first-level unbalanced data set by using a Borderline-SMOTE algorithm to obtain an expanded positive sample data set specifically includes:

acquiring, for each positive sample in the first-level unbalanced data set, the K nearest neighbor samples of that positive sample in the sample data set;

respectively counting the number of neighbor positive samples and the number of neighbor negative samples in the K nearest neighbor samples of each positive sample;

determining each positive sample whose number of neighbor positive samples is less than its number of neighbor negative samples and greater than 0 as a boundary positive sample on the boundary between positive and negative samples;

according to each boundary positive sample and the neighbor positive samples among the K nearest neighbor samples of the boundary positive sample, using the formula

obtaining a new positive sample for each boundary positive sample in the first-level unbalanced data set;

new positive samples of all boundary positive samples in the first level unbalanced data set constitute a first new positive sample set;

merging the first new positive sample set and the first level unbalanced data set to obtain an expanded first level unbalanced data set;

judging whether the number of positive samples in the expanded first-level unbalanced data set is larger than the number of negative samples in the negative sample data set or not, and obtaining a second judgment result;

if the second judgment result is yes, randomly deleting new positive samples in the expanded first-level unbalanced data set so that the number of positive samples in the expanded first-level unbalanced data set after deletion is equal to the number of negative samples in the negative sample data set, and outputting the expanded first-level unbalanced data set after deletion;

if the second judgment result is no, judging whether the number of positive samples in the expanded first-level unbalanced data set is equal to the number of negative samples in the negative sample data set, and obtaining a third judgment result;

if the third judgment result is yes, outputting the expanded first-level unbalanced data set;

if the third judgment result is no, updating the first-level unbalanced data set to the expanded first-level unbalanced data set, and returning to the step of acquiring, for each positive sample in the first-level unbalanced data set, the K nearest neighbor samples of that positive sample in the sample data set;

wherein p_i is the i-th boundary positive sample,

is the d-th neighbor positive sample among the K nearest neighbor samples of the boundary positive sample p_i, K is the number of nearest neighbor samples of the boundary positive sample p_i, m_i is the number of neighbor positive samples among the K nearest neighbor samples of the boundary positive sample p_i, p_{inew,d} is the new positive sample of the positive sample p_i, rand() is a random function, and rand(0, 1) is a random number generated within (0, 1).
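
The formula referred to above is not reproduced in this text. As a minimal sketch, assuming the standard SMOTE interpolation p_new = p_i + rand(0, 1) × (q − p_i), where q is one of the neighbor positive samples of p_i, a single expansion round might look as follows in Python. The function name `borderline_smote_round`, the default K = 5 and the Euclidean neighbor search are assumptions made for illustration and are not taken from the patent.

```python
import numpy as np

def borderline_smote_round(pos, neg, k=5, rng=None):
    """One expansion round: synthetic positives built from boundary samples.

    Assumes the standard SMOTE interpolation; the patent's own formula
    is not available in this text.
    """
    rng = np.random.default_rng() if rng is None else rng
    pos = np.asarray(pos, dtype=float)
    neg = np.asarray(neg, dtype=float)
    data = np.vstack([pos, neg])                 # whole sample data set
    is_pos = np.arange(len(data)) < len(pos)     # True for positive rows
    k = min(k, len(data) - 1)                    # cannot have more neighbours than samples
    new_samples = []
    for i, p in enumerate(pos):
        dist = np.linalg.norm(data - p, axis=1)  # Euclidean distance (assumed)
        dist[i] = np.inf                         # exclude the sample itself
        nn = np.argsort(dist)[:k]                # the K nearest neighbour samples
        pos_nn = nn[is_pos[nn]]                  # neighbour positive samples
        m_i = len(pos_nn)                        # number of neighbour positives
        n_i = len(nn) - m_i                      # number of neighbour negatives
        if 0 < m_i < n_i:                        # boundary condition stated in the text
            for d in pos_nn:
                # rng.random() draws from [0, 1); the text says (0, 1)
                new_samples.append(p + rng.random() * (data[d] - p))
    if not new_samples:
        return np.empty((0, pos.shape[1]))
    return np.array(new_samples)
```

Each boundary positive sample yields m_i new samples, one per neighbor positive sample, which lie on the line segments joining the boundary sample to those neighbors.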

Optionally, the expanding the second-level unbalanced data set by using a density-based SMOTE algorithm to obtain an expanded positive sample data set specifically includes:

using the formula

determining a density of each positive sample in the second-level unbalanced data set;

normalizing the density of each positive sample in the second-level unbalanced data set to obtain the normalized density of each positive sample in the second-level unbalanced data set;

arranging the normalized densities of all positive samples in the second-level unbalanced data set from large to small to form a density set;

initializing a second new positive sample set as an empty set;

let l =1;

determining the positive sample corresponding to the l-th density in the density set;

using the formula m_l = κ × ρ(p_l), determining the number of samples m_l for the positive sample p_l corresponding to the l-th density in the density set;

obtaining the m_l neighbor positive samples nearest to the positive sample p_l in the second-level unbalanced data set;

according to the positive sample p_l and the m_l neighbor positive samples, using the formula

obtaining m_l new samples;

adding the m_l new samples to the second new positive sample set to obtain an updated second new positive sample set;

merging the updated second new positive sample set and the second-level unbalanced data set to obtain an expanded second-level unbalanced data set;

judging whether the number of positive samples in the expanded second-level unbalanced data set is greater than the number of negative samples in the negative sample data set or not, and obtaining a fourth judgment result;

if the fourth judgment result is yes, randomly deleting new positive samples in the expanded second-level unbalanced data set so that the number of positive samples in the expanded second-level unbalanced data set after deletion is equal to the number of negative samples in the negative sample data set, and outputting the deleted expanded second-level unbalanced data set;

if the fourth judgment result is no, judging whether the number of positive samples in the expanded second-level unbalanced data set is equal to the number of negative samples in the negative sample data set, and obtaining a fifth judgment result;

if the fifth judgment result is yes, outputting the expanded second-level unbalanced data set;

if the fifth judgment result is no, increasing the value of l by 1, and returning to the step of determining the positive sample corresponding to the l-th density in the density set;

wherein 1 ≤ l ≤ min date, p_j is the j-th positive sample in the second-level unbalanced data set, p_g is the g-th positive sample in the second-level unbalanced data set, ρ(p_j) is the density of the positive sample p_j in the second-level unbalanced data set, min date is the number of positive samples in the second-level unbalanced data set, dis(p_j, p_g) is the Euclidean distance between the positive sample p_j and the positive sample p_g in the second-level unbalanced data set,

$$\mathrm{dis}(p_j, p_g) = \sqrt{\sum_{k=1}^{n} \left(p_{jk} - p_{gk}\right)^2},$$

n is the dimension of the positive sample, p_jk is the k-th dimension of the positive sample p_j, p_gk is the k-th dimension of the positive sample p_g, κ is an adjustable coefficient,

is the b-th neighbor positive sample among the m_l neighbor positive samples, p_{inew,b} is the new sample generated from the positive sample p_l and the b-th neighbor positive sample among the m_l neighbor positive samples, rand() is a random function, and rand(0, 1) is a random number generated within (0, 1).
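
The density formula and the sample generation formula are not reproduced in this text, so the following Python sketch only illustrates the control flow of the density-based expansion under stated assumptions. The per-sample density is passed in by the caller rather than computed. The sketch further assumes min-max normalization, that m_l is obtained by rounding κ times the normalized density, and that new samples use the standard SMOTE interpolation. The name `density_smote` and its parameters are hypothetical.

```python
import numpy as np

def density_smote(pos, n_neg, density, kappa=5.0, rng=None):
    """Expand the positive set toward n_neg samples, densest sample first.

    `density` holds one caller-supplied density value per positive sample;
    the patent's own density formula is not available in this text.
    """
    rng = np.random.default_rng() if rng is None else rng
    pos = np.asarray(pos, dtype=float)
    if len(pos) > n_neg:
        raise ValueError("positive samples must not outnumber negative samples")
    rho = np.asarray(density, dtype=float)
    span = rho.max() - rho.min()
    # Assumed min-max normalization; a constant density maps to 1
    rho_norm = (rho - rho.min()) / span if span > 0 else np.ones_like(rho)
    order = np.argsort(-rho_norm)                 # density set, large to small
    new = []                                      # second new positive sample set
    for l in order:
        # Assumption: m_l = kappa * density, rounded, capped by available neighbours
        m_l = min(int(round(kappa * rho_norm[l])), len(pos) - 1)
        if m_l > 0:
            dist = np.linalg.norm(pos - pos[l], axis=1)
            dist[l] = np.inf                      # exclude p_l itself
            for b in np.argsort(dist)[:m_l]:      # m_l nearest neighbour positives
                new.append(pos[l] + rng.random() * (pos[b] - pos[l]))
        if len(pos) + len(new) >= n_neg:          # balanced or overshot: stop
            break
    new = np.array(new) if new else np.empty((0, pos.shape[1]))
    surplus = len(pos) + len(new) - n_neg
    if surplus > 0:                               # randomly delete surplus new samples
        keep = rng.choice(len(new), size=len(new) - surplus, replace=False)
        new = new[keep]
    return np.vstack([pos, new])
```

If the traversal reaches the last density without producing enough samples, this sketch returns a set that is still smaller than the negative sample data set; how that case should be handled is not specified here.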

Optionally, after the expanded positive sample data set and the negative sample data set form the MES system balanced data set, the method further includes:

visualizing and storing the MES system balanced data set.

An unbalanced data oriented small sample data augmentation system, the system comprising:

a sample data set forming module, configured to extract MES system unbalanced data from an upper-layer platform of an MES interconnection and intercommunication system and form all the MES system unbalanced data into a sample data set; the sample data set comprises a positive sample data set and a negative sample data set;

a first judgment result obtaining module, configured to obtain a difference between the number of positive samples in the positive sample data set and the number of negative samples in the negative sample data set, judge whether the difference is smaller than a difference threshold, and obtain a first judgment result;

a first-level unbalanced data set judging module, configured to judge that the positive sample data set is a first-level unbalanced data set if the first judgment result is yes;

a first extended positive sample data set obtaining module, configured to extend the first-level unbalanced data set by using a Borderline-SMOTE algorithm to obtain an extended positive sample data set;

a second-level unbalanced data set determination module, configured to determine that the positive sample data set is a second-level unbalanced data set if the first determination result indicates no;

a second extended positive sample data set obtaining module, configured to extend the second level unbalanced data set by using a density-based SMOTE algorithm to obtain an extended positive sample data set;

and the MES system balanced data set forming module is used for forming an MES system balanced data set by the expanded positive sample data set and the negative sample data set.

Optionally, the first extended positive sample data set obtaining module specifically includes:

a neighbor sample obtaining sub-module, configured to obtain, for each positive sample in the first-level unbalanced data set, the K nearest neighbor samples of that positive sample in the sample data set;

the quantity counting submodule is used for respectively counting the quantity of the neighbor positive samples and the quantity of the neighbor negative samples in the K nearest neighbor samples of each positive sample;

a boundary positive sample determining sub-module, configured to determine each positive sample whose number of neighbor positive samples is less than its number of neighbor negative samples and greater than 0 as a boundary positive sample on the boundary between positive and negative samples;

a new positive sample obtaining sub-module, configured to, according to each boundary positive sample and the neighbor positive samples among the K nearest neighbor samples of the boundary positive sample, use the formula

to obtain a new positive sample for each boundary positive sample in the first-level unbalanced data set;

a first new positive sample set forming submodule, configured to form a first new positive sample set from new positive samples of all boundary positive samples in the first level unbalanced data set;

the expanded first-level unbalanced data set obtaining submodule is used for merging the first new positive sample set and the first-level unbalanced data set to obtain an expanded first-level unbalanced data set;

a second judgment result obtaining submodule, configured to judge whether the number of positive samples in the expanded first-level unbalanced data set is greater than the number of negative samples in the negative sample data set, and obtain a second judgment result;

a deleted expanded first-level unbalanced data set output sub-module, configured to, if the second judgment result is yes, randomly delete new positive samples in the expanded first-level unbalanced data set so that the number of positive samples in the deleted expanded first-level unbalanced data set is equal to the number of negative samples in the negative sample data set, and output the deleted expanded first-level unbalanced data set;

a third determination result obtaining sub-module, configured to determine, if the second determination result indicates no, whether the number of positive samples in the expanded first-level unbalanced data set is equal to the number of negative samples in the negative sample data set, and obtain a third determination result;

an expanded positive sample data set output sub-module, configured to output the expanded first-level unbalanced data set if the third judgment result is yes;

an updating sub-module, configured to, if the third judgment result is no, update the first-level unbalanced data set to the expanded first-level unbalanced data set and return to the step of obtaining, for each positive sample in the first-level unbalanced data set, the K nearest neighbor samples of that positive sample in the sample data set;

wherein p_i is the i-th boundary positive sample,

is the d-th neighbor positive sample among the K nearest neighbor samples of the boundary positive sample p_i, K is the number of nearest neighbor samples of the boundary positive sample p_i, m_i is the number of neighbor positive samples among the K nearest neighbor samples of the boundary positive sample p_i, p_{inew,d} is the new positive sample of the positive sample p_i, rand() is a random function, and rand(0, 1) is a random number generated within (0, 1).

Optionally, the module for obtaining the second extended positive sample data set specifically includes:

a density determination sub-module, configured to use the formula

to determine a density of each positive sample in the second-level unbalanced data set;

the normalized density obtaining submodule is used for normalizing the density of each positive sample in the second-level unbalanced data set to obtain the normalized density of each positive sample in the second-level unbalanced data set;

the density set forming submodule is used for arranging the normalized densities of all the positive samples in the second-level unbalanced data set from large to small to form a density set;

a second new positive sample set initialization submodule, configured to initialize the second new positive sample set to an empty set;

an initial value setting submodule for letting l =1;

a positive sample determining sub-module, configured to determine the positive sample corresponding to the l-th density in the density set;

a sample number determining sub-module, configured to use the formula m_l = κ × ρ(p_l) to determine the number of samples m_l for the positive sample p_l corresponding to the l-th density in the density set;

a neighbor positive sample obtaining sub-module, configured to obtain the m_l neighbor positive samples nearest to the positive sample p_l in the second-level unbalanced data set;

a new sample obtaining sub-module, configured to, according to the positive sample p_l and the m_l neighbor positive samples, use the formula

to obtain m_l new samples;

an updated second new positive sample set obtaining sub-module, configured to add the m_l new samples to the second new positive sample set to obtain an updated second new positive sample set;

an expanded second-level unbalanced data set obtaining submodule, configured to merge the updated second new positive sample set and the second-level unbalanced data set, so as to obtain an expanded second-level unbalanced data set;

a fourth determination result obtaining sub-module, configured to determine whether the number of positive samples in the expanded second-level unbalanced data set is greater than the number of negative samples in the negative sample data set, and obtain a fourth determination result;

a deleted expanded second-level unbalanced data set output sub-module, configured to, if the fourth judgment result is yes, randomly delete new positive samples in the expanded second-level unbalanced data set so that the number of positive samples in the deleted expanded second-level unbalanced data set is equal to the number of negative samples in the negative sample data set, and output the deleted expanded second-level unbalanced data set;

a fifth judgment result obtaining sub-module, configured to, if the fourth judgment result indicates no, judge whether the number of positive samples in the expanded second-level unbalanced data set is equal to the number of negative samples in the negative sample data set, and obtain a fifth judgment result;

an expanded second positive sample data set output sub-module, configured to output the expanded second-level unbalanced data set if the fifth judgment result is yes;

a returning sub-module, configured to increase l by 1 if the fifth judgment result is no, and return to the step "determine a positive sample corresponding to the l-th density in the density set";

wherein 1 ≤ l ≤ min date, p_j is the j-th positive sample in the second-level unbalanced data set, p_g is the g-th positive sample in the second-level unbalanced data set, ρ(p_j) is the density of the positive sample p_j in the second-level unbalanced data set, min date is the number of positive samples in the second-level unbalanced data set, dis(p_j, p_g) is the Euclidean distance between the positive sample p_j and the positive sample p_g in the second-level unbalanced data set,

is the b-th neighbor positive sample among the m_l neighbor positive samples, p_{inew,b} is the new sample generated from the positive sample p_l and the b-th neighbor positive sample among the m_l neighbor positive samples, rand() is a random function, and rand(0, 1) is a random number generated within (0, 1).

Optionally, the system further includes:

a visualization and storage module, configured to visualize and store the MES system balanced data set.

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

the invention provides a small sample data expansion method and system facing unbalanced data, wherein different expansion methods are adopted for different levels of unbalanced data of an MES system, when a positive sample data set is a first-level unbalanced data set, a Borderline-SMOTE algorithm is adopted to expand positive sample points at a boundary, so that the problem of boundary ambiguity caused by traditional oversampling can be effectively avoided, and the data quality is improved to a certain extent; when the positive sample data set is the second-level unbalanced data set, the SMOTE algorithm based on the density is adopted for expansion, compared with the traditional SMOTE algorithm, the problem of boundary ambiguity is reduced, and the quality of small sample data expansion is ensured.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a flowchart of a small sample data expansion method for unbalanced data according to the present invention;

FIG. 2 is a schematic diagram of the Borderline-SMOTE algorithm provided by the present invention;

fig. 3 is a schematic diagram of the SMOTE algorithm provided by the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

The invention provides a small sample data expansion method facing unbalanced data, as shown in fig. 1, the method comprises:

S101, extracting MES system unbalanced data from an upper-layer platform of an MES interconnection and interworking system, and forming a sample data set from all the MES system unbalanced data; the sample data set comprises a positive sample data set and a negative sample data set;

S102, obtaining a difference value between the number of positive samples in the positive sample data set and the number of negative samples in the negative sample data set, judging whether the difference value is smaller than a difference value threshold, and obtaining a first judgment result;

S103, if the first judgment result is yes, judging that the positive sample data set is a first-level unbalanced data set;

S104, expanding the first-level unbalanced data set by adopting a Borderline-SMOTE algorithm to obtain an expanded positive sample data set;

S105, if the first judgment result is no, judging that the positive sample data set is a second-level unbalanced data set;

S106, expanding the second-level unbalanced data set by adopting a density-based SMOTE algorithm to obtain an expanded positive sample data set;

S107, the expanded positive sample data set and the negative sample data set form an MES system balanced data set.

The specific process is as follows:

Step S101: the MES upper-layer data management layer is accessed to extract data. By default, the extracted data is assumed to have already undergone data preprocessing, and the MES data is assumed to contain two classes, a majority class (negative samples) and a minority class (positive samples). The positive sample set is denoted as P and the negative sample set is denoted as N.

Step S102: the number of positive samples (denoted as min date) and the number of negative samples (denoted as max date) in the extracted MES data are counted. If the difference between min date and max date is less than 20%, the data set is determined to be a slightly unbalanced data set, i.e. a first-level unbalanced data set (step S103). If the difference between min date and max date is not less than 20%, the data set is determined to be a severely unbalanced data set, i.e. a second-level unbalanced data set (step S105). The subsequent data expansion module expands the MES data according to this evaluation result.
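
The level decision of step S102 can be sketched in Python as below. The text states only that the difference is "less than 20%", so the sketch assumes the difference is measured relative to the larger class count; the function name `imbalance_level` and this relative measure are assumptions made for illustration.

```python
def imbalance_level(n_pos: int, n_neg: int, threshold: float = 0.20) -> int:
    """Return 1 for a first-level (slightly) unbalanced set, 2 for second-level.

    Assumption: the difference is taken relative to the larger class count.
    """
    if n_pos <= 0 or n_neg <= 0:
        raise ValueError("both classes must be non-empty")
    relative_gap = abs(n_neg - n_pos) / max(n_neg, n_pos)
    return 1 if relative_gap < threshold else 2
```

Under this assumption, 90 positive and 100 negative samples give a gap of 10% and a first-level result, while 50 positive and 100 negative samples give a gap of 50% and a second-level result.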

Step S104: for slightly unbalanced data, data balance can be achieved with only a small amount of expansion of the positive samples. The first-level unbalanced data set is expanded using the Borderline-SMOTE algorithm to obtain an expanded positive sample data set, as shown in FIG. 2, which specifically includes:

determining each positive sample whose number of neighbor positive samples is less than its number of neighbor negative samples and greater than 0 as a boundary positive sample on the boundary between positive and negative samples;

obtaining a new positive sample for each boundary positive sample in the first-level unbalanced data set;

new positive samples of all boundary positive samples in the first level unbalanced data set form a first new positive sample set;

merging the first new positive sample set and the first-level unbalanced data set to obtain an expanded first-level unbalanced data set;

judging whether the number of positive samples in the expanded first-level unbalanced data set is greater than the number of negative samples in the negative sample data set or not, and obtaining a second judgment result;

if the second judgment result is yes, randomly deleting new positive samples in the expanded first-level unbalanced data set so that the number of positive samples in the expanded first-level unbalanced data set after deletion is equal to the number of negative samples in the negative sample data set, and outputting the expanded first-level unbalanced data set after deletion;

if the second judgment result is no, judging whether the number of positive samples in the expanded first-level unbalanced data set is equal to the number of negative samples in the negative sample data set, and obtaining a third judgment result;

if the third judgment result is no, updating the first-level unbalanced data set to the expanded first-level unbalanced data set, and returning to the step of obtaining, for each positive sample in the first-level unbalanced data set, the K nearest neighbor samples of that positive sample in the sample data set;

wherein p_i is the i-th boundary positive sample,

is the d-th neighbor positive sample among the K nearest neighbor samples of the boundary positive sample p_i, K is the number of nearest neighbor samples of the boundary positive sample p_i, m_i is the number of neighbor positive samples among the K nearest neighbor samples of the boundary positive sample p_i, p_{inew,d} is the new positive sample of the positive sample p_i, rand() is a random function, rand(0, 1) is a random number generated within (0, 1), and n_i in FIG. 2 is the number of neighbor negative samples among the K nearest neighbor samples of the boundary positive sample p_i;
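
The second and third judgments amount to a loop that expands the positive set until it reaches the size of the negative set and then trims any surplus. A minimal Python sketch of that control flow follows. The function name `expand_until_balanced`, the callback `expand_round` (any routine that returns new positive samples for the current positive set) and the `max_rounds` safety limit are hypothetical and are not part of the described method.

```python
import numpy as np

def expand_until_balanced(pos, n_neg, expand_round, rng=None, max_rounds=100):
    """Repeat expansion rounds until the positive set reaches n_neg samples."""
    rng = np.random.default_rng() if rng is None else rng
    pos = np.asarray(pos, dtype=float)
    n_orig = len(pos)
    if n_orig > n_neg:
        raise ValueError("positive samples must not outnumber negative samples")
    for _ in range(max_rounds):               # safety limit (assumption)
        if len(pos) >= n_neg:                 # balanced or overshot: stop expanding
            break
        new = expand_round(pos)               # new positive samples of this round
        if len(new) == 0:                     # nothing generated: avoid looping forever
            break
        pos = np.vstack([pos, new])           # merge into the expanded data set
    if len(pos) > n_neg:
        # Randomly delete only new (synthetic) samples, never original ones
        synthetic = np.arange(n_orig, len(pos))
        drop = rng.choice(synthetic, size=len(pos) - n_neg, replace=False)
        pos = np.delete(pos, drop, axis=0)
    return pos
```

Restricting deletion to rows appended after the original samples matches the statement that new positive samples are the ones randomly deleted.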

Step S106: the second-level unbalanced data set is expanded using a density-based SMOTE (Synthetic Minority Oversampling Technique) algorithm to obtain an expanded positive sample data set, as shown in FIG. 3, which specifically includes:

using a formula

determining a density of each positive sample in the second-level unbalanced data set. ρ(p_j) measures the local density of the sample point p_j: the larger ρ(p_j) is, the greater the density near the sample point p_j and the more concentrated the sample points; the smaller ρ(p_j) is, the smaller the density near the sample point p_j.
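
The density formula itself is not reproduced in this text, but it is defined in terms of the Euclidean distances dis(p_j, p_g) between positive samples. As a small sketch, those pairwise distances can be computed in Python as follows; the function name `pairwise_dis` is an assumption, and the step from distances to ρ(p_j) is deliberately left out.

```python
import numpy as np

def pairwise_dis(pos):
    """Matrix D with D[j, g] = Euclidean distance between p_j and p_g."""
    pos = np.asarray(pos, dtype=float)
    diff = pos[:, None, :] - pos[None, :, :]   # shape (count, count, n)
    return np.sqrt((diff ** 2).sum(axis=-1))   # sum over the n dimensions
```

The result is symmetric with a zero diagonal, so each row j holds the distances from p_j to every other positive sample.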

Normalizing the density of each positive sample in the second-level unbalanced data set to obtain the normalized density of each positive sample in the second-level unbalanced data set; the normalization formula used is as follows:

where $\rho_{min}$ is the minimum value in the density set, $\rho_{max}$ is the maximum value in the density set, and $\rho'(p_j)$ is the normalized density.
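The normalization formula itself is not reproduced in this text. Given the definitions of the minimum, the maximum and the normalized density, a standard min-max normalization is a plausible reading, and the following Python sketch is written under that assumption. The function name `normalize_densities` and the handling of the case where all densities are equal are hypothetical.

```python
def normalize_densities(densities):
    """Min-max normalize a list of densities to the range [0, 1]."""
    rho_min = min(densities)
    rho_max = max(densities)
    span = rho_max - rho_min
    if span == 0:
        # Assumption: when every density is identical, map all of them to 0.
        return [0.0 for _ in densities]
    return [(rho - rho_min) / span for rho in densities]
```

For example, densities of 2.0, 4.0 and 6.0 would be mapped to 0.0, 0.5 and 1.0.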

Arranging the normalized densities of all positive samples in the second-level unbalanced data set from largest to smallest to form a density set $H = \{\rho'(p_1), \rho'(p_2), \ldots, \rho'(p_{mindate})\}$, where $\rho'(p_1)$, $\rho'(p_2)$ and $\rho'(p_{mindate})$ are the 1st, 2nd and mindate-th densities in the density set $H$, and $\rho'(p_1) > \rho'(p_2) > \cdots > \rho'(p_{mindate})$.

Initializing a second new positive sample set as an empty set;

let l =1;

determining the positive sample corresponding to the $l$-th density in the density set, traversing sequentially from the sample point corresponding to the first density value in the density set $H$;

using the formula $m_l = \kappa \times \rho(p_l)$ to determine the number of samples $m_l$ for the positive sample $p_l$ corresponding to the $l$-th density in the density set;

obtaining the $m_l$ neighbor positive samples nearest to the positive sample $p_l$ in the second-level unbalanced data set;

according to the positive sample $p_l$ and its $m_l$ neighbor positive samples, using a formula

to obtain $m_l$ new samples;

adding the $m_l$ new samples to the second new positive sample set to obtain an updated second new positive sample set;

if the fourth judgment result indicates yes, randomly deleting new positive samples in the expanded second-level unbalanced data set so that the number of positive samples in the expanded second-level unbalanced data set equals the number of negative samples in the negative sample data set, and outputting the expanded second-level unbalanced data set after deletion;

if the fourth judgment result indicates no, judging whether the number of positive samples in the expanded second-level unbalanced data set is equal to the number of negative samples in the negative sample data set, and obtaining a fifth judgment result;

if the fifth judgment result shows yes, outputting the expanded second-level unbalanced data set;

if the fifth judgment result indicates no, increasing the value of $l$ by 1 and returning to the step of determining the positive sample corresponding to the $l$-th density in the density set;

The processed positive sample data P and negative sample data N are obtained through the steps.
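The traversal described in step S106 can be sketched in Python as follows. This is only an illustration under stated assumptions: the density of each positive sample is passed in as a precomputed list because the density formula is not reproduced in this text; the same density values are used both for ordering and for $m_l = \kappa \times \rho(p_l)$; $m_l$ is rounded to an integer with a floor of 1; and new samples are generated by linear interpolation. The function name `density_smote` and its parameters are hypothetical.

```python
import math
import random


def density_smote(positives, n_negative, densities, kappa, rng=None):
    """Expand `positives` toward `n_negative` samples, densest samples first."""
    rng = rng or random.Random()
    # Traverse positive samples from the highest to the lowest density.
    order = sorted(range(len(positives)), key=lambda j: densities[j], reverse=True)
    new_samples = []  # the "second new positive sample set", initially empty
    for l in order:
        p_l = positives[l]
        # m_l = kappa * rho(p_l); rounding and the floor of 1 are assumptions.
        m_l = max(1, round(kappa * densities[l]))
        # The m_l nearest positive samples are taken from the original set only.
        others = [q for j, q in enumerate(positives) if j != l]
        others.sort(key=lambda q: math.dist(p_l, q))
        for q in others[:m_l]:
            gap = rng.random()
            new_samples.append([a + gap * (b - a) for a, b in zip(p_l, q)])
        surplus = len(positives) + len(new_samples) - n_negative
        if surplus > 0:
            # Fourth judgment is yes: randomly delete new samples only.
            rng.shuffle(new_samples)
            del new_samples[:surplus]
        if surplus >= 0:
            break  # positive and negative counts are now equal
    return positives + new_samples
```

If every positive sample has been visited before the counts are equal, this sketch simply returns the partially expanded set; how the method proceeds in that case is not specified in this passage.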

Step S107 is followed by:

first, the MES system balanced data set is stored; a visualization tool is then used to visualize the data so that the MES data can be evaluated.

The invention produces the following advantages:

1. The invention adopts different data expansion methods according to the degree of data imbalance in the MES interconnection and intercommunication system. When the data is only slightly unbalanced, balance can be reached by expanding a small part of the data, so the existing Borderline-SMOTE algorithm is used to expand the positive sample points at the boundary. This effectively avoids the boundary ambiguity caused by traditional oversampling, improves data quality to a certain extent, and helps improve classification accuracy.

2. When the data is severely unbalanced, the invention uses the density-based SMOTE algorithm for oversampling. The density formula represents how densely samples of the same class cluster near a sample point, and during classification the density reflects, to a certain degree, the importance of that sample point among the samples. The density is therefore used as a coefficient for data expansion: combined with the traditional SMOTE algorithm, a different number of neighbor samples is selected for each sample point. Where the expanded positive sample set exceeds the negative sample set in size, newly generated sample points are randomly deleted, which preserves the information in the original data.

The invention also provides a small sample data expansion system facing unbalanced data, which comprises:

the sample data set forming module is used for extracting the MES system unbalanced data from an upper-layer platform of the MES interconnection and intercommunication system and forming a sample data set from the MES system unbalanced data; the sample data set comprises a positive sample data set and a negative sample data set;

the first judgment result obtaining module is used for obtaining the difference value between the number of positive samples in the positive sample data set and the number of negative samples in the negative sample data set, judging whether the difference value is smaller than a difference value threshold value or not and obtaining a first judgment result;

the first-level unbalanced data set judging module is used for judging that the positive sample data set is the first-level unbalanced data set if the first judgment result indicates yes;

the first expanded positive sample data set obtaining module is used for expanding the first-level unbalanced data set by adopting a Borderline-SMOTE algorithm to obtain an expanded positive sample data set;

the second-level unbalanced data set judging module is used for judging that the positive sample data set is the second-level unbalanced data set if the first judgment result indicates no;

a second extended positive sample data set obtaining module, configured to extend the second-level unbalanced data set by using a density-based SMOTE algorithm to obtain an extended positive sample data set;

and the MES system balanced data set forming module is used for forming an MES system balanced data set from the expanded positive sample data set and the negative sample data set.
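The level decision made by the first judgment result obtaining module and the two judging modules can be sketched as follows. The function name `imbalance_level` is hypothetical, and the sign convention of the difference (negative count minus positive count) is an assumption, since the text only speaks of "the difference value" between the two counts.

```python
def imbalance_level(n_positive, n_negative, diff_threshold):
    """Return 1 for a first-level unbalanced data set, 2 for a second-level one."""
    # Assumption: the difference is taken as negatives minus positives.
    diff = n_negative - n_positive
    # First judgment: is the difference smaller than the threshold?
    return 1 if diff < diff_threshold else 2
```

A result of 1 would route the positive sample data set to the Borderline-SMOTE module, and a result of 2 to the density-based SMOTE module.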

The first augmented positive sample data set obtaining module specifically includes:

the neighbor sample obtaining submodule is used for obtaining K neighbor samples, which are closest to each positive sample in the sample data set, of each positive sample in the first-level unbalanced data set;

a new positive sample obtaining submodule for utilizing a formula according to the boundary positive sample and the nearest K neighbor samples of the boundary positive sample

the first new positive sample set forming submodule is used for forming a first new positive sample set by new positive samples of all boundary positive samples in the first-level unbalanced data set;

the second judgment result obtaining submodule is used for judging whether the number of positive samples in the expanded first-level unbalanced data set is larger than the number of negative samples in the negative sample data set or not and obtaining a second judgment result;

a third judgment result obtaining submodule, configured to, if the second judgment result indicates no, judge whether the number of positive samples in the expanded first-level unbalanced data set is equal to the number of negative samples in the negative sample data set, and obtain a third judgment result;

the expanded positive sample data set output submodule is used for outputting the expanded first-level unbalanced data set if the third judgment result indicates yes;

an updating submodule, configured to, if the third judgment result indicates no, update the first-level unbalanced data set to the expanded first-level unbalanced data set and return to the step of "obtaining K nearest neighbor samples, in the sample data set, of each positive sample in the first-level unbalanced data set, where the K nearest neighbor samples are closest to each positive sample";

wherein $p_i$ is the $i$-th boundary positive sample,

The second extended positive sample data set obtaining module specifically includes:

a density determination submodule for utilizing a formula

an initial value setting submodule for letting l =1;

a sample number determining submodule for using the formula $m_l = \kappa \times \rho(p_l)$ to determine the number of samples $m_l$ for the positive sample $p_l$ corresponding to the $l$-th density in the density set;

a neighboring positive sample acquisition submodule for acquiring the $m_l$ neighbor positive samples nearest to the positive sample $p_l$ in the second-level unbalanced data set;

obtaining $m_l$ new samples;

an updated second new positive sample set obtaining submodule for adding the $m_l$ new samples to the second new positive sample set to obtain an updated second new positive sample set;

the expanded second-level unbalanced data set obtaining submodule is used for merging the updated second new positive sample set and the second-level unbalanced data set to obtain an expanded second-level unbalanced data set;

a fourth judgment result obtaining submodule, configured to judge whether the number of positive samples in the expanded second-level unbalanced data set is greater than the number of negative samples in the negative sample data set, and obtain a fourth judgment result;

the deleted expanded second-level unbalanced data set output submodule is used for, if the fourth judgment result indicates yes, randomly deleting new positive samples in the expanded second-level unbalanced data set so that the number of positive samples in the expanded second-level unbalanced data set after deletion equals the number of negative samples in the negative sample data set, and outputting the expanded second-level unbalanced data set after deletion;

a fifth judgment result obtaining submodule, configured to, if the fourth judgment result indicates no, judge whether the number of positive samples in the expanded second-level unbalanced data set is equal to the number of negative samples in the negative sample data set, and obtain a fifth judgment result;

the expanded second positive sample data set output submodule is used for outputting the expanded second-level unbalanced data set if the fifth judgment result indicates yes;

wherein $1 \le l \le mindate$; $p_j$ is the $j$-th positive sample in the second-level unbalanced data set; $p_g$ is the $g$-th positive sample in the second-level unbalanced data set; $\rho(p_j)$ is the density of the positive sample $p_j$ in the second-level unbalanced data set; mindate is the number of positive samples in the second-level unbalanced data set; and $dis(p_j, p_g)$ is the Euclidean distance between the positive samples $p_j$ and $p_g$ in the second-level unbalanced data set,

$$dis(p_j, p_g) = \sqrt{\sum_{k=1}^{n} (p_{jk} - p_{gk})^2}$$

where $n$ is the dimension of the positive sample, $p_{jk}$ is the data of the $k$-th dimension of the positive sample $p_j$, $p_{gk}$ is the data of the $k$-th dimension of the positive sample $p_g$, and $\kappa$ is an adjustable coefficient.
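As a small illustration of the Euclidean distance between two positive samples over their $n$ dimensions, the following Python sketch computes it directly from the per-dimension data. The function name `dis` mirrors the notation of the text; representing a sample as a list of numbers is an assumption.

```python
import math


def dis(p_j, p_g):
    """Euclidean distance between two samples given as equal-length sequences."""
    # Sum the squared differences over every dimension k, then take the root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p_j, p_g)))
```

For two-dimensional samples (0, 0) and (3, 4), this gives a distance of 5.0.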

The system further comprises a visualization and storage module for visualizing and storing the MES system balanced data set.

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the description of the method part.

The principle and the embodiment of the present invention are explained by applying specific examples, and the above description of the embodiments is only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims

1. An unbalanced data-oriented small sample data expansion method, characterized in that the method comprises:

extracting MES system unbalanced data from an upper platform of an MES interconnection and intercommunication system, and forming a sample data set by all the MES system unbalanced data; the sample data set comprises a positive sample data set and a negative sample data set;

expanding the first-level unbalanced data set by adopting a Borderline-SMOTE algorithm to obtain an expanded positive sample data set;

if the first judgment result indicates no, judging that the positive sample data set is the second-level unbalanced data set;

the expanded positive sample data set and the negative sample data set form an MES system balance data set;

storing the MES system balanced data set, and then visualizing the data by utilizing a visualization tool to evaluate the MES data;

wherein expanding the first-level unbalanced data set by adopting the Borderline-SMOTE algorithm to obtain the expanded positive sample data set specifically comprises:

wherein $p_i$ is the $i$-th boundary positive sample,

2. The imbalance-data-oriented small sample data expansion method according to claim 1, wherein the expanding the second-level imbalance data set by using a density-based SMOTE algorithm to obtain an expanded positive sample data set specifically comprises:

using formulas

initializing a second new positive sample set as an empty set;

let l =1;

using the formula $m_l = \kappa \times \rho(p_l)$ to determine the number of samples $m_l$ for the positive sample $p_l$ corresponding to the $l$-th density in the density set;

obtaining the $m_l$ neighbor positive samples nearest to the positive sample $p_l$ in the second-level unbalanced data set;

according to the positive sample $p_l$ and its $m_l$ neighbor positive samples, using a formula

to obtain $m_l$ new samples;

judging whether the number of positive samples in the expanded second-level unbalanced data set is larger than the number of negative samples in the negative sample data set or not, and obtaining a fourth judgment result;

if the fourth judgment result indicates yes, randomly deleting new positive samples in the expanded second-level unbalanced data set so that the number of positive samples in the expanded second-level unbalanced data set after deletion equals the number of negative samples in the negative sample data set, and outputting the expanded second-level unbalanced data set after deletion;

wherein $1 \le l \le mindate$; $p_j$ is the $j$-th positive sample in the second-level unbalanced data set; $p_g$ is the $g$-th positive sample in the second-level unbalanced data set; $\rho(p_j)$ is the density of the positive sample $p_j$ in the second-level unbalanced data set; mindate is the number of positive samples in the second-level unbalanced data set; and $dis(p_j, p_g)$ is the Euclidean distance between the positive samples $p_j$ and $p_g$ in the second-level unbalanced data set,

3. The unbalanced-data-oriented small sample data expansion method of claim 1, wherein, after the expanded positive sample data set and the negative sample data set constitute an MES system balanced data set, the method further comprises:

and visualizing and storing the MES system balance data set.

4. An unbalanced data oriented small sample data augmentation system, the system comprising:

the second-level unbalanced data set judging module is used for judging that the positive sample data set is a second-level unbalanced data set if the first judgment result indicates no;

an MES system balance data set forming module, which is used for forming an MES system balance data set by the extended positive sample data set and the negative sample data set;

wherein, the first extended positive sample data set obtaining module specifically includes:

a new positive sample obtaining submodule for utilizing a formula according to the boundary positive sample and the nearest positive sample in the K adjacent samples of the boundary positive sample

an updating submodule, configured to, if the third judgment result indicates no, update the first-level unbalanced data set to the expanded first-level unbalanced data set and return to the step of "obtaining K nearest neighbor samples, in the sample data set, of each positive sample in the first-level unbalanced data set, where the K nearest neighbor samples are closest to each positive sample";

wherein $p_i$ is the $i$-th boundary positive sample; the neighbor term in the formula is the $d$-th nearest neighbor positive sample among the K neighbor samples of the boundary positive sample $p_i$; K is the number of nearest neighbor samples of $p_i$; $m_i$ is the number of neighbor positive samples among the K neighbor samples of $p_i$; $p_{inew,d}$ is the $d$-th new positive sample of $p_i$; rand() is a random function, and rand(0, 1) is a random number generated within (0, 1).

5. The unbalanced-data-oriented small sample data expansion system of claim 4, wherein the second expanded positive sample data set obtaining module specifically comprises:

a density determination submodule for utilizing a formula

an initial value setting submodule for letting l =1;

a neighboring positive sample acquisition submodule for acquiring the $m_l$ neighbor positive samples nearest to the positive sample $p_l$ in the second-level unbalanced data set;

obtaining $m_l$ new samples;

wherein $1 \le l \le mindate$; $p_j$ is the $j$-th positive sample in the second-level unbalanced data set; $p_g$ is the $g$-th positive sample in the second-level unbalanced data set; $\rho(p_j)$ is the density of the positive sample $p_j$ in the second-level unbalanced data set; mindate is the number of positive samples in the second-level unbalanced data set; and $dis(p_j, p_g)$ is the Euclidean distance between the positive samples $p_j$ and $p_g$ in the second-level unbalanced data set,

6. The imbalance data-oriented small sample data augmentation system of claim 4, further comprising: