CN109977094B - Semi-supervised learning method for structured data - Google Patents
Semi-supervised learning method for structured data
- Publication number
- CN109977094B · CN201910091581.7A · CN201910091581A
- Authority
- CN
- China
- Prior art keywords
- sample
- discriminator
- generator
- data
- input
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a method for semi-supervised learning of structured data. A semi-supervised adversarial neural network model is constructed for structured data: the original structured data X are preprocessed, and the features of the original data X are divided into a categorical feature subset x_CT and a numerical feature subset x_NL. The original input to the model's discriminator is {x_l, x_u, x_g}, where x_l and x_u are labeled and unlabeled samples respectively, x_g are samples produced by the generator, and x_l and x_u share the same feature set. The categorical feature subset x_CT of a sample is fed into an Embedding layer to obtain the corresponding dense embedding vector E(x_CT), which is then combined with the numerical feature subset x_NL to obtain a sample with the new feature set E(x_CT)+x_NL; Batch Normalization (BN) is applied to obtain normalized samples containing the new feature set, and these new samples are fed into the discriminator for training, while generated samples x_g are fed into the discriminator directly. The generator consists of three fully-connected layers, with BN applied to each layer's output to prevent vanishing gradients; taking noise as input, it produces generated samples x_g with the feature set E(x_CT)+x_NL.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a method for semi-supervised learning of structured data.
Background
The semi-supervised learning problem has attracted wide attention in many areas, such as anomaly detection and email archiving. The raw data in many real-world applications is structured and unlabeled, whereas a supervised learning task requires a large amount of manually labeled data as a training set, and producing labeled data of sufficient quality means more manpower, domain knowledge, and time overhead. Semi-supervised learning is a paradigm proposed to resolve this conflict: it uses a large amount of easily available unlabeled data together with a small amount of manually labeled data to enhance the final performance of a classifier. Essentially, this approach uses a large amount of unlabeled data to correct the hypotheses learned on the labeled data.
Currently, many different technical routes have been proposed to solve the semi-supervised learning problem. The main ones are: (1) self-labeling techniques: on the premise that the classifier tends to predict labels correctly, a classifier trained on a small amount of labeled data assigns to each unlabeled sample its highest-confidence prediction as a new label, yielding a larger labeled data set; representative methods include well-known semi-supervised learning algorithms such as Democratic-Co, Tri-training, Co-Bagging, and Co-training (a minimal illustration is sketched below); (2) generative models and cluster-then-label methods: generative models were the first algorithms in this field to attempt to exploit unlabeled samples; a joint probability model p(x, y) = p(y)p(x|y) is assumed, where p(x|y) is a known mixture distribution such as a Gaussian mixture model, making this a parametric model determined by the labeled and unlabeled data; cluster-then-label methods are very similar to generative models but are not probabilistic, the general idea being to cluster the whole data set and label each cluster with the help of the labeled data; (3) graph-based methods: unlabeled and labeled samples are taken as graph nodes and the similarity between nodes as graph edges, turning the semi-supervised problem into a graph-cut problem; (4) semi-supervised support vector machines (S3VM): an extension of the standard SVM to unlabeled data that realizes the cluster assumption of semi-supervised learning, namely that samples in the same data cluster have similar labels, so that the decision boundary does not cut through dense regions of unlabeled data and the classes are well separated.
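For illustration of route (1) only, the following is a minimal self-labeling (self-training) sketch, assuming a scikit-learn-style base classifier exposing `fit`/`predict_proba`; the helper name and confidence threshold are illustrative, not part of the algorithms cited above.

```python
import numpy as np

def self_training(clf, X_lab, y_lab, X_unl, threshold=0.95, rounds=10):
    """Iteratively add the most confident pseudo-labeled samples to the labeled set."""
    X_lab, y_lab, X_unl = X_lab.copy(), y_lab.copy(), X_unl.copy()
    for _ in range(rounds):
        if len(X_unl) == 0:
            break
        clf.fit(X_lab, y_lab)
        proba = clf.predict_proba(X_unl)
        confident = proba.max(axis=1) >= threshold   # keep only high-confidence predictions
        if not confident.any():
            break
        X_lab = np.vstack([X_lab, X_unl[confident]])
        y_lab = np.concatenate([y_lab, clf.classes_[proba[confident].argmax(axis=1)]])
        X_unl = X_unl[~confident]
    return clf.fit(X_lab, y_lab)
```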
However, in data mining or any type of predictive modeling, the data comes before the algorithm/method. This is the main reason that machine learning requires extensive feature engineering before specific tasks such as anomaly detection, email archiving, and many other kinds of "non-standard" data can be fed into the SSL models described above. Many of the semi-supervised learning algorithms mentioned above also require extensive feature engineering to achieve good model performance. In contrast, deep neural networks can accomplish these types of tasks well without any tedious and time-consuming feature engineering, which most of the time requires domain knowledge, creativity, and extensive trial and error. Of course, domain expertise and sophisticated feature engineering remain very valuable.
Disclosure of Invention
To address these technical problems, the invention constructs a new semi-supervised learning algorithm suitable for structured data based on a generative adversarial network (GAN). The model consists mainly of a generator and a discriminator. The discriminator's goals are, first, to judge whether a sample is a real sample or a sample produced by the generator and, second, to classify samples with class labels; the generator's goal is to take Gaussian noise as input and produce generated samples realistic enough to pass for real. Through adversarial training between the generator and the discriminator, a generator capable of producing convincing fakes and a discriminator with strong ability to distinguish real from fake are finally obtained, while a small number of labeled samples are used to train the discriminator into an excellent classifier. The whole model is composed of multi-layer fully-connected networks, so complex feature engineering is reduced or even avoided, and the classification performance of the model on structured data is significantly improved.
Inspired by the advantages of neural networks and based on research on Feature Matching (FM) GANs and Bad GANs, an Embedding GAN (EmGAN for short), adapted to data mining application scenarios, is used for semi-supervised learning on structured data, saving a large amount of feature engineering compared with traditional semi-supervised learning algorithms.
Structured data means data in which the rows can be viewed as collected data points or observations and the columns as fields representing the individual attributes of each observation.
The main differences between structured data and image data are: (1) unstructured data such as images are typically treated as a single entity measured in units such as pixels or audio samples, whereas structured data contains more data types, mainly numerical data and categorical data; (2) structured data lacks the spatial correlation that exists between the pixels of image data.
As noted in (1), structured data contains a large number of categorical features. Conventional processing methods are label/ordinal encoding and one-hot encoding, but these techniques have problems in terms of memory and in truly representing the hierarchy of categories. Moreover, the continuous nature of neural networks limits their direct application to categorical features, so representing categorical features with integers and then applying a neural network directly does not yield good results. Drawing on the idea of entity embedding and of recommendation-system algorithms such as DeepFM, the method maps the categorical features of each real sample into a high-dimensional dense space and then inputs them, together with the other numerical features, into the discriminator for training. Influenced by (2), the GAN here does not use CV network structures such as convolution and pooling, avoiding the loss of feature information. Combined with the theoretical research on Feature Matching (FM) GANs and Bad GANs, a combination of generator and discriminator loss functions suitable for data mining scenarios is proposed. Finally, the method is compared with other semi-supervised learning algorithms applied to structured data on public data sets from UCI and KEEL.
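For illustration, a minimal PyTorch-style sketch of the entity-embedding step just described: one embedding table per categorical feature, with the embedded vectors concatenated to the numerical features and batch-normalized. Module and argument names (EntityEmbedding, emb_dims, n_numerical) are hypothetical, not taken from the patent.

```python
import torch
import torch.nn as nn

class EntityEmbedding(nn.Module):
    """Map categorical features to dense vectors and join them with numerical features."""
    def __init__(self, emb_dims, n_numerical):
        # emb_dims: list of (cardinality, embedding_size) pairs, one per categorical feature
        super().__init__()
        self.embeddings = nn.ModuleList(nn.Embedding(card, size) for card, size in emb_dims)
        total = sum(size for _, size in emb_dims) + n_numerical
        self.bn = nn.BatchNorm1d(total)  # Batch Normalization over the new feature set

    def forward(self, x_cat, x_num):
        # x_cat: LongTensor (batch, n_categorical); x_num: FloatTensor (batch, n_numerical)
        embedded = [emb(x_cat[:, i]) for i, emb in enumerate(self.embeddings)]
        new_features = torch.cat(embedded + [x_num], dim=1)  # E(x_CT) + x_NL
        return self.bn(new_features)
```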
The present invention is directed at solving at least the problems of the prior art. The invention therefore discloses a method for semi-supervised learning of structured data, characterized by constructing a semi-supervised adversarial neural network model structure for structured data (called Embedding Generative Adversarial Net, Embedding GAN for short). The original data X are preprocessed, and the feature set of the original data X is divided into two parts: a categorical feature subset x_CT and a numerical feature subset x_NL. The original input samples of the model's discriminator are {x_l, x_u, x_g}, where x_l and x_u are labeled and unlabeled data samples respectively and x_g are the samples produced by the generator. The categorical features x_CT are fed into an Embedding layer to obtain the corresponding dense embedding vectors E(x_CT), which are then combined with the numerical feature subset x_NL to obtain samples with the new feature set E(x_CT)+x_NL; Batch Normalization (BN) is applied to obtain normalized samples containing the new feature set, and these new samples are finally fed into the discriminator for training, while generated samples x_g are used directly as discriminator input. The generator G(z; θ_g) consists of three fully-connected layers, with BN applied to each layer's output to prevent vanishing gradients. Noise following the probability distribution p_z(z) is taken as input, and the generated sample distribution p_G is made to fit the distribution of the preprocessed real samples; through a multi-layer perceptron, the input noise z is learned to be mapped to data E(x_CT)+x_NL, and finally a generated sample x_g close to the real samples is output. Here p(y = K+1 | x) denotes the probability that x is a fake sample.
Further, the real samples are recombined through the Embedding layer into new feature samples E(x_CT)+x_NL, and the generator is required to match E(x_CT)+x_NL. In Embedding GAN, the loss function of the generator is:

L_G = \left\| \mathbb{E}_{x \sim p_{data}} f\big(E(x_{CT})+x_{NL}\big) - \mathbb{E}_{z \sim p_z(z)} f\big(G(z)\big) \right\|_2^2

where G denotes the functional expression of the generator model, the function f is the expression of the output statistical characteristic of the discriminator D, \mathbb{E}_{x \sim p_{data}} f(E(x_{CT})+x_{NL}) is the expected statistical characteristic of the real samples, and \mathbb{E}_{z \sim p_z(z)} f(G(z)) is the statistical characteristic obtained by passing the input noise through the generator G and then through the function f.
Furthermore, to guarantee strong real/fake confidence under the optimal condition, a conditional entropy term L_entropy is added to the objective function. The new discriminator objective function is:

L_D = L_{supervised} + L_{unsupervised} + L_{entropy}

where:

L_{supervised} = -\mathbb{E}_{(x_l, y_l) \sim p_{data}} \log p_{model}(y_l \mid x_l, y_l \le K)

L_{unsupervised} = -\mathbb{E}_{x_u \sim p_{data}} \log p_{model}(y \le K \mid x_u) - \mathbb{E}_{z \sim p_z(z)} \log p_{model}(y = K+1 \mid G(z))

L_{entropy} = -\mathbb{E}_{x_u \sim p_{data}} \sum_{k=1}^{K} p_{model}(k \mid x_u) \log p_{model}(k \mid x_u)

Here, L_{supervised} denotes the cross entropy between the confidence (probability) output for each of the first K classes and the true label y_l when the discriminator D receives labeled data x_l and the corresponding class label y_l; the first term of L_{unsupervised} is the cross entropy between the confidence output for the first K classes and the condition y ≤ K when the discriminator input is an unlabeled sample x_u, and its second term is the cross entropy between the confidence output for the (K+1)-th class and the label y = K+1 when the input is a generated sample x_g; L_{entropy} is the conditional entropy term, representing the information entropy computed from the confidences of the K classes when the input is an unlabeled sample x_u.
This new combination of generator and discriminator objective functions can produce a complement generator that helps the discriminator find the correct classification decision boundary. In each model iteration, the generator's parameters are updated once or twice by its objective function, and the discriminator's parameters are updated once to optimize the discriminator objective function.
The beneficial effects of the invention are: 1. an adversarial neural network model is applied to the semi-supervised learning task on structured data and achieves better results; 2. addressing the fact that structured data contains a large number of categorical features, the categorical features of each real sample are mapped into a high-dimensional dense space by an Embedding layer and then input into the discriminator together with the other numerical features for training, which effectively improves model performance; 3. for the structured-data application scenario, a new combination of generator and discriminator objective functions is proposed, so that the generator can effectively produce complement samples and the discriminator can more accurately find the decision boundaries between real classes, yielding an excellent classifier model.
Drawings
The invention will be further understood from the following description in conjunction with the accompanying drawings. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the embodiments. In the drawings, like reference numerals designate corresponding parts throughout the different views.
FIG. 1 is a block diagram of a semi-supervised learning approach for structured data in accordance with the present invention;
FIG. 2 is basic information of experimental data in one embodiment of the invention;
FIG. 3 is a parameter set of experimental data in one embodiment of the invention;
FIG. 4 is an experimental result on an unlabeled dataset in one embodiment of the invention;
FIG. 5 is experimental results on a test data set in one embodiment of the invention.
Detailed Description
Example one
The invention mainly proposes, on the basis of semi-supervised GANs, a new model Embedding GAN (EmGAN) suitable for structured data, which is described in detail below in three aspects: the model structure, and the objective functions of the generator and the discriminator.
1. Model structure
The structure of the whole algorithm model is shown in fig. 1. This model is divided into three parts by dashed boxes:
A) The upper left corner is the preprocessing part of the original data x (structured data containing K class labels), which includes labeled samples x_l and their labels y_l, unlabeled samples x_u, and test-set samples {x_test, y_test}. As illustrated, we partition the feature set of the raw data x into two parts: a categorical feature subset x_CT and a numerical feature subset x_NL.
B) Inside the right dashed box is a six-layer fully-connected network D(x; θ_d) (θ_d being the model parameters of the discriminator, including the parameters of the Embedding layer in the figure) serving as the discriminator, i.e., the classifier that semi-supervised learning ultimately needs to obtain (a minimal sketch is given after this list). The original input to the discriminator is {x_l, x_u, x_g}, where x_g are samples generated by the generator described below. To better extract the latent semantic information of the categorical feature subset x_CT of the real samples x_l and x_u, we first feed the categorical features x_CT into the Embedding layer to obtain the corresponding dense embedding vectors E(x_CT), then combine them with the numerical feature subset x_NL to get the new sample feature set E(x_CT)+x_NL, and apply Batch Normalization (BN) to obtain normalized samples containing the new feature set. Finally, the new samples are fed into the discriminator for training, while generated samples x_g are used directly as discriminator input. The output dimension is increased from the K classes of the real data to K+1 classes, and we use p_model(y = K+1 | x) to indicate the probability that x is a fake sample.
C) The lower left corner is the generator G(z; θ_g) (θ_g being the model parameters of the generator), which consists of a three-layer fully-connected network with BN applied to the output of each layer to prevent vanishing gradients (also sketched after this list). Unlike the previously proposed semi-supervised FM GANs, in which the generator learns the distribution p_G of the raw data x, here noise drawn from p_z(z) is taken as input and p_G is made to fit the distribution of the preprocessed samples E(x_CT)+x_NL; through a multi-layer perceptron, the input noise z is learned to be mapped to data E(x_CT)+x_NL, and finally a generated sample x_g close to the real samples is output, which is also one of the inputs to the discriminator mentioned above.
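Purely as an illustration of parts B) and C), and under assumed layer widths, a PyTorch-style sketch of the discriminator (six fully-connected layers, K+1 output logits) and the generator (three fully-connected layers with BN) might look as follows:

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Six fully-connected layers; outputs K+1 logits (K real classes plus one 'fake' class)."""
    def __init__(self, in_dim, n_classes, hidden=(256, 256, 128, 128, 64)):
        super().__init__()
        layers, prev = [], in_dim
        for h in hidden:
            layers += [nn.Linear(prev, h), nn.ReLU()]
            prev = h
        self.features = nn.Sequential(*layers)     # last hidden activation, used as f(.) for feature matching
        self.out = nn.Linear(prev, n_classes + 1)  # the (K+1)-th logit models the fake class

    def forward(self, x, return_features=False):
        f = self.features(x)
        return (self.out(f), f) if return_features else self.out(f)

class Generator(nn.Module):
    """Three fully-connected layers with BN, mapping noise z into the E(x_CT)+x_NL feature space."""
    def __init__(self, z_dim, out_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),  # same width as the new feature set
        )

    def forward(self, z):
        return self.net(z)

# usage sketch: x_g = Generator(z_dim=100, out_dim=feat_dim)(torch.randn(batch_size, 100))
```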
2. Objective function of generator
Feature matching is a generator objective function previously proposed to address the instability of GAN training; it effectively prevents the generator from over-training on the current discriminator. Its objective is no longer simply to maximize the discriminator's output on the generated samples; instead, it requires the generator's samples G(z) to match the statistical characteristics of the real samples x, where the discriminator itself is used to specify which statistics are worth matching. Specifically, the generator is trained to match the output values of an intermediate layer of the discriminator; in EmGAN the activation output of the last fully-connected layer is matched. Since the purpose of the discriminator is precisely to find the features that, under the current model, best distinguish the generated samples G(z) from the real samples x, this is a natural choice of statistics for the new generator objective. Experiments have shown that feature matching is indeed very effective in cases where conventional GANs become unstable.
The original feature matching loss function is:

L_{FM} = \left\| \mathbb{E}_{x \sim p_{data}} f(x) - \mathbb{E}_{z \sim p_z(z)} f\big(G(z)\big) \right\|_2^2

where G denotes the functional expression of the generator model, the function f is the expression of the statistical characteristic output by the discriminator D, \mathbb{E}_{x \sim p_{data}} f(x) is the expected statistic of the real samples x, and \mathbb{E}_{z \sim p_z(z)} f(G(z)) is the statistic obtained by passing the input noise through the generator G and then through the function f.
As mentioned when introducing the model structure, in order to better express the latent semantic information of the categorical features and improve the learning performance of the whole network, we recombine the real samples through the Embedding layer into new feature samples E(x_CT)+x_NL, and we require the generator to match E(x_CT)+x_NL. So in Embedding GAN, the loss function of the generator is:

L_G = \left\| \mathbb{E}_{x \sim p_{data}} f\big(E(x_{CT})+x_{NL}\big) - \mathbb{E}_{z \sim p_z(z)} f\big(G(z)\big) \right\|_2^2
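Assuming the hypothetical Discriminator sketched earlier (whose last hidden activation plays the role of f), this loss might be computed as:

```python
import torch

def generator_fm_loss(discriminator, real_new_features, fake_samples):
    """|| E[f(E(x_CT)+x_NL)] - E[f(G(z))] ||^2 estimated over a mini-batch."""
    _, f_real = discriminator(real_new_features, return_features=True)
    _, f_fake = discriminator(fake_samples, return_features=True)
    return torch.sum((f_real.mean(dim=0) - f_fake.mean(dim=0)) ** 2)
```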
3. Objective function of discriminator
It has been noted in prior research that when a perfect generator is available (i.e., the generator produces samples indistinguishable from real samples), having the discriminator add a (K+1)-th class to represent fake samples does not help improve generalization performance. So, in theory, a complement generator should be trained to help produce an excellent semi-supervised classifier. When G is a generator that produces complement samples, a near-optimal discriminator D can ideally find the correct decision boundaries in feature space between the high-density subsets of the data. That is, the complement samples produced by the generator push down the discriminator's output probabilities for the real classes in low-density regions, so the discriminator finally obtains the correct class boundaries in the low-density regions.
For the above complement generator to work, the discriminator needs to have high real/fake confidence on the unlabeled samples x_u. However, the most primitive GAN discriminator objective does not meet this requirement; it only requires the discriminator to judge unlabeled samples as real (y ≤ K) in order to obtain the correct classification boundary. If the probabilities p_D(k | x) for k ≤ K are uniformly distributed, the original objective cannot enhance the classification performance of the discriminator. To guarantee strong real/fake confidence under the optimal condition, a conditional entropy term L_entropy is added to the objective function. The new discriminator objective function is:

L_D = L_{supervised} + L_{unsupervised} + L_{entropy}

where:

L_{supervised} = -\mathbb{E}_{(x_l, y_l) \sim p_{data}} \log p_{model}(y_l \mid x_l, y_l \le K)

L_{unsupervised} = -\mathbb{E}_{x_u \sim p_{data}} \log p_{model}(y \le K \mid x_u) - \mathbb{E}_{z \sim p_z(z)} \log p_{model}(y = K+1 \mid G(z))

L_{entropy} = -\mathbb{E}_{x_u \sim p_{data}} \sum_{k=1}^{K} p_{model}(k \mid x_u) \log p_{model}(k \mid x_u)

In the above formulas, L_{supervised} represents the cross entropy between the confidence (probability) output for each of the first K classes and the true label y_l when the discriminator D receives labeled data x_l and the corresponding class label y_l; the first term of L_{unsupervised} is the cross entropy between the confidence output for the first K classes and the condition y ≤ K when the discriminator input is an unlabeled sample x_u, and its second term is the cross entropy between the confidence output by the model for the (K+1)-th class and the label y = K+1 when the input is a generated sample x_g; L_{entropy} is the conditional entropy term, representing the information entropy computed from the confidences of the K classes when the input is an unlabeled sample x_u.
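A hedged sketch of how these three terms might be computed from the K+1 discriminator logits (helper names and the small epsilon are illustrative only):

```python
import torch
import torch.nn.functional as F

def discriminator_loss(logits_lab, y_lab, logits_unl, logits_fake, K):
    """L_supervised + L_unsupervised + L_entropy for K real classes plus the fake class (index K)."""
    # supervised term: cross entropy over the K real classes for labeled samples
    l_sup = F.cross_entropy(logits_lab[:, :K], y_lab)
    # unsupervised term: unlabeled samples should look real, generated samples should look fake
    p_unl = F.softmax(logits_unl, dim=1)
    p_fake = F.softmax(logits_fake, dim=1)
    l_unsup = -torch.log(1.0 - p_unl[:, K] + 1e-8).mean() - torch.log(p_fake[:, K] + 1e-8).mean()
    # conditional entropy term over the K real classes of unlabeled samples
    p_k = F.softmax(logits_unl[:, :K], dim=1)
    l_ent = -(p_k * torch.log(p_k + 1e-8)).sum(dim=1).mean()
    return l_sup + l_unsup + l_ent
```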
In practical data mining applications, this new combination of generator and discriminator objective functions can produce an effective complement generator that helps the discriminator find the correct classification decision boundary. In the actual training process, in each model iteration the generator's parameters are updated once or twice by the generator objective, while the discriminator's parameters are updated once to optimize the discriminator objective.
The invention thus discloses a method for semi-supervised learning of structured data: a semi-supervised adversarial neural network model structure (called Embedding Generative Adversarial Net) is constructed for structured data; the original structured data X are preprocessed, and the features of the original data X are divided into two parts, a categorical feature subset x_CT and a numerical feature subset x_NL. The original input to the model's discriminator is {x_l, x_u, x_g}, where x_l and x_u are labeled and unlabeled data samples respectively, x_g are the samples produced by the generator, and x_l and x_u share the same feature set. First, the categorical feature subset x_CT of a sample is fed into the Embedding layer to obtain the corresponding dense embedding vectors E(x_CT), which are then combined with the numerical feature subset x_NL to obtain samples with the new feature set E(x_CT)+x_NL; Batch Normalization (BN) is applied to obtain normalized samples containing the new feature set, and these new samples are finally fed into the discriminator for training, while generated samples x_g are used directly as discriminator input. The generator G(z; θ_g) consists of three fully-connected layers with BN applied to each layer's output to prevent vanishing gradients; taking noise as input, it produces generated samples x_g with the feature set E(x_CT)+x_NL.
Example two
In the present embodiment the algorithm is described; it may be implemented in any programming language and hardware environment and is not limited to the environment of the specific example below.
A method for semi-supervised learning of structured data comprises constructing an Embedding GAN model structure suitable for structured data, preprocessing the original data X (including missing-value filling and numericalization of categorical features), and dividing the feature set of the processed original data X into two parts: a categorical feature subset x_CT and a numerical feature subset x_NL. The input of the model's discriminator D(x; θ_d) (x being the input sample, θ_d the model parameters of the discriminator, including the parameters of the Embedding layer in the figure) is {x_l, x_u, x_g}, where x_l and x_u are labeled and unlabeled data samples respectively and x_g are the samples produced by the generator. The categorical features x_CT are fed into an Embedding layer (a neural network structure that converts input numerical codes into corresponding multidimensional vectors) to obtain the corresponding dense embedding vectors E(x_CT), which are then combined with the numerical feature subset x_NL to obtain samples with the new feature set E(x_CT)+x_NL; Batch Normalization (BN, which normalizes the feature values of each batch of samples, since the neural network is trained in batches) is applied to obtain normalized samples containing the new feature set, and the normalized new samples are finally fed into the discriminator for training, while generated samples x_g are used directly as discriminator input. The generator G(z; θ_g) (z being the input noise, θ_g the model parameters of the generator) consists of three fully-connected layers, with BN applied to each layer's output to prevent the vanishing gradients that may occur during neural network training. Noise following the probability distribution p_z(z) is taken as input, and the probability distribution p_G of the generated samples x_g is made to fit that of the previously processed new samples; the multi-layer perceptron finally learns to map the input noise z to the data E(x_CT)+x_NL produced from the real samples, and outputs a generated sample x_g close to the real sample distribution.
Further, the real samples are recombined through the Embedding layer into samples with the new features E(x_CT)+x_NL, and the generator matches E(x_CT)+x_NL. In Embedding GAN, the loss function of the generator is:

L_G = \left\| \mathbb{E}_{x \sim p_{data}} f\big(E(x_{CT})+x_{NL}\big) - \mathbb{E}_{z \sim p_z(z)} f\big(G(z)\big) \right\|_2^2

where G denotes the functional expression of the generator model, f is the functional expression of the intermediate-layer output statistics of the discriminator D, \mathbb{E}_{x \sim p_{data}} f(E(x_{CT})+x_{NL}) is the expected output statistic of the processed real samples E(x_CT)+x_NL, and \mathbb{E}_{z \sim p_z(z)} f(G(z)) is the expected statistic obtained by passing the input noise through the generator G to produce generated samples and then through the function f. Since the expected statistic of the generated samples should finally be close to that of the real samples, the generator parameters θ_g are optimized to minimize this loss function.
Furthermore, to guarantee strong real/fake confidence under the optimal condition, a conditional entropy term L_entropy is added to the objective function. The new discriminator objective function is:

L_D = L_{supervised} + L_{unsupervised} + L_{entropy}

which can be divided into a supervised learning term L_{supervised}, an unsupervised learning term L_{unsupervised}, and the newly added conditional entropy term L_{entropy}:

L_{supervised} = -\mathbb{E}_{(x_l, y_l) \sim p_{data}} \log p_{model}(y_l \mid x_l, y_l \le K)

L_{unsupervised} = -\mathbb{E}_{x_u \sim p_{data}} \log p_{model}(y \le K \mid x_u) - \mathbb{E}_{z \sim p_z(z)} \log p_{model}(y = K+1 \mid G(z))

L_{entropy} = -\mathbb{E}_{x_u \sim p_{data}} \sum_{k=1}^{K} p_{model}(k \mid x_u) \log p_{model}(k \mid x_u)

Here, L_{supervised} represents the expected cross entropy between the confidence (probability) output for each of the first K classes and the true label y_l when the discriminator D receives labeled data x_l and the corresponding class label y_l; the first term of L_{unsupervised} is the expected cross entropy between the summed confidence of the first K classes (the classes of real samples, of which there may be K different ones) and the condition y ≤ K when the input of the discriminator D is an unlabeled real sample x_u, and its second term is the expected cross entropy between the confidence of the (K+1)-th class (the class label of generated samples) and the label y = K+1 when the input is a generated sample x_g; L_{entropy} is the expectation of the conditional entropy term, representing the information entropy computed from the confidences of the K classes when the input is an unlabeled sample x_u.
Training with this new combination of generator and discriminator objective functions yields a complement generator that helps the discriminator find the correct classification decision boundary. In each model iteration, the generator's model parameters are updated once or twice by the generator objective, while the discriminator's parameters are updated once to optimize the discriminator objective.
In this embodiment, the specific steps of model training are as follows:
1. based on the description of the model structure, the whole network structure is realized by using a computer language, which comprises the following steps: the generator and the discriminator.
2. The original labeled data are divided into ten folds using ten-fold cross-validation; each fold in turn is used as the test set, with the other nine as the training set. The original data thus become ten splits with different training and test sets, and each training set is then divided into labeled and unlabeled data at four proportions: 10%, 20%, 30%, and 40% (a splitting sketch is given after these steps). For example, suppose we have 100 original samples: we first divide the data into ten folds and use 9 folds as the training set for each run; when the labeling proportion is 10%, we take 10 samples and their labels as the labeled input, remove the labels of the other 90 samples to use them as the unlabeled input, and evaluate the model's performance on the remaining fold.
3. The original training data are divided into labeled data and unlabeled data, and non-numeric categorical features are encoded as integers.
4. First, the unlabeled data are replicated to the same size as the labeled data. Then the categorical features x_CT of both the labeled and the unlabeled data are fed into the Embedding layer to obtain their embedding vectors E(x_CT). The embedding vectors are combined with the other original features to obtain new samples E(x_CT)+x_NL, which are fed into the discriminator to obtain its outputs on labeled and unlabeled samples. At the same time, Gaussian random noise z is passed through the generator and then the discriminator to obtain the statistical-characteristic output of the generated samples G(z). Finally, the discriminator objective function

L_D = L_{supervised} + L_{unsupervised} + L_{entropy}

is used to back-propagate gradients and update all discriminator and Embedding-layer parameters.
5. Then the noise z is fed into the generator again to obtain generated samples G(z); G(z) and the E(x_CT)+x_NL obtained by passing unlabeled samples through the Embedding layer are fed into the discriminator simultaneously to obtain the outputs D(G(z)) and D(E(x_CT)+x_NL). Finally, the generator objective function

L_G = \left\| \mathbb{E}_{x \sim p_{data}} f\big(E(x_{CT})+x_{NL}\big) - \mathbb{E}_{z \sim p_z(z)} f\big(G(z)\big) \right\|_2^2

is used to back-propagate and update the generator parameters θ_g.
Steps 4 and 5 are repeated (a simplified training loop is sketched below) until the preset maximum number of iterations is reached or the objective values of the generator and the discriminator no longer change significantly; the parameters of the whole model are then saved to disk, yielding an excellent semi-supervised classification model.
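As an illustration of step 2 above, a data-splitting sketch using scikit-learn's KFold (variable names are illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold

def make_splits(X, y, label_ratio=0.10, seed=0):
    """Yield (labeled, unlabeled, test) index sets for each of the ten folds."""
    rng = np.random.default_rng(seed)
    for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=seed).split(X):
        train_idx = rng.permutation(train_idx)
        n_labeled = int(len(train_idx) * label_ratio)
        yield train_idx[:n_labeled], train_idx[n_labeled:], test_idx
```

And a highly simplified training-loop sketch for steps 4 and 5, assuming the hypothetical modules and loss helpers sketched earlier and illustrative optimizer settings:

```python
import torch

def train(embed, disc, gen, loader_lab, loader_unl, K, z_dim=100, epochs=100, g_steps=2):
    opt_d = torch.optim.Adam(list(embed.parameters()) + list(disc.parameters()), lr=1e-3)
    opt_g = torch.optim.Adam(gen.parameters(), lr=1e-3)
    for _ in range(epochs):
        for (xc_l, xn_l, y_l), (xc_u, xn_u) in zip(loader_lab, loader_unl):
            real_lab, real_unl = embed(xc_l, xn_l), embed(xc_u, xn_u)
            # step 4: one discriminator (and Embedding layer) update per iteration
            fake = gen(torch.randn(real_unl.size(0), z_dim))
            loss_d = discriminator_loss(disc(real_lab), y_l, disc(real_unl), disc(fake.detach()), K)
            opt_d.zero_grad(); loss_d.backward(); opt_d.step()
            # step 5: generator updated once or twice per iteration via feature matching
            for _ in range(g_steps):
                fake = gen(torch.randn(real_unl.size(0), z_dim))
                loss_g = generator_fm_loss(disc, real_unl.detach(), fake)
                opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```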
The experimental data are 11 public semi-supervised learning data sets from UCI and KEEL. FIG. 2 gives the basic information of the data, where "Data set" is the name of the data set, "#Samples" its sample size, "#Features (Categorical/Numerical)" the number and types of features, and "#Classes" the number of classes. The original data were divided into ten folds using ten-fold cross-validation, each fold in turn serving as the test set and the other nine as the training set. The original data thus become ten splits with different training and test sets, and each training set is then divided into labeled and unlabeled data at four proportions: 10%, 20%, 30%, and 40%. For example, suppose we have 100 original samples: we first divide the data into ten folds and use 9 folds as the training set for each run; when the labeling proportion is 10%, we take 10 samples and their labels as the labeled input, remove the labels of the other 90 samples to use them as the unlabeled input, and evaluate the model on the remaining fold. That is, we eventually perform controlled experiments on 11 × 4 data set configurations.
The experiments compare the method against 6 existing semi-supervised learning algorithms on two indexes, Accuracy and Cohen's Kappa. These 6 algorithms are all designed on top of one or more specified base algorithms, and each semi-supervised learning algorithm is paired with four different base algorithms. That is, the invention is ultimately compared against 24 different algorithm configurations.
The 6 comparison algorithms, their base algorithms, and the algorithm-specific parameter settings of the present invention are shown in FIG. 3, where EmGAN is the algorithm of the invention; KNN, C4.5, Naive Bayes, and SMO are the four base algorithms; and Democratic-Co, Self-training, Co-Bagging, Tri-training, and DE-Tri-training are among the compared semi-supervised learning algorithms.
In this embodiment, the whole EmGAN model is obtained by following the steps described above. The discriminator model is then used to predict the classes of the unlabeled data and of the test set of each data set, and the two indexes Accuracy and Cohen's Kappa are computed against their true class labels; the final experimental results are shown in FIG. 4 and FIG. 5.
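The two evaluation indexes can be computed, for example, with scikit-learn (a sketch; labels and preds stand for the true and predicted classes):

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

def evaluate(labels, preds):
    """Accuracy and Cohen's Kappa on the unlabeled or test portion of a data set."""
    return accuracy_score(labels, preds), cohen_kappa_score(labels, preds)
```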
Experimental analysis: in FIGS. 4 and 5, bold font indicates the algorithm that performs best on each data set. It can be observed that the proposed algorithm performs best on the unlabeled data sets in terms of both Accuracy and Cohen's Kappa. On the test sets, the proposed algorithm also achieves the best results at the 20%, 30%, and 40% labeled-sample proportions. In summary, the effectiveness of the proposed method is demonstrated.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Although the invention has been described above with reference to various embodiments, it should be understood that many changes and modifications may be made without departing from the scope of the invention. It is therefore intended that the foregoing detailed description be regarded as illustrative rather than limiting, and that it be understood that it is the following claims, including all equivalents, that are intended to define the spirit and scope of this invention. The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.
Claims (3)
1. A method for semi-supervised learning of structured data, characterized by constructing an Embedding GAN model structure suitable for structured data; preprocessing original data X, the preprocessing comprising missing-value filling and numericalization of categorical features; and dividing the feature set of the processed original data X into two parts, a categorical feature subset x_CT and a numerical feature subset x_NL; wherein the input of the model's discriminator D(x; θ_d) is {x_l, x_u, x_g}, where x is the input sample, θ_d denotes the model parameters including the parameters of an Embedding layer, the Embedding layer being a neural network structure capable of converting input values into corresponding multidimensional vectors, x_l and x_u are labeled and unlabeled data samples respectively, and x_g are samples produced by the generator; the categorical feature subset x_CT is fed into the Embedding layer to obtain the corresponding dense embedding vectors E(x_CT), which are then combined with the numerical feature subset x_NL to obtain samples with a new feature set E(x_CT)+x_NL; Batch Normalization is applied to obtain normalized samples containing the new feature set, and the normalized new samples are finally fed into the discriminator for training, while generated samples x_g are used directly as discriminator input; the generator G(z; θ_g), where z is the input noise and θ_g are the model parameters of the generator, consists of three fully-connected layers, with Batch Normalization applied to each layer's output to prevent the vanishing gradients that may occur during neural network training; noise following the probability distribution p_z(z) is taken as input, and the probability distribution p_G of the generated samples x_g is made to fit the probability distribution of the samples E(x_CT)+x_NL with the new feature set; through a multi-layer perceptron, the input noise z is finally learned to be mapped to samples E(x_CT)+x_NL with the new feature set, and a generated sample x_g close to the real sample distribution is output.
2. The method of claim 1, wherein the real samples are recombined via the Embedding layer into samples E(x_CT)+x_NL with the new feature set and the generator matches E(x_CT)+x_NL; in Embedding GAN, the loss function of the generator is:

L_G = \left\| \mathbb{E}_{x \sim p_{data}} f\big(E(x_{CT})+x_{NL}\big) - \mathbb{E}_{z \sim p_z(z)} f\big(G(z)\big) \right\|_2^2

where G denotes the functional expression of the generator model, f is the functional expression of the intermediate-layer output statistics of the discriminator D, \mathbb{E}_{x \sim p_{data}} f(E(x_{CT})+x_{NL}) is the expected output statistic of the processed samples E(x_CT)+x_NL with the new feature set, and \mathbb{E}_{z \sim p_z(z)} f(G(z)) is the expected statistic obtained by passing the input noise through the generator G to produce generated samples and then through the function f; since the expected statistic of the generated samples should finally be close to that of the real samples, the generator parameters θ_g are optimized to minimize this loss function.
3. The method according to claim 2, wherein, to guarantee strong real/fake confidence under the optimal condition, a conditional entropy term L_entropy is added to the objective function; the new discriminator objective function is:

L_D = L_{supervised} + L_{unsupervised} + L_{entropy}

which can be divided into a supervised learning term L_{supervised}, an unsupervised learning term L_{unsupervised}, and the newly added conditional entropy term L_{entropy}:

L_{supervised} = -\mathbb{E}_{(x_l, y_l) \sim p_{data}} \log p_{model}(y_l \mid x_l, y_l \le K)

L_{unsupervised} = -\mathbb{E}_{x_u \sim p_{data}} \log p_{model}(y \le K \mid x_u) - \mathbb{E}_{z \sim p_z(z)} \log p_{model}(y = K+1 \mid G(z))

L_{entropy} = -\mathbb{E}_{x_u \sim p_{data}} \sum_{k=1}^{K} p_{model}(k \mid x_u) \log p_{model}(k \mid x_u)

wherein L_{supervised} represents the expected cross entropy between the confidence output for each of the first K classes and the true label y_l when the discriminator D receives labeled data x_l and the corresponding class label y_l; the first term of L_{unsupervised} is the expected cross entropy between the summed confidence of the first K classes and the condition y ≤ K when the input of the discriminator D is an unlabeled data sample x_u, and its second term is the expected cross entropy between the confidence of the (K+1)-th class output by the model and the label y = K+1 when the input is a generated sample x_g; L_{entropy} is the expectation of the conditional entropy term, representing the information entropy computed from the confidences of the K classes when the input is an unlabeled data sample x_u;

training with this new combination of generator and discriminator objective functions yields a complement generator that helps the discriminator find the correct classification decision boundary; in each model iteration, the generator's model parameters are updated once or twice by the generator objective, while the discriminator's parameters are updated once to optimize the discriminator objective.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910091581.7A CN109977094B (en) | 2019-01-30 | 2019-01-30 | Semi-supervised learning method for structured data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910091581.7A CN109977094B (en) | 2019-01-30 | 2019-01-30 | Semi-supervised learning method for structured data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109977094A CN109977094A (en) | 2019-07-05 |
CN109977094B true CN109977094B (en) | 2021-02-19 |
Family
ID=67076794
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910091581.7A Active CN109977094B (en) | 2019-01-30 | 2019-01-30 | Semi-supervised learning method for structured data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109977094B (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110704221B (en) * | 2019-09-02 | 2020-10-27 | 西安交通大学 | Data center fault prediction method based on data enhancement |
CN110569842B (en) * | 2019-09-05 | 2022-08-12 | 江苏艾佳家居用品有限公司 | Semi-supervised learning method for GAN model training |
CN110738309B (en) * | 2019-09-27 | 2022-07-12 | 华中科技大学 | DDNN training method and DDNN-based multi-view target identification method and system |
CN110719279A (en) * | 2019-10-09 | 2020-01-21 | 东北大学 | Network anomaly detection system and method based on neural network |
CN111240279B (en) * | 2019-12-26 | 2021-04-06 | 浙江大学 | Confrontation enhancement fault classification method for industrial unbalanced data |
CN111444959A (en) * | 2020-03-26 | 2020-07-24 | 常州工业职业技术学院 | Construction method of stack-type structure hierarchical classification model based on SVM |
CN111949886B (en) * | 2020-08-28 | 2023-11-24 | 腾讯科技(深圳)有限公司 | Sample data generation method and related device for information recommendation |
CN112232395B (en) * | 2020-10-08 | 2023-10-27 | 西北工业大学 | Semi-supervised image classification method for generating countermeasure network based on joint training |
CN113505664B (en) * | 2021-06-28 | 2022-10-18 | 上海电力大学 | Fault diagnosis method for planetary gear box of wind turbine generator |
CN113951868B (en) * | 2021-10-29 | 2024-04-09 | 北京富通东方科技有限公司 | Method and device for detecting man-machine asynchronism of mechanical ventilation patient |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103390171A (en) * | 2013-07-24 | 2013-11-13 | 南京大学 | Safe semi-supervised learning method |
CN103903164A (en) * | 2014-03-25 | 2014-07-02 | 华南理工大学 | Semi-supervised automatic aspect extraction method and system based on domain information |
CN107392147A (en) * | 2017-07-20 | 2017-11-24 | 北京工商大学 | A kind of image sentence conversion method based on improved production confrontation network |
CN107590262A (en) * | 2017-09-21 | 2018-01-16 | 黄国华 | The semi-supervised learning method of big data analysis |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10607319B2 (en) * | 2017-04-06 | 2020-03-31 | Pixar | Denoising monte carlo renderings using progressive neural networks |
US20180314932A1 (en) * | 2017-04-28 | 2018-11-01 | Intel Corporation | Graphics processing unit generative adversarial network |
-
2019
- 2019-01-30 CN CN201910091581.7A patent/CN109977094B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103390171A (en) * | 2013-07-24 | 2013-11-13 | 南京大学 | Safe semi-supervised learning method |
CN103903164A (en) * | 2014-03-25 | 2014-07-02 | 华南理工大学 | Semi-supervised automatic aspect extraction method and system based on domain information |
CN107392147A (en) * | 2017-07-20 | 2017-11-24 | 北京工商大学 | A kind of image sentence conversion method based on improved production confrontation network |
CN107590262A (en) * | 2017-09-21 | 2018-01-16 | 黄国华 | The semi-supervised learning method of big data analysis |
Non-Patent Citations (1)
Title |
---|
Network security situation awareness method based on multi-source and multi-level information fusion; Deng Xiaoheng et al.; Journal of Shanghai Jiao Tong University; 2015-08-31; Vol. 49, No. 8; pp. 1144-1152 *
Also Published As
Publication number | Publication date |
---|---|
CN109977094A (en) | 2019-07-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109977094B (en) | Semi-supervised learning method for structured data | |
CN110689086B (en) | Semi-supervised high-resolution remote sensing image scene classification method based on generating countermeasure network | |
Zheng et al. | A full stage data augmentation method in deep convolutional neural network for natural image classification | |
Canabarro et al. | Unveiling phase transitions with machine learning | |
Tang et al. | Deep safe incomplete multi-view clustering: Theorem and algorithm | |
Tsutsui et al. | Meta-reinforced synthetic data for one-shot fine-grained visual recognition | |
Zhao et al. | Embedding visual hierarchy with deep networks for large-scale visual recognition | |
Jin et al. | Cold-start active learning for image classification | |
CN107491782B (en) | Image classification method for small amount of training data by utilizing semantic space information | |
Tsai et al. | Deep learning of topological phase transitions from entanglement aspects | |
CN111753995A (en) | Local interpretable method based on gradient lifting tree | |
Hada et al. | Sparse oblique decision trees: A tool to understand and manipulate neural net features | |
Antoran et al. | Disentangling and learning robust representations with natural clustering | |
Yang et al. | Generative counterfactuals for neural networks via attribute-informed perturbation | |
Zhu et al. | Multiview latent space learning with progressively fine-tuned deep features for unsupervised domain adaptation | |
Zhang et al. | Mixture distribution graph network for few shot learning | |
Ye et al. | Self-supervised adversarial variational learning | |
Saleknia et al. | Efficient still image action recognition by the combination of ensemble learning and knowledge distillation | |
Komarov | Reducing the search area of genetic algorithm using neural network autoencoder | |
US20230031512A1 (en) | Surrogate hierarchical machine-learning model to provide concept explanations for a machine-learning classifier | |
Cai et al. | A novel deep learning approach: Stacked evolutionary auto-encoder | |
Zhang et al. | Evolutionary computation and evolutionary deep learning for image analysis, signal processing and pattern recognition | |
Jing et al. | NASABN: A neural architecture search framework for attention-based networks | |
Ju et al. | A novel neutrosophic logic svm (n-svm) and its application to image categorization | |
Darma et al. | GFF-CARVING: Graph Feature Fusion for the Recognition of Highly Varying and Complex Balinese Carving Motifs |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |