CN109977094B - Semi-supervised learning method for structured data - Google Patents

Info

Publication number
CN109977094B
Authority
CN
China
Prior art keywords
sample
discriminator
generator
data
input
Prior art date
Legal status
Active
Application number
CN201910091581.7A
Other languages
Chinese (zh)
Other versions
CN109977094A (en)
Inventor
邓晓衡 (Deng Xiaoheng)
黄戎 (Huang Rong)
沈海澜 (Shen Hailan)
Current Assignee
Central South University
Original Assignee
Central South University
Priority date
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN201910091581.7A priority Critical patent/CN109977094B/en
Publication of CN109977094A publication Critical patent/CN109977094A/en
Application granted granted Critical
Publication of CN109977094B publication Critical patent/CN109977094B/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Abstract

The invention discloses a method for semi-supervised learning of structured data. A semi-supervised adversarial neural network model is constructed for structured data; the original structured data X are preprocessed, and the features of X are divided into a categorical feature subset x_CT and a numerical feature subset x_NL. The original input to the model's discriminator is {x_l, x_u, x_g}, where x_l and x_u are labeled and unlabeled samples respectively, x_g is a sample produced by the generator, and x_l and x_u share the same feature set. The categorical feature subset x_CT is fed into an Embedding layer to obtain the corresponding dense embedding vector E(x_CT), which is then combined with the numerical feature subset x_NL to obtain a sample E(x_CT)+x_NL with a new feature set; Batch Normalization (BN) is applied to obtain a normalized sample over the new feature set, and this new sample is fed into the discriminator for training, while the generated sample x_g is fed directly to the discriminator. The generator consists of a three-layer fully connected network, with BN applied to each layer's output to prevent vanishing gradients; it takes noise as input and produces a generated sample x_g with the feature set E(x_CT)+x_NL.

Description

Semi-supervised learning method for structured data
Technical Field
The invention relates to the technical field of computers, in particular to a method for semi-supervised learning of structured data.
Background
The semi-supervised learning problem has attracted a lot of attention in many areas, such as anomaly detection and email archiving. The raw data in many real-world applications are structured and label-free, while a supervised learning task requires a large amount of manually labeled data as a training set, and high-quality manual labeling means more manpower, domain knowledge, and time overhead. Semi-supervised learning is a paradigm proposed to resolve this conflict: it uses a large amount of easily available unlabeled data together with a small amount of manually labeled data to enhance the final performance of the classifier. Essentially, this approach uses the large amount of unlabeled data to correct the hypotheses learned on the labeled data.
Currently, many different technical routes have been proposed to solve the semi-supervised learning problem, mainly: (1) Self-label techniques: under the assumption that the classifier tends to predict labels correctly, a classifier trained on a small amount of labeled data assigns each unlabeled sample its highest-confidence prediction as a new label, yielding a larger labeled data set; well-known semi-supervised learning algorithms of this kind include Democratic-Co, Tri-training, Co-Bagging, and Co-training. (2) Generative models and cluster-then-label methods: generative models were the first algorithms in the field to attempt to exploit unlabeled samples. A joint probability model p(x, y) = p(y)p(x|y) is assumed, where p(x|y) is a known mixture distribution, such as a Gaussian mixture model; this is therefore a deterministic parametric model using labeled and unlabeled data. The cluster-then-label method is very similar to the generative model but is not probabilistic: the general idea is to cluster the whole data set and label each cluster with the help of the labeled data. (3) Graph-based methods: unlabeled and labeled samples are taken as graph nodes and the similarities between nodes as graph edges, turning the semi-supervised problem into a graph-cut problem. (4) Semi-supervised support vector machines (S3VM): an extension of the standard SVM to unlabeled data that realizes the cluster assumption of semi-supervised learning, namely that samples in the same data cluster have similar labels, so that the decision boundary does not cut through dense regions of unlabeled data and the classes are well separated.
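To make route (1) concrete, the following is a minimal self-training sketch in Python; the scikit-learn base classifier, the confidence threshold, and all function and variable names are our own illustrative assumptions, not part of the invention.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def self_training(X_l, y_l, X_u, confidence_threshold=0.95, max_rounds=10):
    """Minimal self-label loop: repeatedly promote the most confident
    unlabeled predictions to pseudo-labeled training data (sketch)."""
    X_l, y_l, X_u = X_l.copy(), y_l.copy(), X_u.copy()
    clf = DecisionTreeClassifier()
    for _ in range(max_rounds):
        clf.fit(X_l, y_l)
        if len(X_u) == 0:
            break
        proba = clf.predict_proba(X_u)
        conf, pred = proba.max(axis=1), proba.argmax(axis=1)
        keep = conf >= confidence_threshold   # most confident predictions only
        if not keep.any():
            break
        # Promote confident unlabeled samples with their pseudo-labels.
        X_l = np.vstack([X_l, X_u[keep]])
        y_l = np.concatenate([y_l, pred[keep]])
        X_u = X_u[~keep]
    return clf
```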
However, in data mining or any type of predictive modeling, data comes before the algorithm. This is the main reason machine learning requires extensive feature engineering before specific tasks such as anomaly detection, email archiving, and many other tasks whose raw data cannot be fed directly into the SSL models described above. Many of the semi-supervised learning algorithms listed above also require extensive feature engineering to achieve good model performance, and most of the time such work demands domain knowledge, creativity, and extensive trial and error. In contrast, deep neural networks can accomplish these types of tasks well without any tedious and time-consuming feature engineering. Of course, domain expertise and sophisticated feature engineering remain very valuable.
Disclosure of Invention
Aiming at these technical problems, the invention constructs a new semi-supervised learning algorithm suitable for structured data based on the generative adversarial network (GAN). The model mainly comprises a generator and a discriminator: the discriminator first judges whether a sample is a real sample or a generator-produced sample, and second classifies samples into the labeled classes; the generator takes Gaussian noise as input to produce generated samples realistic enough to pass as genuine. Through adversarial training between the generator and the discriminator, we finally obtain a generator capable of producing convincing fakes and a discriminator with strong true/false discrimination ability; at the same time, a small amount of labeled samples is used to train the discriminator into an excellent classifier. The whole model is composed of multi-layer fully connected networks, so complex feature engineering is reduced or even avoided, and the classification performance of the model on structured data is significantly improved.
Inspired by the advantages of neural networks, and based on research into Feature Matching (FM) GANs and Bad GANs, we apply an Embedding GAN (EmGAN for short), modified for the data mining application scenario, to semi-supervised learning on structured data, saving a large amount of feature engineering compared with traditional semi-supervised learning algorithms.
Structured data means data in which the rows can be seen as collected data points or observations and the columns as fields representing the individual attributes of each observation.
The main differences between structured data and image data are: (1) unstructured data such as images are typically treated as a single entity composed of homogeneous units, such as pixels or audio samples, whereas structured data contain more data types, mainly numerical data and categorical data; (2) structured data lack the spatial correlation that exists between the pixels of image data.
As described in (1), structured data contain a large number of categorical features. The conventional processing methods are label/ordinal encoding and one-hot encoding, but these techniques have problems in terms of memory and in faithfully representing the category hierarchy, and the continuous nature of neural networks limits their direct application to categorical features. Therefore, representing categorical features with integers and then applying a neural network directly does not yield good results. Borrowing the idea of entity embedding and of recommender-system algorithms such as DeepFM, the method maps the categorical features of each real sample into a high-dimensional dense space and then inputs them, together with the other numerical features, into the discriminator for training. Influenced by (2), our GAN does not adopt the convolution and pooling structures used on computer vision tasks, avoiding the loss of feature information. Combining the theoretical research on Feature Matching (FM) GANs and Bad GANs, a combination of generator and discriminator loss functions suitable for data mining scenarios is proposed. Finally, the method is compared with other semi-supervised learning algorithms applied to structured data on public data sets from UCI and KEEL.
The present invention is directed to at least solving the problems of the prior art. Therefore, the invention discloses a method for semi-supervised learning of structured data, characterized by constructing a semi-supervised adversarial neural network model structure for structured data (called Embedding Generative Adversarial Net, Embedding GAN for short). The original data X are preprocessed and the feature set of X is divided into a categorical feature subset x_CT and a numerical feature subset x_NL. The original input samples of the model's discriminator are {x_l, x_u, x_g}, where x_l and x_u are labeled and unlabeled data samples respectively and x_g is a generator-produced sample. The categorical features x_CT are fed into an Embedding layer to obtain the corresponding dense embedding vector E(x_CT), which is then combined with the numerical feature subset x_NL to obtain a sample E(x_CT)+x_NL with a new feature set; Batch Normalization (BN) is applied to obtain a normalized sample over the new feature set, and finally the new sample is fed into the discriminator for training, while the generated sample x_g serves directly as an input to the discriminator. The generator G(z; θ_g) consists of a three-layer fully connected network, with BN applied to each layer's output to prevent vanishing gradients. Noise conforming to the probability distribution p_z(z) is taken as input, and the generated sample distribution p_G is made to fit the distribution of E(x_CT)+x_NL; a multi-layer perceptron learns to map the input noise z to data E(x_CT)+x_NL and finally outputs a generated sample x_g close to the real samples. P(y = K+1 | x) denotes the probability that x is a fake sample.
Further, the real samples are recombined through the Embedding layer into new feature samples E(x_CT)+x_NL, and the generator matches E(x_CT)+x_NL. In Embedding GAN, the loss function of the generator is:

$$\mathcal{L}_G = \left\| \mathbb{E}_{x \sim p_{\mathrm{data}}} f\big(E(x_{CT}) + x_{NL}\big) - \mathbb{E}_{z \sim p_z(z)} f\big(G(z)\big) \right\|_2^2$$

where G denotes the generator model, the function f denotes the statistics output by an intermediate layer of the discriminator D, $\mathbb{E}_{x \sim p_{\mathrm{data}}} f(E(x_{CT})+x_{NL})$ is the expected statistic of the real samples, and $\mathbb{E}_{z \sim p_z(z)} f(G(z))$ is the expected statistic obtained by passing the input noise through the generator G and feeding the generated samples into f.
Furthermore, in order to guarantee strong true/false confidence at the optimum, a conditional entropy term $\mathcal{L}_{entropy}$ is added to the objective function. The new discriminator objective function is:

$$\max_D \; \mathcal{L}_{supervised} + \mathcal{L}_{unsupervised} + \mathcal{L}_{entropy}$$

wherein:

$$\mathcal{L}_{supervised} = \mathbb{E}_{(x_l, y_l) \sim p_{\mathrm{data}}} \log p_D(y_l \mid x_l, y_l \le K)$$

$$\mathcal{L}_{unsupervised} = \mathbb{E}_{x_u \sim p_{\mathrm{data}}} \log p_D(y \le K \mid x_u) + \mathbb{E}_{z \sim p_z(z)} \log p_D(y = K{+}1 \mid G(z))$$

$$\mathcal{L}_{entropy} = \mathbb{E}_{x_u \sim p_{\mathrm{data}}} \sum_{k=1}^{K} p_D(k \mid x_u) \log p_D(k \mid x_u)$$

Here $\mathcal{L}_{supervised}$ is the cross entropy between the confidences (probabilities) output for the first K classes and the true label y_l when the discriminator D receives labeled data x_l with class label y_l; the first term of $\mathcal{L}_{unsupervised}$ is the cross entropy between the output confidences of the first K classes and the true label y ≤ K when the input of D is an unlabeled sample x_u, and its second term is the cross entropy between the model's output confidence of class K+1 and the true label y = K+1 when the input of D is a generated sample x_g; $\mathcal{L}_{entropy}$ is the conditional entropy term, the information entropy computed over the confidences of the K classes when the input is an unlabeled sample x_u.
This new combination of generator and discriminator objective functions can produce a complement generator that helps the discriminator find the correct classification decision boundary. In each model iteration, the parameters of the generator are updated once or twice through the generator's objective function, and the parameters of the discriminator are updated once to optimize the discriminator's objective function.
The invention has the following beneficial effects: 1. the adversarial neural network model is applied to a semi-supervised learning task on structured data and achieves a better effect; 2. for the large number of categorical features characteristic of structured data, the categorical features of each real sample are mapped into a high-dimensional dense space by an Embedding layer and then input, together with the other numerical features, into the discriminator for training, which effectively improves model performance; 3. for the structured data application scenario, a new combination of generator and discriminator objective functions is proposed, so that the generator can effectively produce complement samples, the discriminator can find the decision boundaries between real classes more accurately, and an excellent classifier model is obtained.
Drawings
The invention will be further understood from the following description in conjunction with the accompanying drawings. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the embodiments. In the drawings, like reference numerals designate corresponding parts throughout the different views.
FIG. 1 is a block diagram of a semi-supervised learning approach for structured data in accordance with the present invention;
FIG. 2 is basic information of experimental data in one embodiment of the invention;
FIG. 3 is a parameter set of experimental data in one embodiment of the invention;
FIG. 4 is an experimental result on an unlabeled dataset in one embodiment of the invention;
FIG. 5 is experimental results on a test data set in one embodiment of the invention.
Detailed Description
Example one
The invention mainly provides a new model, Embedding GAN (EmGAN), suitable for structured data and built on semi-supervised GANs. The following three aspects are described in detail: the model structure, and the objective functions of the generator and the discriminator.
1. Model structure
The structure of the whole algorithm model is shown in fig. 1. This model is divided into three parts by dashed boxes:
A) The upper left corner is the preprocessing portion of the original data x (structured data containing K class labels), which includes labeled samples x_l with their labels y_l, unlabeled samples x_u, and test-set samples {x_test, y_test}. As illustrated, we partition the feature set of the raw data x into a categorical feature subset x_CT and a numerical feature subset x_NL.
B) Within the right dashed box is a six-layer fully connected network D(x; θ_d), where θ_d denotes the discriminator's model parameters, including the parameters of the Embedding layer in the figure, serving as the discriminator, i.e., the classifier ultimately to be obtained by semi-supervised learning. The original input to the discriminator is {x_l, x_u, x_g}, where x_g are samples generated by the generator described below. To better extract the latent semantic information of the categorical feature subset x_CT of the real samples x_l and x_u, we first feed the categorical features x_CT into the Embedding layer to obtain the corresponding dense embedding vectors E(x_CT), then combine them with the numerical feature subset x_NL to get a new sample feature set E(x_CT)+x_NL, and apply Batch Normalization (BN) to obtain normalized samples over the new feature set. Finally, the new samples are fed into the discriminator for training, while the generated samples x_g are used directly as discriminator inputs. The output dimension is increased from the K classes of the real data to K+1 classes, and we use p_model(y = K+1 | x) to denote the probability that x is a fake sample.
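A minimal PyTorch sketch of this part of the pipeline is given below: one embedding table per categorical feature, concatenation with the numerical features, BN over the combined vector, and a fully connected stack with K+1 outputs. The layer widths, the embedding dimension, and the shortened depth (the patent's discriminator uses six fully connected layers) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EmbeddingDiscriminator(nn.Module):
    """Sketch of part B): embed categorical features, concatenate numerical
    features, apply BatchNorm, then a fully connected stack with K+1 outputs."""
    def __init__(self, cat_cardinalities, emb_dim, num_numerical,
                 num_classes_k, hidden=64):
        super().__init__()
        # One embedding table per categorical feature in x_CT.
        self.embeddings = nn.ModuleList(
            nn.Embedding(card, emb_dim) for card in cat_cardinalities)
        in_dim = emb_dim * len(cat_cardinalities) + num_numerical
        self.input_bn = nn.BatchNorm1d(in_dim)       # normalize E(x_CT)+x_NL
        self.body = nn.Sequential(                   # fully connected stack
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head = nn.Linear(hidden, num_classes_k + 1)  # K real classes + fake

    def transform(self, x_cat, x_num):
        """Build E(x_CT)+x_NL for real samples; x_cat holds integer codes,
        x_num holds float features."""
        emb = [e(x_cat[:, i]) for i, e in enumerate(self.embeddings)]
        return self.input_bn(torch.cat(emb + [x_num], dim=1))

    def forward(self, x):
        """x is either transform(x_cat, x_num) or a generated sample x_g."""
        feats = self.body(x)   # these activations play the role of f(x) below
        return self.head(feats), feats
```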
C) The lower left corner is the generator G(z; θ_g), where θ_g denotes the generator's model parameters. It consists of a three-layer fully connected network, with BN applied to each layer's output to prevent vanishing gradients. Unlike the previously proposed semi-supervised FM GANs, whose generator learns the distribution p_G of the raw data x, here noise drawn from the prior p_z(z) is taken as input and p_G is made to fit the distribution of E(x_CT)+x_NL. A multi-layer perceptron learns to map the input noise z to data E(x_CT)+x_NL, finally outputting a generated sample x_g close to the real samples, which is also one of the inputs to the discriminator mentioned above.
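Correspondingly, here is a sketch of the three-layer fully connected generator with BN on each layer's output; the noise dimension and hidden width are assumed values, and the output dimension must equal that of E(x_CT)+x_NL.

```python
import torch.nn as nn

class Generator(nn.Module):
    """Sketch of part C): three fully connected layers, BN on each layer's
    output, mapping noise z ~ p_z(z) into the E(x_CT)+x_NL space."""
    def __init__(self, noise_dim, out_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),    nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),   nn.BatchNorm1d(out_dim),
        )

    def forward(self, z):
        return self.net(z)   # x_g, fed directly to the discriminator
```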
2. Objective function of generator
Feature matching is a generator objective function previously proposed to address the instability of GAN training; it effectively prevents the generator from over-training on the current discriminator. Its objective is no longer simply to maximize the discriminator's output on the generated samples; instead, it requires the generator's samples G(z) to match the statistics of the real samples x, where we simply use the discriminator to specify which statistics are worth matching. Specifically, the generator is trained to match the output values of an intermediate layer of the discriminator; in EmGAN, the activation output of the last fully connected layer is matched. Since the purpose of the discriminator is precisely to find the features that, under the current model, best distinguish the generated samples G(z) from the real samples x, these intermediate activations are a natural choice of statistics for the new generator objective. Experiments show that feature matching is indeed very effective in cases where conventional GAN training becomes unstable.
The original feature matching loss function is:

$$\mathcal{L}_G = \left\| \mathbb{E}_{x \sim p_{\mathrm{data}}} f(x) - \mathbb{E}_{z \sim p_z(z)} f\big(G(z)\big) \right\|_2^2$$

where G denotes the generator model, the function f denotes the statistics output by an intermediate layer of the discriminator D, $\mathbb{E}_{x \sim p_{\mathrm{data}}} f(x)$ is the expected statistic of the real samples x, and $\mathbb{E}_{z \sim p_z(z)} f(G(z))$ is the expected statistic obtained by passing the input noise through the generator G and feeding the generated samples into f.
As mentioned when introducing the model structure, in order to better express the latent semantic information of the categorical features and improve the learning performance of the whole network, we recombine the real samples through the Embedding layer into new feature samples E(x_CT)+x_NL, and we require the generator to match E(x_CT)+x_NL. So in Embedding GAN, the loss function of the generator is:

$$\mathcal{L}_G = \left\| \mathbb{E}_{x \sim p_{\mathrm{data}}} f\big(E(x_{CT}) + x_{NL}\big) - \mathbb{E}_{z \sim p_z(z)} f\big(G(z)\big) \right\|_2^2$$
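Under these definitions, the Embedding GAN feature-matching loss can be sketched as follows, reusing the hypothetical EmbeddingDiscriminator and Generator from the sketches above; `transform` builds E(x_CT)+x_NL, and the discriminator's second return value plays the role of f.

```python
import torch

def feature_matching_loss(disc, gen, x_cat, x_num, z):
    """L_G = || E_x f(E(x_CT)+x_NL) - E_z f(G(z)) ||^2 (sketch)."""
    real = disc.transform(x_cat, x_num)    # E(x_CT)+x_NL for real samples
    _, f_real = disc(real)                 # intermediate statistics f(x)
    _, f_fake = disc(gen(z))               # f(G(z)); gen output dim matches real
    return (f_real.mean(dim=0) - f_fake.mean(dim=0)).pow(2).sum()
```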
3. Objective function of the discriminator
It has been noted in the research literature that when a perfect generator is available (i.e., the generator produces samples indistinguishable from real samples), having the discriminator add a (K+1)-th class to represent fake samples does not help improve generalization performance. So, in theory, a complement generator should be trained to help produce an excellent semi-supervised classifier. When G is a generator that produces complement samples, ideally a near-optimal discriminator D can find the correct decision boundary in feature space between the high-density subsets of the data. That is, the complement samples produced by the generator force the discriminator's output probability for the real classes to be low in low-density regions, so the discriminator finally places the correct class boundaries in the low-density regions.
For the above complement generator to work, the discriminator needs to have strong true/false confidence on the unlabeled samples x_u. However, the most primitive GAN discriminator objective does not meet this requirement: to reach the correct classification boundary it only needs the discriminator to satisfy

$$\sum_{k=1}^{K} p_D(k \mid x_u) > p_D(y = K{+}1 \mid x_u),$$

so even if the probabilities p_D(k | x), k ≤ K, are uniformly distributed, the original objective function cannot enhance the classification performance of the discriminator. In order to guarantee strong true/false confidence at the optimum, a conditional entropy term $\mathcal{L}_{entropy}$ is added to the objective function. The new discriminator objective function is:
$$\max_D \; \mathcal{L}_{supervised} + \mathcal{L}_{unsupervised} + \mathcal{L}_{entropy}$$

wherein:

$$\mathcal{L}_{supervised} = \mathbb{E}_{(x_l, y_l) \sim p_{\mathrm{data}}} \log p_D(y_l \mid x_l, y_l \le K)$$

$$\mathcal{L}_{unsupervised} = \mathbb{E}_{x_u \sim p_{\mathrm{data}}} \log p_D(y \le K \mid x_u) + \mathbb{E}_{z \sim p_z(z)} \log p_D(y = K{+}1 \mid G(z))$$

$$\mathcal{L}_{entropy} = \mathbb{E}_{x_u \sim p_{\mathrm{data}}} \sum_{k=1}^{K} p_D(k \mid x_u) \log p_D(k \mid x_u)$$

In the above formulas, $\mathcal{L}_{supervised}$ is the cross entropy between the confidences (probabilities) output for the first K classes and the true label y_l when the discriminator D receives labeled data x_l with class label y_l; the first term of $\mathcal{L}_{unsupervised}$ is the cross entropy between the output confidences of the first K classes and the true label y ≤ K when the input of D is an unlabeled sample x_u, and its second term is the cross entropy between the model's output confidence of class K+1 and the true label y = K+1 when the input of D is a generated sample x_g; $\mathcal{L}_{entropy}$ is the conditional entropy term, the information entropy computed over the confidences of the K classes when the input is an unlabeled sample x_u.
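The discriminator objective can be sketched as the following loss to be minimized (the negative of the objective above). Names are illustrative; the supervised term is taken over all K+1 logits as a common simplification, and whether the K confidences in the entropy term are renormalized is an implementation choice the patent leaves open.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(logits_l, y_l, logits_u, logits_g, K):
    """Negative of the discriminator objective above, to be minimized (sketch)."""
    # L_supervised: cross entropy on labeled samples (over all K+1 logits here,
    # a common simplification of log p_D(y_l | x_l, y_l <= K)).
    l_sup = F.cross_entropy(logits_l, y_l)
    # First unsupervised term: -log p_D(y <= K | x_u), computed by
    # log-sum-exp over the first K log-probabilities.
    log_p_u = F.log_softmax(logits_u, dim=1)
    l_unsup_real = -torch.logsumexp(log_p_u[:, :K], dim=1).mean()
    # Second unsupervised term: -log p_D(y = K+1 | G(z)) on generated samples
    # (index K is the (K+1)-th, "fake", output).
    l_unsup_fake = -F.log_softmax(logits_g, dim=1)[:, K].mean()
    # Conditional entropy term: minimizing the entropy of the K real-class
    # confidences maximizes E[sum_k p log p]; we renormalize over the first
    # K logits here, which is one possible reading.
    p_k = F.softmax(logits_u[:, :K], dim=1)
    l_entropy = -(p_k * torch.log(p_k + 1e-8)).sum(dim=1).mean()
    return l_sup + l_unsup_real + l_unsup_fake + l_entropy
```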
In practical data mining applications, this new combination of generator and discriminator objective functions can produce an effective complement generator that helps the discriminator find the correct classification decision boundary. In the actual training process, in each model iteration the parameters of the generator are updated once or twice through the generator's objective function, while the parameters of the discriminator are updated once to optimize the discriminator's objective function.
In summary, the invention discloses a method for semi-supervised learning of structured data. A semi-supervised adversarial neural network model structure (Embedding Generative Adversarial Net, Embedding GAN for short) is constructed for structured data; the original structured data X are preprocessed and the features of X are divided into a categorical feature subset x_CT and a numerical feature subset x_NL. The original input to the model's discriminator is {x_l, x_u, x_g}, where x_l and x_u are labeled and unlabeled data samples respectively, x_g is a generator-produced sample, and x_l and x_u share the same feature set. First, the categorical feature subset x_CT is fed into the Embedding layer to obtain the corresponding dense embedding vector E(x_CT), which is then combined with the numerical feature subset x_NL to obtain a sample E(x_CT)+x_NL with a new feature set; Batch Normalization (BN) is applied to obtain a normalized sample over the new feature set, and finally the new sample is fed into the discriminator for training, while the generated sample x_g is used directly as a discriminator input. The generator G(z; θ_g) consists of a three-layer fully connected network, with BN applied to each layer's output to prevent vanishing gradients; it takes noise as input and produces a generated sample x_g with the feature set E(x_CT)+x_NL.
Example two
This embodiment describes the algorithm; it may be implemented in any programming language and hardware environment and is not limited by the implementation environment of the following specific example.
A method for semi-supervised learning of structured data comprises constructing an Embedding GAN model structure suitable for structured data and preprocessing the original data X (including missing value filling and numerical encoding of categorical features), the feature set of the processed original data X being divided into a categorical feature subset x_CT and a numerical feature subset x_NL. The input of the model's discriminator D(x; θ_d), where x is the input sample and θ_d denotes the discriminator's model parameters including those of the Embedding layer, is {x_l, x_u, x_g}, where x_l and x_u are labeled and unlabeled data samples respectively and x_g is a generator-produced sample. The categorical features x_CT are fed into an Embedding layer (a neural network structure that converts input values into corresponding multidimensional vectors) to obtain the corresponding dense embedding vectors E(x_CT), which are then combined with the numerical feature subset x_NL to obtain samples E(x_CT)+x_NL with a new feature set; Batch Normalization (BN; since the neural network is trained in batches, this technique normalizes the feature values of each batch of samples) is applied to obtain normalized samples over the new feature set, and finally the normalized new samples are fed into the discriminator for training, while the generated samples x_g are used directly as discriminator inputs. The generator G(z; θ_g), where z is the input noise and θ_g denotes the generator's model parameters, consists of a three-layer fully connected network, with BN applied to each layer's output to prevent the vanishing gradients that may occur during neural network training. Noise conforming to the probability distribution p_z(z) is taken as input, and the probability distribution p_G of the generated samples x_g is made to fit the distribution of the preprocessed new samples E(x_CT)+x_NL; the multi-layer perceptron finally learns to map the input noise z to the data E(x_CT)+x_NL produced from the real samples and outputs a generated sample x_g close to the real sample distribution.
Further, the real samples are recombined through the Embedding layer into samples E(x_CT)+x_NL with new features, and the generator matches E(x_CT)+x_NL. In Embedding GAN, the loss function of the generator is:

$$\mathcal{L}_G = \left\| \mathbb{E}_{x \sim p_{\mathrm{data}}} f\big(E(x_{CT}) + x_{NL}\big) - \mathbb{E}_{z \sim p_z(z)} f\big(G(z)\big) \right\|_2^2$$

where G denotes the generator model, f is the function producing the statistics output by an intermediate layer of the discriminator D, $\mathbb{E}_{x \sim p_{\mathrm{data}}} f(E(x_{CT})+x_{NL})$ is the expected statistic of the processed real samples E(x_CT)+x_NL, and $\mathbb{E}_{z \sim p_z(z)} f(G(z))$ is the expected statistic obtained by passing the input noise through the generator G and feeding the generated samples into f. Since the expected statistics of the generated samples should approach those of the real samples, the generator parameters θ_g are optimized to minimize this loss function.
Furthermore, in order to guarantee strong true/false confidence at the optimum, a conditional entropy term $\mathcal{L}_{entropy}$ is added to the objective function. The new discriminator objective function is:

$$\max_D \; \mathcal{L}_{supervised} + \mathcal{L}_{unsupervised} + \mathcal{L}_{entropy},$$

which can be divided into a supervised learning term $\mathcal{L}_{supervised}$, an unsupervised learning term $\mathcal{L}_{unsupervised}$, and the newly added conditional entropy term $\mathcal{L}_{entropy}$:

$$\mathcal{L}_{supervised} = \mathbb{E}_{(x_l, y_l) \sim p_{\mathrm{data}}} \log p_D(y_l \mid x_l, y_l \le K)$$

$$\mathcal{L}_{unsupervised} = \mathbb{E}_{x_u \sim p_{\mathrm{data}}} \log p_D(y \le K \mid x_u) + \mathbb{E}_{z \sim p_z(z)} \log p_D(y = K{+}1 \mid G(z))$$

$$\mathcal{L}_{entropy} = \mathbb{E}_{x_u \sim p_{\mathrm{data}}} \sum_{k=1}^{K} p_D(k \mid x_u) \log p_D(k \mid x_u)$$

Here $\mathcal{L}_{supervised}$ is the expected cross entropy between the confidences (probabilities) output for the first K classes and the true label y_l when the discriminator D receives labeled data x_l with class label y_l; the first term of $\mathcal{L}_{unsupervised}$ is the expected cross entropy between the sum of the confidences of the first K classes (representing the classes of the real samples, of which there are K) and the true label y ≤ K when the input of D is an unlabeled real sample x_u, and its second term is the expected cross entropy between the model's output confidence of class K+1 and the true label y = K+1 when the input of D is a generated sample x_g (class K+1 being the class label of generated samples); $\mathcal{L}_{entropy}$ is the expectation of the conditional entropy term, the information entropy computed over the confidences of the K classes when the input is an unlabeled sample x_u.
Training with this new combination of generator and discriminator objective functions yields a complement generator that helps the discriminator find the correct classification decision boundary. In each model iteration, the model parameters of the generator are updated once or twice through the generator's objective function, while the parameters of the discriminator are updated once to optimize the discriminator's objective function.
In this embodiment, the specific steps of model training are as follows:
1. based on the description of the model structure, the whole network structure is realized by using a computer language, which comprises the following steps: the generator and the discriminator.
2. The original labeled data are divided into ten folds using ten-fold cross-validation; each fold in turn serves as the test set and the other nine as the training set. The original data thus become ten data sets with different train/test splits, and each training set is then divided into labeled and unlabeled data at four ratios: 10%, 20%, 30% and 40%. For example, suppose we have 100 original samples: we first divide the data into ten folds and use nine of them as the training set for each run. At a labeling ratio of 10%, we use 10 samples together with their labels as labeled input, strip the labels from the other 90 samples to use them as unlabeled input, and evaluate the model's performance on the remaining fold.
3. The original training data is divided into labeled data and unlabeled data, and non-numeric type class features are encoded with numeric values.
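Steps 2 and 3 can be sketched with scikit-learn utilities as below; the column handling, the StratifiedKFold choice, and the function name are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import OrdinalEncoder

def make_splits(df, target_col, label_ratio=0.10, seed=0):
    """Ten-fold cross-validation; within each training fold, keep `label_ratio`
    of the samples labeled and strip the labels from the rest (sketch)."""
    X, y = df.drop(columns=[target_col]), df[target_col].to_numpy()
    cat_cols = X.select_dtypes(exclude="number").columns
    if len(cat_cols):
        # Step 3: integer-code the non-numeric categorical features.
        X[cat_cols] = OrdinalEncoder().fit_transform(X[cat_cols])
    rng = np.random.default_rng(seed)
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):
        n_lab = max(1, int(label_ratio * len(train_idx)))
        lab = rng.choice(train_idx, size=n_lab, replace=False)
        unl = np.setdiff1d(train_idx, lab)
        yield (X.iloc[lab], y[lab],          # labeled training data
               X.iloc[unl],                  # unlabeled training data
               X.iloc[test_idx], y[test_idx])
```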
4. First, the unlabeled data are replicated to the same size as the labeled data. Then the categorical features x_CT of the labeled and unlabeled data are fed into the Embedding layer to obtain their embedding vectors E(x_CT). The embedding vectors are combined with the other original features to obtain new samples E(x_CT)+x_NL, which are fed into the discriminator to obtain its outputs on the labeled and unlabeled data. At the same time, Gaussian random noise z is passed through the generator and then the discriminator to obtain the statistics f(G(z)) of the generated samples G(z). Finally, the discriminator objective

$$\max_D \; \mathcal{L}_{supervised} + \mathcal{L}_{unsupervised} + \mathcal{L}_{entropy}$$

is used to back-propagate gradients and update all discriminator and Embedding layer parameters.
5. The noise z is then fed into the generator again to obtain generated samples G(z); G(z) and the unlabeled samples' E(x_CT)+x_NL are fed into the discriminator simultaneously to obtain their outputs D(G(z)) and D(E(x_CT)+x_NL). Finally, the generator objective

$$\mathcal{L}_G = \left\| \mathbb{E}_{x \sim p_{\mathrm{data}}} f\big(E(x_{CT}) + x_{NL}\big) - \mathbb{E}_{z \sim p_z(z)} f\big(G(z)\big) \right\|_2^2$$

is back-propagated to update the parameters of the generator.
6. Steps 4 and 5 are repeated until the preset maximum number of iterations is reached or the objective function values of the generator and the discriminator no longer change significantly, and the parameters of the whole model are saved to disk, yielding an excellent semi-supervised classification model.
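Steps 4 to 6 amount to the usual alternating GAN update. Below is a condensed sketch reusing the hypothetical discriminator_loss and feature_matching_loss from above; the batching scheme, optimizers, and iteration counts are assumptions.

```python
import torch

def train(disc, gen, d_opt, g_opt, loader, K, noise_dim, g_steps=2, epochs=100):
    """Alternate updates: one discriminator step (step 4), then one or two
    generator steps (step 5), repeated until convergence (step 6). Sketch."""
    for _ in range(epochs):
        for (xc_l, xn_l, y_l), (xc_u, xn_u) in loader:
            z = torch.randn(xc_u.size(0), noise_dim)
            # Step 4: update discriminator and Embedding-layer parameters.
            logits_l, _ = disc(disc.transform(xc_l, xn_l))
            logits_u, _ = disc(disc.transform(xc_u, xn_u))
            logits_g, _ = disc(gen(z).detach())
            d_opt.zero_grad()
            discriminator_loss(logits_l, y_l, logits_u, logits_g, K).backward()
            d_opt.step()
            # Step 5: update the generator via feature matching, once or twice.
            for _ in range(g_steps):
                g_opt.zero_grad()
                z = torch.randn(xc_u.size(0), noise_dim)
                feature_matching_loss(disc, gen, xc_u, xn_u, z).backward()
                g_opt.step()
```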
The experimental data are 11 public semi-supervised learning data sets from UCI and KEEL. FIG. 2 shows the basic information of the data, where Data set is the name of the data set, #Samples is its sample size, #Features (Category/Numerical) is the number and type of features, and #Classes is the number of classes. Ten-fold cross-validation is applied to the original data as described above: each fold in turn serves as the test set and the other nine as the training set, giving ten train/test splits, and each training set is divided into labeled and unlabeled data at the four ratios of 10%, 20%, 30% and 40%. That is, control experiments are ultimately performed on 11 × 4 data set configurations.
The experiments compare against 6 existing semi-supervised learning algorithms on two indexes, Accuracy and Cohen's Kappa. These 6 algorithms are all designed around one or more specified base algorithms, and we run each semi-supervised learning algorithm with four different base algorithms. That is, the invention is ultimately compared against 24 algorithm configurations in total.
The 6 comparison algorithms, their base algorithms, and the specific parameter settings of the algorithm of the present invention are shown in FIG. 3, where EmGAN is the algorithm of the invention; KNN, C4.5, Naive Bayes and SMO are the four base algorithms; and Democratic-Co, Self-training, Co-Bagging, Tri-training and DE-Tri-training are among the semi-supervised learning algorithms compared.
Following the steps described in this embodiment, the whole EmGAN model is obtained; the discriminator model is then used to predict the classes of the unlabeled data and of the test set of each data set, and the two indexes Accuracy and Cohen's Kappa are computed against the true class labels. The final experimental results are shown in FIG. 4 and FIG. 5.
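The two indexes can be computed with scikit-learn as in the sketch below, continuing the hypothetical names (disc, xc_test, xn_test, y_test, K) from the sketches above.

```python
import torch
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Evaluate the trained discriminator as a K-class classifier (sketch).
with torch.no_grad():
    logits, _ = disc(disc.transform(xc_test, xn_test))
    # Predict among the first K outputs only; output K+1 marks generated samples.
    y_pred = logits[:, :K].argmax(dim=1).numpy()
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Cohen's Kappa:", cohen_kappa_score(y_test, y_pred))
```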
Experimental analysis: in FIG. 4 and FIG. 5, bold font indicates the algorithm that performs best on each data set. It can be observed that the proposed algorithm performs best on the unlabeled data sets in terms of both Accuracy and Cohen's Kappa. On the test data sets, the proposed algorithm also achieves the best results at the three labeled-sample ratios of 20%, 30% and 40%. In summary, the effectiveness of the proposed method is demonstrated.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Although the invention has been described above with reference to various embodiments, it should be understood that many changes and modifications may be made without departing from the scope of the invention. It is therefore intended that the foregoing detailed description be regarded as illustrative rather than limiting, and that it be understood that it is the following claims, including all equivalents, that are intended to define the spirit and scope of this invention. The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims (3)

1. A method for semi-supervised learning of structured data, characterized by constructing an Embedding GAN model structure suitable for structured data and preprocessing original data X, the preprocessing comprising missing value filling and numerical encoding of categorical features, and dividing the feature set of the processed original data X into a categorical feature subset x_CT and a numerical feature subset x_NL; the input of the model's discriminator D(x; θ_d) being {x_l, x_u, x_g}, where x is the input sample and θ_d denotes the model parameters, including the parameters of an Embedding layer, the Embedding layer being a neural network structure that converts input values into corresponding multidimensional vectors, x_l and x_u are labeled and unlabeled data samples respectively, and x_g is a generator-produced sample; the categorical feature subset x_CT is fed into the Embedding layer to obtain the corresponding dense embedding vector E(x_CT), which is then combined with the numerical feature subset x_NL to obtain a sample E(x_CT)+x_NL with a new feature set; Batch Normalization is applied to obtain a normalized sample over the new feature set, and finally the normalized new sample is fed into the discriminator for training, while the generated sample x_g is used directly as a discriminator input; the generator G(z; θ_g), where z is the input noise and θ_g denotes the generator's model parameters, consists of a three-layer fully connected network, with Batch Normalization applied to each layer's output to prevent the vanishing gradients that may occur during neural network training; noise conforming to the probability distribution p_z(z) is taken as input, the probability distribution p_G of the generated samples x_g is made to fit the probability distribution of the samples E(x_CT)+x_NL with the new feature set, and finally a multi-layer perceptron learns to map the input noise z to samples E(x_CT)+x_NL with the new feature set and outputs a generated sample x_g close to the real sample distribution.
2. The method of claim 1, characterized in that the real samples are recombined via the Embedding layer into samples E(x_CT)+x_NL with a new feature set and the generator matches E(x_CT)+x_NL; in Embedding GAN, the loss function of the generator is:

$$\mathcal{L}_G = \left\| \mathbb{E}_{x \sim p_{\mathrm{data}}} f\big(E(x_{CT}) + x_{NL}\big) - \mathbb{E}_{z \sim p_z(z)} f\big(G(z)\big) \right\|_2^2$$

where G denotes the generator model, f is the function producing the statistics output by an intermediate layer of the discriminator D, $\mathbb{E}_{x \sim p_{\mathrm{data}}} f(E(x_{CT})+x_{NL})$ is the expected statistic of the processed samples E(x_CT)+x_NL with the new feature set, and $\mathbb{E}_{z \sim p_z(z)} f(G(z))$ is the expected statistic obtained by passing the input noise through the generator G and feeding the generated samples into f; since the expected statistics of the generated samples should finally approach those of the real samples, the generator parameters θ_g are optimized to minimize this loss function.
3. The method according to claim 2, characterized in that, in order to guarantee strong true/false confidence at the optimum, a conditional entropy term $\mathcal{L}_{entropy}$ is added to the objective function, the new discriminator objective function being:

$$\max_D \; \mathcal{L}_{supervised} + \mathcal{L}_{unsupervised} + \mathcal{L}_{entropy},$$

which can be divided into a supervised learning term $\mathcal{L}_{supervised}$, an unsupervised learning term $\mathcal{L}_{unsupervised}$, and the newly added conditional entropy term $\mathcal{L}_{entropy}$:

$$\mathcal{L}_{supervised} = \mathbb{E}_{(x_l, y_l) \sim p_{\mathrm{data}}} \log p_D(y_l \mid x_l, y_l \le K)$$

$$\mathcal{L}_{unsupervised} = \mathbb{E}_{x_u \sim p_{\mathrm{data}}} \log p_D(y \le K \mid x_u) + \mathbb{E}_{z \sim p_z(z)} \log p_D(y = K{+}1 \mid G(z))$$

$$\mathcal{L}_{entropy} = \mathbb{E}_{x_u \sim p_{\mathrm{data}}} \sum_{k=1}^{K} p_D(k \mid x_u) \log p_D(k \mid x_u)$$

wherein $\mathcal{L}_{supervised}$ is the expected cross entropy between the confidences output for the first K classes and the true label y_l when the discriminator D receives labeled data x_l with class label y_l; the first term of $\mathcal{L}_{unsupervised}$ is the expected cross entropy between the sum of the confidences of the first K classes and the true label y ≤ K when the input of D is an unlabeled data sample x_u, and its second term is the expected cross entropy between the model's output confidence of class K+1 and the true label y = K+1 when the input of D is a generated sample x_g; $\mathcal{L}_{entropy}$ is the expectation of the conditional entropy term, the information entropy computed over the confidences of the K classes when the input is an unlabeled data sample x_u; training with this new combination of generator and discriminator objective functions yields a complement generator that helps the discriminator find the correct classification decision boundary, and in each model iteration the model parameters of the generator are updated once or twice through the generator's objective function while the parameters of the discriminator are updated once to optimize the discriminator's objective function.
CN201910091581.7A 2019-01-30 2019-01-30 Semi-supervised learning method for structured data Active CN109977094B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910091581.7A CN109977094B (en) 2019-01-30 2019-01-30 Semi-supervised learning method for structured data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910091581.7A CN109977094B (en) 2019-01-30 2019-01-30 Semi-supervised learning method for structured data

Publications (2)

Publication Number Publication Date
CN109977094A CN109977094A (en) 2019-07-05
CN109977094B true CN109977094B (en) 2021-02-19

Family

ID=67076794

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910091581.7A Active CN109977094B (en) 2019-01-30 2019-01-30 Semi-supervised learning method for structured data

Country Status (1)

Country Link
CN (1) CN109977094B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110704221B (en) * 2019-09-02 2020-10-27 西安交通大学 Data center fault prediction method based on data enhancement
CN110569842B (en) * 2019-09-05 2022-08-12 江苏艾佳家居用品有限公司 Semi-supervised learning method for GAN model training
CN110738309B (en) * 2019-09-27 2022-07-12 华中科技大学 DDNN training method and DDNN-based multi-view target identification method and system
CN110719279A (en) * 2019-10-09 2020-01-21 东北大学 Network anomaly detection system and method based on neural network
CN111240279B (en) * 2019-12-26 2021-04-06 浙江大学 Confrontation enhancement fault classification method for industrial unbalanced data
CN111444959A (en) * 2020-03-26 2020-07-24 常州工业职业技术学院 Construction method of stack-type structure hierarchical classification model based on SVM
CN111949886B (en) * 2020-08-28 2023-11-24 腾讯科技(深圳)有限公司 Sample data generation method and related device for information recommendation
CN112232395B (en) * 2020-10-08 2023-10-27 西北工业大学 Semi-supervised image classification method for generating countermeasure network based on joint training
CN113505664B (en) * 2021-06-28 2022-10-18 上海电力大学 Fault diagnosis method for planetary gear box of wind turbine generator
CN113951868B (en) * 2021-10-29 2024-04-09 北京富通东方科技有限公司 Method and device for detecting man-machine asynchronism of mechanical ventilation patient


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10607319B2 (en) * 2017-04-06 2020-03-31 Pixar Denoising monte carlo renderings using progressive neural networks
US20180314932A1 (en) * 2017-04-28 2018-11-01 Intel Corporation Graphics processing unit generative adversarial network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103390171A (en) * 2013-07-24 2013-11-13 南京大学 Safe semi-supervised learning method
CN103903164A (en) * 2014-03-25 2014-07-02 华南理工大学 Semi-supervised automatic aspect extraction method and system based on domain information
CN107392147A (en) * 2017-07-20 2017-11-24 北京工商大学 A kind of image sentence conversion method based on improved production confrontation network
CN107590262A (en) * 2017-09-21 2018-01-16 黄国华 The semi-supervised learning method of big data analysis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Network security situation awareness method based on multi-source and multi-level information fusion; Deng Xiaoheng et al.; Journal of Shanghai Jiao Tong University; 2015-08-31; Vol. 49, No. 8, pp. 1144-1152 *

Also Published As

Publication number Publication date
CN109977094A (en) 2019-07-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant