CN109977094B - Semi-supervised learning method for structured data - Google Patents

Info

Publication number
CN109977094B
Authority
CN
China
Prior art keywords
sample
discriminator
generator
data
input
Prior art date
Legal status
Active
Application number
CN201910091581.7A
Other languages
Chinese (zh)
Other versions
CN109977094A (en)
Inventor
邓晓衡 (Deng Xiaoheng)
黄戎 (Huang Rong)
沈海澜 (Shen Hailan)
Current Assignee
Central South University
Original Assignee
Central South University
Priority date
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN201910091581.7A priority Critical patent/CN109977094B/en
Publication of CN109977094A publication Critical patent/CN109977094A/en
Application granted granted Critical
Publication of CN109977094B publication Critical patent/CN109977094B/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Abstract

The invention discloses a method for semi-supervised learning of structured data. A semi-supervised adversarial neural network model is constructed for structured data; the original structured data X are preprocessed, and the features of X are divided into a categorical feature subset x_CT and a numerical feature subset x_NL. The original input to the model's discriminator is {x_l, x_u, x_g}, where x_l and x_u are labeled and unlabeled samples respectively, x_g is a sample produced by the generator, and x_l and x_u share the same feature set. The categorical feature subset x_CT is fed into an Embedding layer to obtain the corresponding dense embedding vector E(x_CT), which is then combined with the numerical feature subset x_NL to obtain a sample E(x_CT)+x_NL with a new feature set; Batch Normalization (BN) is applied to obtain a normalized sample over the new feature set, and this new sample is fed into the discriminator for training, while the generated sample x_g is fed directly to the discriminator. The generator consists of a three-layer fully connected network, with BN applied to each layer's output to prevent vanishing gradients; it takes noise as input and produces a generated sample x_g with the feature set E(x_CT)+x_NL.

Description

Semi-supervised learning method for structured data
Technical Field
The invention relates to the technical field of computers, in particular to a method for semi-supervised learning of structured data.
Background
The semi-supervised learning problem has attracted a lot of attention in many areas, such as anomaly detection and email archiving. The raw data in many real-world applications are structured and label-free, while a supervised learning task requires a large amount of manually labeled data as a training set, and high-quality manual labeling means more manpower, domain knowledge, and time overhead. Semi-supervised learning is a paradigm proposed to resolve this conflict: it uses a large amount of easily available unlabeled data together with a small amount of manually labeled data to enhance the final performance of the classifier. Essentially, this approach uses the large amount of unlabeled data to correct the hypotheses learned on the labeled data.
Currently, many different technical routes have been proposed to solve the semi-supervised learning problem, mainly: (1) Self-label techniques: under the assumption that the classifier tends to predict labels correctly, a classifier trained on a small amount of labeled data assigns each unlabeled sample its highest-confidence prediction as a new label, yielding a larger labeled data set; well-known semi-supervised learning algorithms of this kind include Democratic-Co, Tri-training, Co-Bagging, and Co-training. (2) Generative models and cluster-then-label methods: generative models were the first algorithms in the field to attempt to exploit unlabeled samples. A joint probability model p(x, y) = p(y)p(x|y) is assumed, where p(x|y) is a known mixture distribution, such as a Gaussian mixture model; this is therefore a deterministic parametric model using labeled and unlabeled data. The cluster-then-label method is very similar to the generative model but is not probabilistic: the general idea is to cluster the whole data set and label each cluster with the help of the labeled data. (3) Graph-based methods: unlabeled and labeled samples are taken as graph nodes and the similarities between nodes as graph edges, turning the semi-supervised problem into a graph-cut problem. (4) Semi-supervised support vector machines (S3VM): an extension of the standard SVM to unlabeled data that realizes the cluster assumption of semi-supervised learning, namely that samples in the same data cluster have similar labels, so that the decision boundary does not cut through dense regions of unlabeled data and the classes are well separated.
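To make route (1) concrete, the following is a minimal self-training sketch in Python; the scikit-learn base classifier, the confidence threshold, and all function and variable names are our own illustrative assumptions, not part of the invention.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def self_training(X_l, y_l, X_u, confidence_threshold=0.95, max_rounds=10):
    """Minimal self-label loop: repeatedly promote the most confident
    unlabeled predictions to pseudo-labeled training data (sketch)."""
    X_l, y_l, X_u = X_l.copy(), y_l.copy(), X_u.copy()
    clf = DecisionTreeClassifier()
    for _ in range(max_rounds):
        clf.fit(X_l, y_l)
        if len(X_u) == 0:
            break
        proba = clf.predict_proba(X_u)
        conf, pred = proba.max(axis=1), proba.argmax(axis=1)
        keep = conf >= confidence_threshold   # most confident predictions only
        if not keep.any():
            break
        # Promote confident unlabeled samples with their pseudo-labels.
        X_l = np.vstack([X_l, X_u[keep]])
        y_l = np.concatenate([y_l, pred[keep]])
        X_u = X_u[~keep]
    return clf
```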
However, in data mining or any type of predictive modeling, data comes before the algorithm. This is the main reason machine learning requires extensive feature engineering before specific tasks such as anomaly detection, email archiving, and many other tasks whose raw data cannot be fed directly into the SSL models described above. Many of the semi-supervised learning algorithms listed above also require extensive feature engineering to achieve good model performance, and most of the time such work demands domain knowledge, creativity, and extensive trial and error. In contrast, deep neural networks can accomplish these types of tasks well without any tedious and time-consuming feature engineering. Of course, domain expertise and sophisticated feature engineering remain very valuable.
Disclosure of Invention
Aiming at these technical problems, the invention constructs a new semi-supervised learning algorithm suitable for structured data based on the generative adversarial network (GAN). The model mainly comprises a generator and a discriminator: the discriminator first judges whether a sample is a real sample or a generator-produced sample, and second classifies samples into the labeled classes; the generator takes Gaussian noise as input to produce generated samples realistic enough to pass as genuine. Through adversarial training between the generator and the discriminator, we finally obtain a generator capable of producing convincing fakes and a discriminator with strong true/false discrimination ability; at the same time, a small amount of labeled samples is used to train the discriminator into an excellent classifier. The whole model is composed of multi-layer fully connected networks, so complex feature engineering is reduced or even avoided, and the classification performance of the model on structured data is significantly improved.
Inspired by the advantages of neural networks, and based on research into Feature Matching (FM) GANs and Bad GANs, we apply an Embedding GAN (EmGAN for short), modified for the data mining application scenario, to semi-supervised learning on structured data, saving a large amount of feature engineering compared with traditional semi-supervised learning algorithms.
Structured data means data in which the rows can be seen as collected data points or observations and the columns as fields representing the individual attributes of each observation.
The main differences between structured data and image data are: (1) unstructured data such as images are typically treated as a single entity composed of homogeneous units, such as pixels or audio samples, whereas structured data contain more data types, mainly numerical data and categorical data; (2) structured data lack the spatial correlation that exists between the pixels of image data.
As described in (1), structured data contain a large number of categorical features. The conventional processing methods are label/ordinal encoding and one-hot encoding, but these techniques have problems in terms of memory and in faithfully representing the category hierarchy, and the continuous nature of neural networks limits their direct application to categorical features. Therefore, representing categorical features with integers and then applying a neural network directly does not yield good results. Borrowing the idea of entity embedding and of recommender-system algorithms such as DeepFM, the method maps the categorical features of each real sample into a high-dimensional dense space and then inputs them, together with the other numerical features, into the discriminator for training. Influenced by (2), our GAN does not adopt the convolution and pooling structures used on computer vision tasks, avoiding the loss of feature information. Combining the theoretical research on Feature Matching (FM) GANs and Bad GANs, a combination of generator and discriminator loss functions suitable for data mining scenarios is proposed. Finally, the method is compared with other semi-supervised learning algorithms applied to structured data on public data sets from UCI and KEEL.
The present invention is directed to at least solving the problems of the prior art. Therefore, the invention discloses a method for semi-supervised learning of structured data, characterized by constructing a semi-supervised adversarial neural network model structure for structured data (called Embedding Generative Adversarial Net, Embedding GAN for short). The original data X are preprocessed and the feature set of X is divided into a categorical feature subset x_CT and a numerical feature subset x_NL. The original input samples of the model's discriminator are {x_l, x_u, x_g}, where x_l and x_u are labeled and unlabeled data samples respectively and x_g is a generator-produced sample. The categorical features x_CT are fed into an Embedding layer to obtain the corresponding dense embedding vector E(x_CT), which is then combined with the numerical feature subset x_NL to obtain a sample E(x_CT)+x_NL with a new feature set; Batch Normalization (BN) is applied to obtain a normalized sample over the new feature set, and finally the new sample is fed into the discriminator for training, while the generated sample x_g serves directly as an input to the discriminator. The generator G(z; θ_g) consists of a three-layer fully connected network, with BN applied to each layer's output to prevent vanishing gradients. Noise conforming to the probability distribution p_z(z) is taken as input, and the generated sample distribution p_G is made to fit the distribution of E(x_CT)+x_NL; a multi-layer perceptron learns to map the input noise z to data E(x_CT)+x_NL and finally outputs a generated sample x_g close to the real samples. P(y = K+1 | x) denotes the probability that x is a fake sample.
Further, the real samples are recombined through the Embedding layer into new feature samples E(x_CT)+x_NL, and the generator matches E(x_CT)+x_NL. In Embedding GAN, the loss function of the generator is:

$$\mathcal{L}_G = \left\| \mathbb{E}_{x \sim p_{\mathrm{data}}} f\big(E(x_{CT}) + x_{NL}\big) - \mathbb{E}_{z \sim p_z(z)} f\big(G(z)\big) \right\|_2^2$$

where G denotes the generator model, the function f denotes the statistics output by an intermediate layer of the discriminator D, $\mathbb{E}_{x \sim p_{\mathrm{data}}} f(E(x_{CT})+x_{NL})$ is the expected statistic of the real samples, and $\mathbb{E}_{z \sim p_z(z)} f(G(z))$ is the expected statistic obtained by passing the input noise through the generator G and feeding the generated samples into f.
Furthermore, in order to guarantee strong true/false confidence at the optimum, a conditional entropy term $\mathcal{L}_{entropy}$ is added to the objective function. The new discriminator objective function is:

$$\max_D \; \mathcal{L}_{supervised} + \mathcal{L}_{unsupervised} + \mathcal{L}_{entropy}$$

wherein:

$$\mathcal{L}_{supervised} = \mathbb{E}_{(x_l, y_l) \sim p_{\mathrm{data}}} \log p_D(y_l \mid x_l, y_l \le K)$$

$$\mathcal{L}_{unsupervised} = \mathbb{E}_{x_u \sim p_{\mathrm{data}}} \log p_D(y \le K \mid x_u) + \mathbb{E}_{z \sim p_z(z)} \log p_D(y = K{+}1 \mid G(z))$$

$$\mathcal{L}_{entropy} = \mathbb{E}_{x_u \sim p_{\mathrm{data}}} \sum_{k=1}^{K} p_D(k \mid x_u) \log p_D(k \mid x_u)$$

Here $\mathcal{L}_{supervised}$ is the cross entropy between the confidences (probabilities) output for the first K classes and the true label y_l when the discriminator D receives labeled data x_l with class label y_l; the first term of $\mathcal{L}_{unsupervised}$ is the cross entropy between the output confidences of the first K classes and the true label y ≤ K when the input of D is an unlabeled sample x_u, and its second term is the cross entropy between the model's output confidence of class K+1 and the true label y = K+1 when the input of D is a generated sample x_g; $\mathcal{L}_{entropy}$ is the conditional entropy term, the information entropy computed over the confidences of the K classes when the input is an unlabeled sample x_u.
This new combination of generator and discriminator objective functions can produce a complement generator that helps the discriminator find the correct classification decision boundary. In each model iteration, the parameters of the generator are updated once or twice through the generator's objective function, and the parameters of the discriminator are updated once to optimize the discriminator's objective function.
The invention has the following beneficial effects: 1. the adversarial neural network model is applied to a semi-supervised learning task on structured data and achieves a better effect; 2. for the large number of categorical features characteristic of structured data, the categorical features of each real sample are mapped into a high-dimensional dense space by an Embedding layer and then input, together with the other numerical features, into the discriminator for training, which effectively improves model performance; 3. for the structured data application scenario, a new combination of generator and discriminator objective functions is proposed, so that the generator can effectively produce complement samples, the discriminator can find the decision boundaries between real classes more accurately, and an excellent classifier model is obtained.
Drawings
The invention will be further understood from the following description in conjunction with the accompanying drawings. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the embodiments. In the drawings, like reference numerals designate corresponding parts throughout the different views.
FIG. 1 is a block diagram of a semi-supervised learning approach for structured data in accordance with the present invention;
FIG. 2 is basic information of experimental data in one embodiment of the invention;
FIG. 3 is a parameter set of experimental data in one embodiment of the invention;
FIG. 4 is an experimental result on an unlabeled dataset in one embodiment of the invention;
FIG. 5 is experimental results on a test data set in one embodiment of the invention.
Detailed Description
Example one
The invention mainly provides a new model, Embedding GAN (EmGAN), suitable for structured data and built on semi-supervised GANs. The following three aspects are described in detail: the model structure, and the objective functions of the generator and the discriminator.
1. Model structure
The structure of the whole algorithm model is shown in fig. 1. This model is divided into three parts by dashed boxes:
A) The upper left corner is the preprocessing portion of the original data x (structured data containing K class labels), which includes labeled samples x_l with their labels y_l, unlabeled samples x_u, and test-set samples {x_test, y_test}. As illustrated, we partition the feature set of the raw data x into a categorical feature subset x_CT and a numerical feature subset x_NL.
B) Within the right dashed box is a six-layer fully connected network D(x; θ_d), where θ_d denotes the discriminator's model parameters, including the parameters of the Embedding layer in the figure, serving as the discriminator, i.e., the classifier ultimately to be obtained by semi-supervised learning. The original input to the discriminator is {x_l, x_u, x_g}, where x_g are samples generated by the generator described below. To better extract the latent semantic information of the categorical feature subset x_CT of the real samples x_l and x_u, we first feed the categorical features x_CT into the Embedding layer to obtain the corresponding dense embedding vectors E(x_CT), then combine them with the numerical feature subset x_NL to get a new sample feature set E(x_CT)+x_NL, and apply Batch Normalization (BN) to obtain normalized samples over the new feature set. Finally, the new samples are fed into the discriminator for training, while the generated samples x_g are used directly as discriminator inputs. The output dimension is increased from the K classes of the real data to K+1 classes, and we use p_model(y = K+1 | x) to denote the probability that x is a fake sample.
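A minimal PyTorch sketch of this part of the pipeline is given below: one embedding table per categorical feature, concatenation with the numerical features, BN over the combined vector, and a fully connected stack with K+1 outputs. The layer widths, the embedding dimension, and the shortened depth (the patent's discriminator uses six fully connected layers) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EmbeddingDiscriminator(nn.Module):
    """Sketch of part B): embed categorical features, concatenate numerical
    features, apply BatchNorm, then a fully connected stack with K+1 outputs."""
    def __init__(self, cat_cardinalities, emb_dim, num_numerical,
                 num_classes_k, hidden=64):
        super().__init__()
        # One embedding table per categorical feature in x_CT.
        self.embeddings = nn.ModuleList(
            nn.Embedding(card, emb_dim) for card in cat_cardinalities)
        in_dim = emb_dim * len(cat_cardinalities) + num_numerical
        self.input_bn = nn.BatchNorm1d(in_dim)       # normalize E(x_CT)+x_NL
        self.body = nn.Sequential(                   # fully connected stack
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head = nn.Linear(hidden, num_classes_k + 1)  # K real classes + fake

    def transform(self, x_cat, x_num):
        """Build E(x_CT)+x_NL for real samples; x_cat holds integer codes,
        x_num holds float features."""
        emb = [e(x_cat[:, i]) for i, e in enumerate(self.embeddings)]
        return self.input_bn(torch.cat(emb + [x_num], dim=1))

    def forward(self, x):
        """x is either transform(x_cat, x_num) or a generated sample x_g."""
        feats = self.body(x)   # these activations play the role of f(x) below
        return self.head(feats), feats
```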
C) The lower left corner is the generator G(z; θ_g), where θ_g denotes the generator's model parameters. It consists of a three-layer fully connected network, with BN applied to each layer's output to prevent vanishing gradients. Unlike the previously proposed semi-supervised FM GANs, whose generator learns the distribution p_G of the raw data x, here noise drawn from the prior p_z(z) is taken as input and p_G is made to fit the distribution of E(x_CT)+x_NL. A multi-layer perceptron learns to map the input noise z to data E(x_CT)+x_NL, finally outputting a generated sample x_g close to the real samples, which is also one of the inputs to the discriminator mentioned above.
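Correspondingly, here is a sketch of the three-layer fully connected generator with BN on each layer's output; the noise dimension and hidden width are assumed values, and the output dimension must equal that of E(x_CT)+x_NL.

```python
import torch.nn as nn

class Generator(nn.Module):
    """Sketch of part C): three fully connected layers, BN on each layer's
    output, mapping noise z ~ p_z(z) into the E(x_CT)+x_NL space."""
    def __init__(self, noise_dim, out_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),    nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),   nn.BatchNorm1d(out_dim),
        )

    def forward(self, z):
        return self.net(z)   # x_g, fed directly to the discriminator
```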
2. Objective function of generator
Feature matching is a generator objective function previously proposed to address the instability of GAN training; it effectively prevents the generator from over-training on the current discriminator. Its objective is no longer simply to maximize the discriminator's output on the generated samples; instead, it requires the generator's samples G(z) to match the statistics of the real samples x, where we simply use the discriminator to specify which statistics are worth matching. Specifically, the generator is trained to match the output values of an intermediate layer of the discriminator; in EmGAN, the activation output of the last fully connected layer is matched. Since the purpose of the discriminator is precisely to find the features that, under the current model, best distinguish the generated samples G(z) from the real samples x, these intermediate activations are a natural choice of statistics for the new generator objective. Experiments show that feature matching is indeed very effective in cases where conventional GAN training becomes unstable.
The original feature matching loss function is:

$$\mathcal{L}_G = \left\| \mathbb{E}_{x \sim p_{\mathrm{data}}} f(x) - \mathbb{E}_{z \sim p_z(z)} f\big(G(z)\big) \right\|_2^2$$

where G denotes the generator model, the function f denotes the statistics output by an intermediate layer of the discriminator D, $\mathbb{E}_{x \sim p_{\mathrm{data}}} f(x)$ is the expected statistic of the real samples x, and $\mathbb{E}_{z \sim p_z(z)} f(G(z))$ is the expected statistic obtained by passing the input noise through the generator G and feeding the generated samples into f.
As mentioned when introducing the model structure, in order to better express the latent semantic information of the categorical features and improve the learning performance of the whole network, we recombine the real samples through the Embedding layer into new feature samples E(x_CT)+x_NL, and we require the generator to match E(x_CT)+x_NL. So in Embedding GAN, the loss function of the generator is:

$$\mathcal{L}_G = \left\| \mathbb{E}_{x \sim p_{\mathrm{data}}} f\big(E(x_{CT}) + x_{NL}\big) - \mathbb{E}_{z \sim p_z(z)} f\big(G(z)\big) \right\|_2^2$$
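Under these definitions, the Embedding GAN feature-matching loss can be sketched as follows, reusing the hypothetical EmbeddingDiscriminator and Generator from the sketches above; `transform` builds E(x_CT)+x_NL, and the discriminator's second return value plays the role of f.

```python
import torch

def feature_matching_loss(disc, gen, x_cat, x_num, z):
    """L_G = || E_x f(E(x_CT)+x_NL) - E_z f(G(z)) ||^2 (sketch)."""
    real = disc.transform(x_cat, x_num)    # E(x_CT)+x_NL for real samples
    _, f_real = disc(real)                 # intermediate statistics f(x)
    _, f_fake = disc(gen(z))               # f(G(z)); gen output dim matches real
    return (f_real.mean(dim=0) - f_fake.mean(dim=0)).pow(2).sum()
```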
3. Objective function of the discriminator
It has been noted in the research literature that when a perfect generator is available (i.e., the generator produces samples indistinguishable from real samples), having the discriminator add a (K+1)-th class to represent fake samples does not help improve generalization performance. So, in theory, a complement generator should be trained to help produce an excellent semi-supervised classifier. When G is a generator that produces complement samples, ideally a near-optimal discriminator D can find the correct decision boundary in feature space between the high-density subsets of the data. That is, the complement samples produced by the generator force the discriminator's output probability for the real classes to be low in low-density regions, so the discriminator finally places the correct class boundaries in the low-density regions.
For the above complement generator to work, the discriminator needs to have strong true/false confidence on the unlabeled samples x_u. However, the most primitive GAN discriminator objective does not meet this requirement: to reach the correct classification boundary it only needs the discriminator to satisfy

$$\sum_{k=1}^{K} p_D(k \mid x_u) > p_D(y = K{+}1 \mid x_u),$$

so even if the probabilities p_D(k | x), k ≤ K, are uniformly distributed, the original objective function cannot enhance the classification performance of the discriminator. In order to guarantee strong true/false confidence at the optimum, a conditional entropy term $\mathcal{L}_{entropy}$ is added to the objective function. The new discriminator objective function is:
$$\max_D \; \mathcal{L}_{supervised} + \mathcal{L}_{unsupervised} + \mathcal{L}_{entropy}$$

wherein:

$$\mathcal{L}_{supervised} = \mathbb{E}_{(x_l, y_l) \sim p_{\mathrm{data}}} \log p_D(y_l \mid x_l, y_l \le K)$$

$$\mathcal{L}_{unsupervised} = \mathbb{E}_{x_u \sim p_{\mathrm{data}}} \log p_D(y \le K \mid x_u) + \mathbb{E}_{z \sim p_z(z)} \log p_D(y = K{+}1 \mid G(z))$$

$$\mathcal{L}_{entropy} = \mathbb{E}_{x_u \sim p_{\mathrm{data}}} \sum_{k=1}^{K} p_D(k \mid x_u) \log p_D(k \mid x_u)$$

In the above formulas, $\mathcal{L}_{supervised}$ is the cross entropy between the confidences (probabilities) output for the first K classes and the true label y_l when the discriminator D receives labeled data x_l with class label y_l; the first term of $\mathcal{L}_{unsupervised}$ is the cross entropy between the output confidences of the first K classes and the true label y ≤ K when the input of D is an unlabeled sample x_u, and its second term is the cross entropy between the model's output confidence of class K+1 and the true label y = K+1 when the input of D is a generated sample x_g; $\mathcal{L}_{entropy}$ is the conditional entropy term, the information entropy computed over the confidences of the K classes when the input is an unlabeled sample x_u.
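The discriminator objective can be sketched as the following loss to be minimized (the negative of the objective above). Names are illustrative; the supervised term is taken over all K+1 logits as a common simplification, and whether the K confidences in the entropy term are renormalized is an implementation choice the patent leaves open.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(logits_l, y_l, logits_u, logits_g, K):
    """Negative of the discriminator objective above, to be minimized (sketch)."""
    # L_supervised: cross entropy on labeled samples (over all K+1 logits here,
    # a common simplification of log p_D(y_l | x_l, y_l <= K)).
    l_sup = F.cross_entropy(logits_l, y_l)
    # First unsupervised term: -log p_D(y <= K | x_u), computed by
    # log-sum-exp over the first K log-probabilities.
    log_p_u = F.log_softmax(logits_u, dim=1)
    l_unsup_real = -torch.logsumexp(log_p_u[:, :K], dim=1).mean()
    # Second unsupervised term: -log p_D(y = K+1 | G(z)) on generated samples
    # (index K is the (K+1)-th, "fake", output).
    l_unsup_fake = -F.log_softmax(logits_g, dim=1)[:, K].mean()
    # Conditional entropy term: minimizing the entropy of the K real-class
    # confidences maximizes E[sum_k p log p]; we renormalize over the first
    # K logits here, which is one possible reading.
    p_k = F.softmax(logits_u[:, :K], dim=1)
    l_entropy = -(p_k * torch.log(p_k + 1e-8)).sum(dim=1).mean()
    return l_sup + l_unsup_real + l_unsup_fake + l_entropy
```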
In practical data mining applications, this new combination of generator and discriminator objective functions can produce an effective complement generator that helps the discriminator find the correct classification decision boundary. In the actual training process, in each model iteration the parameters of the generator are updated once or twice through the generator's objective function, while the parameters of the discriminator are updated once to optimize the discriminator's objective function.
In summary, the invention discloses a method for semi-supervised learning of structured data. A semi-supervised adversarial neural network model structure (Embedding Generative Adversarial Net, Embedding GAN for short) is constructed for structured data; the original structured data X are preprocessed and the features of X are divided into a categorical feature subset x_CT and a numerical feature subset x_NL. The original input to the model's discriminator is {x_l, x_u, x_g}, where x_l and x_u are labeled and unlabeled data samples respectively, x_g is a generator-produced sample, and x_l and x_u share the same feature set. First, the categorical feature subset x_CT is fed into the Embedding layer to obtain the corresponding dense embedding vector E(x_CT), which is then combined with the numerical feature subset x_NL to obtain a sample E(x_CT)+x_NL with a new feature set; Batch Normalization (BN) is applied to obtain a normalized sample over the new feature set, and finally the new sample is fed into the discriminator for training, while the generated sample x_g is used directly as a discriminator input. The generator G(z; θ_g) consists of a three-layer fully connected network, with BN applied to each layer's output to prevent vanishing gradients; it takes noise as input and produces a generated sample x_g with the feature set E(x_CT)+x_NL.
Example two
This embodiment describes the algorithm; it may be implemented in any programming language and hardware environment and is not limited by the implementation environment of the following specific example.
A method for semi-supervised learning of structured data comprises constructing an Embedding GAN model structure suitable for structured data and preprocessing the original data X (including missing value filling and numerical encoding of categorical features), the feature set of the processed original data X being divided into a categorical feature subset x_CT and a numerical feature subset x_NL. The input of the model's discriminator D(x; θ_d), where x is the input sample and θ_d denotes the discriminator's model parameters including those of the Embedding layer, is {x_l, x_u, x_g}, where x_l and x_u are labeled and unlabeled data samples respectively and x_g is a generator-produced sample. The categorical features x_CT are fed into an Embedding layer (a neural network structure that converts input values into corresponding multidimensional vectors) to obtain the corresponding dense embedding vectors E(x_CT), which are then combined with the numerical feature subset x_NL to obtain samples E(x_CT)+x_NL with a new feature set; Batch Normalization (BN; since the neural network is trained in batches, this technique normalizes the feature values of each batch of samples) is applied to obtain normalized samples over the new feature set, and finally the normalized new samples are fed into the discriminator for training, while the generated samples x_g are used directly as discriminator inputs. The generator G(z; θ_g), where z is the input noise and θ_g denotes the generator's model parameters, consists of a three-layer fully connected network, with BN applied to each layer's output to prevent the vanishing gradients that may occur during neural network training. Noise conforming to the probability distribution p_z(z) is taken as input, and the probability distribution p_G of the generated samples x_g is made to fit the distribution of the preprocessed new samples E(x_CT)+x_NL; the multi-layer perceptron finally learns to map the input noise z to the data E(x_CT)+x_NL produced from the real samples and outputs a generated sample x_g close to the real sample distribution.
Further, the real samples are recombined through the Embedding layer into samples E(x_CT)+x_NL with new features, and the generator matches E(x_CT)+x_NL. In Embedding GAN, the loss function of the generator is:

$$\mathcal{L}_G = \left\| \mathbb{E}_{x \sim p_{\mathrm{data}}} f\big(E(x_{CT}) + x_{NL}\big) - \mathbb{E}_{z \sim p_z(z)} f\big(G(z)\big) \right\|_2^2$$

where G denotes the generator model, f is the function producing the statistics output by an intermediate layer of the discriminator D, $\mathbb{E}_{x \sim p_{\mathrm{data}}} f(E(x_{CT})+x_{NL})$ is the expected statistic of the processed real samples E(x_CT)+x_NL, and $\mathbb{E}_{z \sim p_z(z)} f(G(z))$ is the expected statistic obtained by passing the input noise through the generator G and feeding the generated samples into f. Since the expected statistics of the generated samples should approach those of the real samples, the generator parameters θ_g are optimized to minimize this loss function.
Furthermore, in order to guarantee strong true/false confidence at the optimum, a conditional entropy term $\mathcal{L}_{entropy}$ is added to the objective function. The new discriminator objective function is:

$$\max_D \; \mathcal{L}_{supervised} + \mathcal{L}_{unsupervised} + \mathcal{L}_{entropy},$$

which can be divided into a supervised learning term $\mathcal{L}_{supervised}$, an unsupervised learning term $\mathcal{L}_{unsupervised}$, and the newly added conditional entropy term $\mathcal{L}_{entropy}$:

$$\mathcal{L}_{supervised} = \mathbb{E}_{(x_l, y_l) \sim p_{\mathrm{data}}} \log p_D(y_l \mid x_l, y_l \le K)$$

$$\mathcal{L}_{unsupervised} = \mathbb{E}_{x_u \sim p_{\mathrm{data}}} \log p_D(y \le K \mid x_u) + \mathbb{E}_{z \sim p_z(z)} \log p_D(y = K{+}1 \mid G(z))$$

$$\mathcal{L}_{entropy} = \mathbb{E}_{x_u \sim p_{\mathrm{data}}} \sum_{k=1}^{K} p_D(k \mid x_u) \log p_D(k \mid x_u)$$

Here $\mathcal{L}_{supervised}$ is the expected cross entropy between the confidences (probabilities) output for the first K classes and the true label y_l when the discriminator D receives labeled data x_l with class label y_l; the first term of $\mathcal{L}_{unsupervised}$ is the expected cross entropy between the sum of the confidences of the first K classes (representing the classes of the real samples, of which there are K) and the true label y ≤ K when the input of D is an unlabeled real sample x_u, and its second term is the expected cross entropy between the model's output confidence of class K+1 and the true label y = K+1 when the input of D is a generated sample x_g (class K+1 being the class label of generated samples); $\mathcal{L}_{entropy}$ is the expectation of the conditional entropy term, the information entropy computed over the confidences of the K classes when the input is an unlabeled sample x_u.
Training with this new combination of generator and discriminator objective functions yields a complement generator that helps the discriminator find the correct classification decision boundary. In each model iteration, the model parameters of the generator are updated once or twice through the generator's objective function, while the parameters of the discriminator are updated once to optimize the discriminator's objective function.
In this embodiment, the specific steps of model training are as follows:
1. based on the description of the model structure, the whole network structure is realized by using a computer language, which comprises the following steps: the generator and the discriminator.
2. The original labeled data are divided into ten folds using ten-fold cross-validation; each fold in turn serves as the test set and the other nine as the training set. The original data thus become ten data sets with different train/test splits, and each training set is then divided into labeled and unlabeled data at four ratios: 10%, 20%, 30% and 40%. For example, suppose we have 100 original samples: we first divide the data into ten folds and use nine of them as the training set for each run. At a labeling ratio of 10%, we use 10 samples together with their labels as labeled input, strip the labels from the other 90 samples to use them as unlabeled input, and evaluate the model's performance on the remaining fold.
3. The original training data is divided into labeled data and unlabeled data, and non-numeric type class features are encoded with numeric values.
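Steps 2 and 3 can be sketched with scikit-learn utilities as below; the column handling, the StratifiedKFold choice, and the function name are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import OrdinalEncoder

def make_splits(df, target_col, label_ratio=0.10, seed=0):
    """Ten-fold cross-validation; within each training fold, keep `label_ratio`
    of the samples labeled and strip the labels from the rest (sketch)."""
    X, y = df.drop(columns=[target_col]), df[target_col].to_numpy()
    cat_cols = X.select_dtypes(exclude="number").columns
    if len(cat_cols):
        # Step 3: integer-code the non-numeric categorical features.
        X[cat_cols] = OrdinalEncoder().fit_transform(X[cat_cols])
    rng = np.random.default_rng(seed)
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):
        n_lab = max(1, int(label_ratio * len(train_idx)))
        lab = rng.choice(train_idx, size=n_lab, replace=False)
        unl = np.setdiff1d(train_idx, lab)
        yield (X.iloc[lab], y[lab],          # labeled training data
               X.iloc[unl],                  # unlabeled training data
               X.iloc[test_idx], y[test_idx])
```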
4. First, the unlabeled data are replicated to the same size as the labeled data. Then the categorical features x_CT of the labeled and unlabeled data are fed into the Embedding layer to obtain their embedding vectors E(x_CT). The embedding vectors are combined with the other original features to obtain new samples E(x_CT)+x_NL, which are fed into the discriminator to obtain its outputs on the labeled and unlabeled data. At the same time, Gaussian random noise z is passed through the generator and then the discriminator to obtain the statistics f(G(z)) of the generated samples G(z). Finally, the discriminator objective

$$\max_D \; \mathcal{L}_{supervised} + \mathcal{L}_{unsupervised} + \mathcal{L}_{entropy}$$

is used to back-propagate gradients and update all discriminator and Embedding layer parameters.
5. The noise z is then fed into the generator again to obtain generated samples G(z); G(z) and the unlabeled samples' E(x_CT)+x_NL are fed into the discriminator simultaneously to obtain their outputs D(G(z)) and D(E(x_CT)+x_NL). Finally, the generator objective

$$\mathcal{L}_G = \left\| \mathbb{E}_{x \sim p_{\mathrm{data}}} f\big(E(x_{CT}) + x_{NL}\big) - \mathbb{E}_{z \sim p_z(z)} f\big(G(z)\big) \right\|_2^2$$

is back-propagated to update the parameters of the generator.
6. Steps 4 and 5 are repeated until the preset maximum number of iterations is reached or the objective function values of the generator and the discriminator no longer change significantly, and the parameters of the whole model are saved to disk, yielding an excellent semi-supervised classification model.
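Steps 4 to 6 amount to the usual alternating GAN update. Below is a condensed sketch reusing the hypothetical discriminator_loss and feature_matching_loss from above; the batching scheme, optimizers, and iteration counts are assumptions.

```python
import torch

def train(disc, gen, d_opt, g_opt, loader, K, noise_dim, g_steps=2, epochs=100):
    """Alternate updates: one discriminator step (step 4), then one or two
    generator steps (step 5), repeated until convergence (step 6). Sketch."""
    for _ in range(epochs):
        for (xc_l, xn_l, y_l), (xc_u, xn_u) in loader:
            z = torch.randn(xc_u.size(0), noise_dim)
            # Step 4: update discriminator and Embedding-layer parameters.
            logits_l, _ = disc(disc.transform(xc_l, xn_l))
            logits_u, _ = disc(disc.transform(xc_u, xn_u))
            logits_g, _ = disc(gen(z).detach())
            d_opt.zero_grad()
            discriminator_loss(logits_l, y_l, logits_u, logits_g, K).backward()
            d_opt.step()
            # Step 5: update the generator via feature matching, once or twice.
            for _ in range(g_steps):
                g_opt.zero_grad()
                z = torch.randn(xc_u.size(0), noise_dim)
                feature_matching_loss(disc, gen, xc_u, xn_u, z).backward()
                g_opt.step()
```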
The experimental data are 11 public semi-supervised learning data sets from UCI and KEEL. FIG. 2 shows the basic information of the data, where Data set is the name of the data set, #Samples is its sample size, #Features (Category/Numerical) is the number and type of features, and #Classes is the number of classes. Ten-fold cross-validation is applied to the original data as described above: each fold in turn serves as the test set and the other nine as the training set, giving ten train/test splits, and each training set is divided into labeled and unlabeled data at the four ratios of 10%, 20%, 30% and 40%. That is, control experiments are ultimately performed on 11 × 4 data set configurations.
The experiments compare against 6 existing semi-supervised learning algorithms on two indexes, Accuracy and Cohen's Kappa. These 6 algorithms are all designed around one or more specified base algorithms, and we run each semi-supervised learning algorithm with four different base algorithms. That is, the invention is ultimately compared against 24 algorithm configurations in total.
The 6 comparison algorithms, their base algorithms, and the specific parameter settings of the algorithm of the present invention are shown in FIG. 3, where EmGAN is the algorithm of the invention; KNN, C4.5, Naive Bayes and SMO are the four base algorithms; and Democratic-Co, Self-training, Co-Bagging, Tri-training and DE-Tri-training are among the semi-supervised learning algorithms compared.
Following the steps described in this embodiment, the whole EmGAN model is obtained; the discriminator model is then used to predict the classes of the unlabeled data and of the test set of each data set, and the two indexes Accuracy and Cohen's Kappa are computed against the true class labels. The final experimental results are shown in FIG. 4 and FIG. 5.
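The two indexes can be computed with scikit-learn as in the sketch below, continuing the hypothetical names (disc, xc_test, xn_test, y_test, K) from the sketches above.

```python
import torch
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Evaluate the trained discriminator as a K-class classifier (sketch).
with torch.no_grad():
    logits, _ = disc(disc.transform(xc_test, xn_test))
    # Predict among the first K outputs only; output K+1 marks generated samples.
    y_pred = logits[:, :K].argmax(dim=1).numpy()
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Cohen's Kappa:", cohen_kappa_score(y_test, y_pred))
```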
Experimental analysis: in FIG. 4 and FIG. 5, bold font indicates the algorithm that performs best on each data set. It can be observed that the proposed algorithm performs best on the unlabeled data sets in terms of both Accuracy and Cohen's Kappa. On the test data sets, the proposed algorithm also achieves the best results at the three labeled-sample ratios of 20%, 30% and 40%. In summary, the effectiveness of the proposed method is demonstrated.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Although the invention has been described above with reference to various embodiments, it should be understood that many changes and modifications may be made without departing from the scope of the invention. It is therefore intended that the foregoing detailed description be regarded as illustrative rather than limiting, and that it be understood that it is the following claims, including all equivalents, that are intended to define the spirit and scope of this invention. The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims (3)

1. A method for semi-supervised learning of structured data, characterized by constructing an Embedding GAN model structure suitable for structured data and preprocessing original data X, the preprocessing comprising missing value filling and numerical encoding of categorical features, and dividing the feature set of the processed original data X into a categorical feature subset x_CT and a numerical feature subset x_NL; the input of the model's discriminator D(x; θ_d) being {x_l, x_u, x_g}, where x is the input sample and θ_d denotes the model parameters, including the parameters of an Embedding layer, the Embedding layer being a neural network structure that converts input values into corresponding multidimensional vectors, x_l and x_u are labeled and unlabeled data samples respectively, and x_g is a generator-produced sample; the categorical feature subset x_CT is fed into the Embedding layer to obtain the corresponding dense embedding vector E(x_CT), which is then combined with the numerical feature subset x_NL to obtain a sample E(x_CT)+x_NL with a new feature set; Batch Normalization is applied to obtain a normalized sample over the new feature set, and finally the normalized new sample is fed into the discriminator for training, while the generated sample x_g is used directly as a discriminator input; the generator G(z; θ_g), where z is the input noise and θ_g denotes the generator's model parameters, consists of a three-layer fully connected network, with Batch Normalization applied to each layer's output to prevent the vanishing gradients that may occur during neural network training; noise conforming to the probability distribution p_z(z) is taken as input, the probability distribution p_G of the generated samples x_g is made to fit the probability distribution of the samples E(x_CT)+x_NL with the new feature set, and finally a multi-layer perceptron learns to map the input noise z to samples E(x_CT)+x_NL with the new feature set and outputs a generated sample x_g close to the real sample distribution.
2. The method of claim 1, characterized in that the real samples are recombined via the Embedding layer into samples E(x_CT)+x_NL with a new feature set and the generator matches E(x_CT)+x_NL; in Embedding GAN, the loss function of the generator is:

$$\mathcal{L}_G = \left\| \mathbb{E}_{x \sim p_{\mathrm{data}}} f\big(E(x_{CT}) + x_{NL}\big) - \mathbb{E}_{z \sim p_z(z)} f\big(G(z)\big) \right\|_2^2$$

where G denotes the generator model, f is the function producing the statistics output by an intermediate layer of the discriminator D, $\mathbb{E}_{x \sim p_{\mathrm{data}}} f(E(x_{CT})+x_{NL})$ is the expected statistic of the processed samples E(x_CT)+x_NL with the new feature set, and $\mathbb{E}_{z \sim p_z(z)} f(G(z))$ is the expected statistic obtained by passing the input noise through the generator G and feeding the generated samples into f; since the expected statistics of the generated samples should finally approach those of the real samples, the generator parameters θ_g are optimized to minimize this loss function.
3. The method according to claim 2, characterized in that, in order to guarantee strong true/false confidence at the optimum, a conditional entropy term $\mathcal{L}_{entropy}$ is added to the objective function, the new discriminator objective function being:

$$\max_D \; \mathcal{L}_{supervised} + \mathcal{L}_{unsupervised} + \mathcal{L}_{entropy},$$

which can be divided into a supervised learning term $\mathcal{L}_{supervised}$, an unsupervised learning term $\mathcal{L}_{unsupervised}$, and the newly added conditional entropy term $\mathcal{L}_{entropy}$:

$$\mathcal{L}_{supervised} = \mathbb{E}_{(x_l, y_l) \sim p_{\mathrm{data}}} \log p_D(y_l \mid x_l, y_l \le K)$$

$$\mathcal{L}_{unsupervised} = \mathbb{E}_{x_u \sim p_{\mathrm{data}}} \log p_D(y \le K \mid x_u) + \mathbb{E}_{z \sim p_z(z)} \log p_D(y = K{+}1 \mid G(z))$$

$$\mathcal{L}_{entropy} = \mathbb{E}_{x_u \sim p_{\mathrm{data}}} \sum_{k=1}^{K} p_D(k \mid x_u) \log p_D(k \mid x_u)$$

wherein $\mathcal{L}_{supervised}$ is the expected cross entropy between the confidences output for the first K classes and the true label y_l when the discriminator D receives labeled data x_l with class label y_l; the first term of $\mathcal{L}_{unsupervised}$ is the expected cross entropy between the sum of the confidences of the first K classes and the true label y ≤ K when the input of D is an unlabeled data sample x_u, and its second term is the expected cross entropy between the model's output confidence of class K+1 and the true label y = K+1 when the input of D is a generated sample x_g; $\mathcal{L}_{entropy}$ is the expectation of the conditional entropy term, the information entropy computed over the confidences of the K classes when the input is an unlabeled data sample x_u; training with this new combination of generator and discriminator objective functions yields a complement generator that helps the discriminator find the correct classification decision boundary, and in each model iteration the model parameters of the generator are updated once or twice through the generator's objective function while the parameters of the discriminator are updated once to optimize the discriminator's objective function.
CN201910091581.7A 2019-01-30 2019-01-30 Semi-supervised learning method for structured data Active CN109977094B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910091581.7A CN109977094B (en) 2019-01-30 2019-01-30 Semi-supervised learning method for structured data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910091581.7A CN109977094B (en) 2019-01-30 2019-01-30 Semi-supervised learning method for structured data

Publications (2)

Publication Number Publication Date
CN109977094A CN109977094A (en) 2019-07-05
CN109977094B true CN109977094B (en) 2021-02-19

Family

ID=67076794

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910091581.7A Active CN109977094B (en) 2019-01-30 2019-01-30 Semi-supervised learning method for structured data

Country Status (1)

Country Link
CN (1) CN109977094B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110704221B (en) * 2019-09-02 2020-10-27 西安交通大学 Data center fault prediction method based on data enhancement
CN110569842B (en) * 2019-09-05 2022-08-12 江苏艾佳家居用品有限公司 Semi-supervised learning method for GAN model training
CN110738309B (en) * 2019-09-27 2022-07-12 华中科技大学 DDNN training method and DDNN-based multi-view target identification method and system
CN110719279A (en) * 2019-10-09 2020-01-21 东北大学 Network anomaly detection system and method based on neural network
CN111240279B (en) * 2019-12-26 2021-04-06 浙江大学 Confrontation enhancement fault classification method for industrial unbalanced data
CN111444959A (en) * 2020-03-26 2020-07-24 常州工业职业技术学院 Construction method of stack-type structure hierarchical classification model based on SVM
CN111949886B (en) * 2020-08-28 2023-11-24 腾讯科技(深圳)有限公司 Sample data generation method and related device for information recommendation
CN112232395B (en) * 2020-10-08 2023-10-27 西北工业大学 Semi-supervised image classification method for generating countermeasure network based on joint training
CN113505664B (en) * 2021-06-28 2022-10-18 上海电力大学 Fault diagnosis method for planetary gear box of wind turbine generator
CN113951868B (en) * 2021-10-29 2024-04-09 北京富通东方科技有限公司 Method and device for detecting man-machine asynchronism of mechanical ventilation patient


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10607319B2 (en) * 2017-04-06 2020-03-31 Pixar Denoising monte carlo renderings using progressive neural networks
US20180314932A1 (en) * 2017-04-28 2018-11-01 Intel Corporation Graphics processing unit generative adversarial network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103390171A (en) * 2013-07-24 2013-11-13 南京大学 Safe semi-supervised learning method
CN103903164A (en) * 2014-03-25 2014-07-02 华南理工大学 Semi-supervised automatic aspect extraction method and system based on domain information
CN107392147A (en) * 2017-07-20 2017-11-24 北京工商大学 A kind of image sentence conversion method based on improved production confrontation network
CN107590262A (en) * 2017-09-21 2018-01-16 黄国华 The semi-supervised learning method of big data analysis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Network security situation awareness method based on multi-source and multi-level information fusion; Deng Xiaoheng et al.; Journal of Shanghai Jiao Tong University; 2015-08-31; Vol. 49, No. 8, pp. 1144-1152 *

Also Published As

Publication number Publication date
CN109977094A (en) 2019-07-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant