CN113806535A - Method and device for improving classification model performance by using unlabeled text data samples - Google Patents


Info

Publication number
CN113806535A
Authority
CN
China
Prior art keywords
model
classification
training
unlabeled
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111045781.2A
Other languages
Chinese (zh)
Inventor
唐杰
罗干
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202111045781.2A priority Critical patent/CN113806535A/en
Publication of CN113806535A publication Critical patent/CN113806535A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G06F16/355 - Class or cluster creation or modification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/216 - Parsing using statistical methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/047 - Probabilistic or stochastic networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a method for improving the performance of a classification model by using unlabeled text data samples, comprising the following steps: acquiring a labeled sample set, an unlabeled sample set, a classification category set of the classification task, a verification set, and a supervised learning model; initializing the parameters of the supervised learning model and determining a first perturbation probability and a second perturbation probability; and repeatedly training the supervised learning model with the labeled and unlabeled sample sets, evaluating the model's performance on the verification set after each round of training, recording the current model parameters whenever the evaluation metric improves, stopping when the repeated training reaches a preset condition, and outputting the finally trained model. With this scheme, random perturbation is added through the text characters and the model's randomized structure, so that unlabeled samples participate in training at the same time and the performance of the classification model is improved.

Description

Method and device for improving classification model performance by using unlabeled text data samples
Technical Field
The application relates to the technical field of classification models, in particular to a method and a device for improving the performance of a classification model by using unlabeled text data samples.
Background
Classification models are one of the two major classes of neural network models, the other being regression models. The task of a classification model is to analyze input data and judge, from several candidate targets, which type an input sample belongs to; a regression model, by contrast, requires the neural network to fit a target value. In practical projects, classification models are more common than regression models because their results are easier to evaluate and apply. Text data is a common input type for classification models in specific fields (such as data mining and natural language processing), with classification tasks including spam classification, text sentiment analysis, and entity classification. A classification model cannot do without the supervision information provided by a certain number of labeled samples, which lets the model learn the sample distribution corresponding to each candidate classification target; in applications, however, the acquisition cost of labeled samples is far higher than that of unlabeled samples. Academic datasets are clean: a classification task's dataset consists uniformly of labeled samples, or else unlabeled samples are specially prepared for exploring unsupervised learning methods. Business data is noisier, with plentiful unlabeled data but few labeled samples and a high cost of manual labeling; existing methods ignore the unlabeled samples, which is a waste of resources.
Existing model design methods for classification tasks often idealize the composition of the dataset. The academic convention is to perform supervised learning with labeled samples only; such a supervised classification model is easy to design, but sufficient labeled samples are required for training. To use unlabeled samples, the model designer must analyze the application scenario and carefully design a semi-supervised/unsupervised learning model, which differs greatly in method from a supervised learning model and may also require a very large number of unlabeled samples. With the development of pre-training technology, large-scale neural network models in the text, image, and other fields learn latent features from massive data; a BERT model in the text field, for example, can serve as part of a classification model and complete classification tasks through feature extraction or fine-tuning.
In the entity classification task, each entity sample has an entity name and a context description of the entity, and the type of the entity must be judged. The entity classification model proposed by Shimaoka et al. is a classic supervised learning classification model: it encodes the text data into a final vector representation through an attention mechanism and pooling operations, and then gives the classification prediction through a fully connected layer. Many similar supervised learning methods are easy to design, and a usable classification model is readily produced for a scenario's requirements; but if the dataset lacks enough labeled data, the ideal training effect cannot, in theory, be reached. The article by Ling et al. proposes an unsupervised (self-supervised) method following the idea of contrastive learning: the model learns vector representations of entities by distinguishing different entities, and these are then applied to downstream tasks such as entity classification. A completely new unsupervised method may display great ingenuity of design when the academic paper is published, yet a practical application may not have enough unlabeled data to meet its training requirements. The article by Mishra et al. shows a semi-supervised approach for named entity recognition: because named entities are contiguous text characters, CRF techniques can be used to constrain the classification output of the model. It can be seen that traditional semi-supervised learning methods must be combined with the characteristics of the classification task itself and are harder to apply generally.
Disclosure of Invention
The present application is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, a first objective of the present application is to provide a method for improving the performance of a classification model by using unlabeled text data samples. The method solves the technical problem that existing methods require sufficient labeled samples for training and neglect unlabeled samples, and likewise the problem that the semi-supervised/unsupervised learning models existing methods design around unlabeled samples may not have enough unlabeled data to meet their training requirements. It exploits the property that sample classification stays consistent: adding reasonably perturbed input does not change the result of sample classification. By improving the original supervised learning model so that unlabeled samples participate in training at the same time, the robustness of the model is improved. Randomly setting characters in a sample's text data to the number-0 character with small probability does not affect the expression of the text's meaning, so unlabeled samples are used effectively, and adding random perturbation through text characters and the model's randomized structure improves the performance of the classification model. Meanwhile, the present application requires neither a sufficient number of labeled samples nor of unlabeled samples, and is not limited to specific model structures or classification tasks.
A second object of the present application is to provide an apparatus for improving the performance of a classification model by using unlabeled text data samples.
In order to achieve the above object, an embodiment of the first aspect of the present application provides a method for improving classification model performance by using unlabeled text data samples, including: acquiring a labeled sample set, an unlabeled sample set, a classification category set of the classification task, a verification set, and a supervised learning model, wherein the classification category set of the classification task contains all possible types of samples, the classification model must predict the type of an input sample using this category set, and the supervised learning model is generated by training with labeled samples; initializing the parameters of the supervised learning model and determining a first perturbation probability and a second perturbation probability, wherein the first perturbation probability is the probability of randomly setting an input text character to the number-0 character, the number-0 character corresponds to a fixed, unchangeable all-zero variable, and the second perturbation probability is the probability used by the randomized layers in the model; and repeatedly training the supervised learning model with the labeled and unlabeled sample sets, evaluating the model's performance on the verification set after each round of training, recording the current model parameters whenever the evaluation metric improves, stopping when the repeated training reaches a preset condition, and outputting the finally trained model.
Optionally, in an embodiment of the present application, training the supervised learning model includes:
randomly sampling a batch of data from a labeled sample set and an unlabeled sample set respectively to serve as training data;
the loss function is calculated using the training data, and back-propagation updates the parameter gradients.
Optionally, in an embodiment of the present application, the loss function is expressed as:
$$L_{merge} = L_{labeled} + \lambda L_{unlabeled}$$

wherein $L_{labeled}$ denotes the loss function of the labeled sample set, $L_{unlabeled}$ denotes the loss function of the unlabeled sample set, and the parameter $\lambda$ controls the weight of the unlabeled-sample portion of the loss.
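In code, this merged objective is a direct weighted sum. A minimal PyTorch-style sketch (PyTorch is the toolkit named later in the description); the function and tensor names are illustrative, not the patent's reference implementation:

```python
import torch

def merge_losses(l_labeled: torch.Tensor,
                 l_unlabeled: torch.Tensor,
                 lam: float) -> torch.Tensor:
    # L_merge = L_labeled + lambda * L_unlabeled
    return l_labeled + lam * l_unlabeled
```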
Optionally, in an embodiment of the present application, the loss function of the labeled exemplar set is the same as the loss function in supervised learning, and the loss function of the unlabeled exemplar set is a distance between the first probability distribution and the second probability distribution.
Optionally, in an embodiment of the present application, the first probability distribution is a probability distribution given by the model when no disturbance is added to unlabeled samples in the fixed training data, and the second probability distribution is a probability distribution given by the model after disturbance is added to the unlabeled samples in the training data.
Optionally, in an embodiment of the present application, if the classification task is multi-label classification, the loss function of the labeled sample set is binary cross entropy, and the loss function of the unlabeled sample set is expressed as:
$$L_{unlabeled} = \frac{1}{|B_2|} \sum_{s \in B_2} I\left(p_{s,c} > 1-\beta \ \vee\ p_{s,c} < \beta\right) \cdot \left(p'_s - p_{s,\varepsilon}\right)^{2}$$

wherein $B_2$ is the set of unlabeled samples in the training data; $I(\cdot)$ is an indicator function returning, for each position of the vector in brackets, whether the condition holds; $p_{s,c}$ is the prediction for class $c$ at each position of the prediction vector; $\beta$ is a constant threshold for judging whether the prediction given by the model is reliable, through which the model discards unreliable unlabeled samples; the distance between the two probability distributions is computed as a vector dot product; $p_{s,\varepsilon}$ is the predicted probability of sample $s$ after the perturbation is added; and $p'_s$ is the target prediction probability of sample $s$ after polarization prediction.
When the multi-label classification model outputs a prediction, it first computes the raw prediction, maps it to the range 0 to 1 through a sigmoid function, and thereby obtains the probability that the sample has a given type. Polarization acts on the sigmoid function with which the model computes the target prediction probability. The sigmoid function is calculated as:

$$\operatorname{sigmoid}(x) = \frac{1}{1 + e^{-x/\tau}}$$

wherein $x$ denotes the raw prediction computed by the model and $0 < \tau \le 1$ is a temperature constant controlling the degree of polarization: the closer $\tau$ is to 0, the closer the sigmoid output is to 0 or 1 and the stronger the polarization.
Optionally, in an embodiment of the present application, if the classification task is a multi-class classification, the loss function of the labeled sample set is cross entropy, and the loss function of the unlabeled sample set is expressed as:
$$L_{unlabeled} = \frac{1}{|B_2|} \sum_{s \in B_2} I\left(\max(p_s) > \beta\right) \cdot d\left(p'_s,\ p_{s,\varepsilon}\right)$$

wherein $B_2$ is the set of unlabeled samples in the training data; $I(\cdot)$ is an indicator function returning whether the condition in brackets holds as a 0/1 value; $\beta$ is a constant threshold for judging whether the prediction given by the model is reliable, through which the model discards unreliable unlabeled samples; $p_{s,\varepsilon}$ is the predicted probability of sample $s$ after the perturbation is added; $\max(p_s)$ denotes the maximum element value of the prediction vector; $p'_s$ is the target prediction probability of sample $s$ after polarization prediction; and $d(\cdot,\cdot)$ is a distance between the two probability distributions.
The prediction vector output by a multi-class classification model is the probability distribution of the sample over the types. The raw prediction is computed first, and the predicted probability distribution is obtained through a softmax function; polarization acts on the softmax function with which the model computes the target prediction probability. The softmax function is calculated as:

$$\operatorname{softmax}(x_i) = \frac{e^{x_i/\tau}}{\sum_{j=1}^{k} e^{x_j/\tau}}$$

wherein $x_i$ denotes the model's raw prediction for type $i$ of the sample, $k$ denotes the number of possible classification types, and $0 < \tau \le 1$ is a temperature constant controlling the degree of polarization: the closer $\tau$ is to 0, the closer the softmax output is to 0 or 1 and the stronger the polarization.
Optionally, in an embodiment of the present application, the preset condition is: an early-stopping mechanism is adopted, with the performance of the supervised learning model evaluated on the verification set after each round of training; if the evaluation metric fails to improve over a set number of consecutive evaluations, the training is ended.
To achieve the above object, an embodiment of the second aspect of the present application proposes an apparatus for improving classification model performance by using unlabeled text data samples, including: an acquisition module, an initialization module, and a training module, wherein,
the acquisition module is used for acquiring a labeled sample set, an unlabeled sample set, a classification category set of the classification task, a verification set, and a supervised learning model, wherein the classification category set of the classification task contains all possible types of samples, the classification model must predict the type of an input sample using this category set, and the supervised learning model is generated by training with labeled samples;

the initialization module is used for initializing the parameters of the supervised learning model and determining a first perturbation probability and a second perturbation probability, wherein the first perturbation probability is the probability of randomly setting an input text character to the number-0 character, the number-0 character corresponds to a fixed, unchangeable all-zero variable, and the second perturbation probability is the probability used by the randomized layers in the model;

and the training module is used for repeatedly training the supervised learning model with the labeled and unlabeled sample sets, evaluating the model's performance on the verification set after each round of training, recording the current model parameters whenever the evaluation metric improves, stopping when the repeated training reaches the preset condition, and outputting the finally trained model.
The method and the device for improving the performance of a classification model by using unlabeled text data samples solve the technical problem that existing methods require sufficient labeled samples for training and neglect unlabeled samples, as well as the problem that the semi-supervised/unsupervised learning models existing methods design around unlabeled samples may not have enough unlabeled data to meet their training requirements. They exploit the property that sample classification remains consistent: adding reasonably perturbed input does not change the result of sample classification. The original supervised learning model is improved so that unlabeled samples participate in training at the same time, which improves the robustness of the model. Randomly setting characters in a sample's text data to the number-0 character with small probability essentially does not affect the expression of the text's meaning, so the unlabeled samples are used effectively; adding random perturbation through the text characters and the model's randomized structure improves the performance of the classification model. Meanwhile, the present application requires neither a sufficient number of labeled samples nor of unlabeled samples, and is not limited to specific model structures or classification tasks.
During training, the application also samples the unlabeled samples: characters in a sample's text data are randomly set to the number-0 character (corresponding to a fixed, unchangeable all-zero vector), and, combined with the randomization operations contained in the model (Dropout and the like), this yields the prediction probability distribution of the sample under noise perturbation. Since the nature of the sample is unchanged, the probability distribution under noise perturbation is required to be consistent with the normally output probability distribution.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a flowchart illustrating a method for improving classification model performance by using unlabeled text data samples according to an embodiment of the present disclosure;

FIG. 2 is an overall flowchart of a method for improving classification model performance by using unlabeled text data samples according to an embodiment of the present disclosure;

FIG. 3 is an illustration of the supervised learning model M in a method for improving classification model performance by using unlabeled text data samples according to an embodiment of the present application;

fig. 4 is a schematic structural diagram of an apparatus for improving classification model performance by using unlabeled text data samples according to a second embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.
The following describes the method and apparatus for improving classification model performance by using unlabeled text data samples according to embodiments of the present application with reference to the drawings.

Fig. 1 is a flowchart of a method for improving classification model performance by using unlabeled text data samples according to an embodiment of the present disclosure.

As shown in fig. 1, the method for improving classification model performance by using unlabeled text data samples includes the following steps:
101, acquiring a labeled sample set, an unlabeled sample set, a classification category set of the classification task, a verification set, and a supervised learning model, wherein the classification category set of the classification task contains all possible types of samples, the classification model must predict the type of an input sample using this category set, and the supervised learning model is generated by training with labeled samples;

102, initializing the parameters of the supervised learning model and determining a first perturbation probability and a second perturbation probability, wherein the first perturbation probability is the probability of randomly setting an input text character to the number-0 character, the number-0 character corresponds to a fixed, unchangeable all-zero variable, and the second perturbation probability is the probability used by the randomized layers in the model;

and 103, repeatedly training the supervised learning model with the labeled and unlabeled sample sets, evaluating the model's performance on the verification set after each round of training, recording the current model parameters whenever the evaluation metric improves, stopping when the repeated training reaches the preset condition, and outputting the finally trained model.
The method for improving the performance of the classification model by using unlabeled text data samples acquires a labeled sample set, an unlabeled sample set, a classification category set of the classification task, a verification set, and a supervised learning model, wherein the classification category set contains all possible types of samples, the classification model must predict the type of an input sample using this category set, and the supervised learning model is generated by training with labeled samples. It initializes the parameters of the supervised learning model and determines a first perturbation probability and a second perturbation probability, wherein the first perturbation probability is the probability of randomly setting an input text character to the number-0 character, the number-0 character corresponds to a fixed, unchangeable all-zero variable, and the second perturbation probability is the probability used by the randomized layers in the model. It then repeatedly trains the supervised learning model with the labeled and unlabeled sample sets, evaluates the model's performance on the verification set after each round of training, records the current model parameters whenever the evaluation metric improves, stops when the repeated training reaches the preset condition, and outputs the finally trained model. This solves the technical problem that existing methods require sufficient labeled samples for training and neglect unlabeled samples, and also the problem that semi-supervised/unsupervised models designed around unlabeled samples may not have enough unlabeled data to meet their training requirements. The property that sample classification remains consistent is exploited: reasonably perturbed input does not change the result of sample classification. The original supervised learning model is improved so that unlabeled samples participate in training at the same time, improving the robustness of the model; randomly setting characters in the text data to the number-0 character with small probability essentially does not affect the expression of the text's meaning, so unlabeled samples are used effectively, and random perturbation added through text characters and the model's randomized structure improves the performance of the classification model. Meanwhile, the application requires neither a sufficient number of labeled samples nor of unlabeled samples, and is not limited to specific model structures or classification tasks.
According to the method, the unlabeled samples are sampled simultaneously during training; the characters in a sample's text data are randomly set to the number-0 character and, combined with the randomization operations contained in the model (Dropout and the like), the prediction probability distribution of the sample under noise perturbation is obtained. Because the nature of the sample is unchanged, the probability distribution under noise perturbation is required to be consistent with the normally output probability distribution.
Classification tasks can be divided, according to the number of positive classification categories a sample may have, into multi-label classification and multi-class classification: in multi-label classification a sample can have an indefinite number of positive categories, while multi-class classification has exactly one positive category. Neural networks mainly use binary cross-entropy as the loss function for multi-label classification and cross-entropy as the loss function for multi-class classification.
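In PyTorch terms, these two supervised loss choices map directly onto standard library calls; a sketch assuming raw (pre-activation) model outputs:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)  # batch of 4 samples, k = 10 candidate classes

# Multi-label: an indefinite number of positive classes per sample (0/1 vector)
multilabel_targets = torch.randint(0, 2, (4, 10)).float()
bce = F.binary_cross_entropy_with_logits(logits, multilabel_targets)

# Multi-class: exactly one positive class per sample (a class index)
multiclass_targets = torch.randint(0, 10, (4,))
ce = F.cross_entropy(logits, multiclass_targets)
```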
The problem to be solved can be formally defined as follows. Provided are a labeled sample set $S_{labeled}$, an unlabeled sample set $S_{unlabeled}$, and a target classification category set $C = \{c_1, \ldots, c_k\}$ for the classification task, where the input data contained in the samples is mainly text: a sample $s$ has a text set $T_s = \{t_1, \ldots, t_m\}$. For a labeled sample $s \in S_{labeled}$, a correct classification category vector $y_s$ is given, each position of which indicates by 0/1 whether the sample has the corresponding target category (multi-label classification may have more than one 1; multi-class classification has one and only one 1). Also provided is a model M that performs supervised learning using only the labeled samples: it receives the various data of a sample $s$ as input and, after passing through the layers of the model, outputs $p_s$, the model's predicted probability vector over the classification categories. Training of the model is performed by optimizing a loss function of $y_s$ and $p_s$; the model is evaluated on a test set of labeled samples by generating sample predictions from $p_s$ and then computing indices such as F1-score.
Taking the entity classification task as an example, the labeled sample set has $n_{labeled}$ samples in total, the unlabeled sample set has $n_{unlabeled}$ samples in total, and the sample entities have $k$ classes (i.e., $y_s$ and $p_s$ both have prediction-vector dimension $k$).
Current tools for implementing neural networks in code, such as PyTorch and TensorFlow, automatically initialize the parameters of a model according to certain rules when the network is created. Because the structure of the model is not changed, parameter initialization of the supervised learning neural network model M is consistent with performing only supervised learning on the current classification task: the model can be initialized automatically by the tool, and some parameters can be initialized from other probability distributions according to the characteristics of the classification task. The perturbations added by this method act respectively on the text data of a sample and on the randomized structures of the model (Dropout layers and the like). For the text data, input text characters are randomly set to the number-0 character; the perturbation probability $p_1$ is the probability that a character is set to the number-0 character, and $p_1$ takes a small value (e.g., 0.1) to avoid unduly damaging the text semantics. For a classification task, a few key characters in the text data can be decisive: for example, in a text fragment of the "fish-flavored shredded pork" sample, four key characters directly indicate that the sample entity is a Sichuan dish; with $p_1 = 0.1$, the probability that all four characters are replaced simultaneously is one in ten thousand, so the probability of excessive semantic damage is theoretically low. The perturbation probability $p_2$ applies to randomized layers such as the Dropout layers in the model, and a different probability may be set for each randomized layer; since supervised learning also uses these randomized layers, the value of $p_2$ is chosen to improve supervised learning performance, generally $p_2 \le 0.5$ (e.g., 0.2).
The supervised learning model normally samples a small batch of examples at each training step, computes the loss function by forward propagation, and then back-propagates to update the gradients and obtain new parameter weights. Current neural network optimizers such as Adam record historical gradient values of the parameters, which cumulatively affect the gradient update of the current step, so a fixed sample configuration should be used at each optimization step. The unlabeled samples of the present application also participate in training, so each step samples a small labeled batch $B_1$ from the labeled sample set $S_{labeled}$ and, at the same time, a small unlabeled batch $B_2$ from the unlabeled sample set $S_{unlabeled}$. In the code implementation, because $B_1$ and $B_2$ are drawn cyclically by sampling without replacement and the two sets are exhausted at different rates per cycle, separate data loaders must be used for each. Since the random perturbation introduces extra randomization noise into the gradient computation, $B_2$ should be larger than $B_1$, e.g., $|B_2| = 2|B_1|$, to reduce the influence of extreme cases.
The inputs of $B_1$ and $B_2$ are preprocessed into a form the model accepts: the text is converted to character indices to obtain the corresponding trainable word vectors, and the number-0 character, as a special character, corresponds to a fixed, unchangeable all-zero vector. The text of samples in $B_2$ is kept unchanged when no perturbation is added; when perturbation is added, each text character is randomly set to the number-0 character with probability $p_1$.
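A sketch of this character-level perturbation, assuming the texts have already been converted to padded index tensors; reserving index 0 via `padding_idx=0` gives the number-0 character an embedding that is a fixed all-zero vector receiving no gradient updates:

```python
import torch
import torch.nn as nn

def perturb_chars(char_ids: torch.Tensor, p1: float) -> torch.Tensor:
    """Set each character index to the number-0 character with probability p1."""
    mask = torch.rand(char_ids.shape, device=char_ids.device) < p1
    return char_ids.masked_fill(mask, 0)

vocab_size, dim = 5000, 128                           # illustrative sizes
embed = nn.Embedding(vocab_size, dim, padding_idx=0)  # index 0 -> all-zero vector

char_ids = torch.randint(1, vocab_size, (2, 32))  # two texts of 32 characters
vectors = embed(perturb_chars(char_ids, p1=0.1))  # p1 kept small, e.g. 0.1
```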
Further, in the embodiment of the present application, the training of the supervised learning model includes the following steps:
randomly sampling a batch of data from a labeled sample set and an unlabeled sample set respectively to serve as training data;
the loss function is calculated using the training data, and back-propagation updates the parameter gradients.
The overall procedure: initialize the parameters of the supervised learning neural network model M with the default initialization (or, if desired, with other probability distributions) and determine the values of the perturbation probabilities $p_1$ and $p_2$; sample small labeled and unlabeled batches $B_1$ and $B_2$; calculate the loss function and back-propagate to update the parameter gradients; judge whether the number of training rounds is sufficient; and keep the model parameters with the best evaluation index.
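One training step of this procedure might look as follows; a sketch under assumptions: `model` stands in for the supervised model M, `perturb_chars` is the helper sketched earlier, the sigmoid corresponds to the multi-label case, and the two loaders cycle independently as required above:

```python
import itertools
import torch

def train_step(model, optimizer, labeled_iter, unlabeled_iter,
               p1, lam, supervised_loss, consistency_loss):
    x1, y1 = next(labeled_iter)   # labeled mini-batch B1
    x2 = next(unlabeled_iter)     # unlabeled mini-batch B2, e.g. |B2| = 2|B1|

    model.eval()                  # target distribution: no perturbation,
    with torch.no_grad():         # randomized layers (Dropout) disabled
        p_target = torch.sigmoid(model(x2))
    model.train()

    l_labeled = supervised_loss(model(x1), y1)
    p_perturbed = torch.sigmoid(model(perturb_chars(x2, p1)))
    l_unlabeled = consistency_loss(p_perturbed, p_target)

    loss = l_labeled + lam * l_unlabeled  # L_merge
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Separate, independently cycling data loaders, as required above:
# labeled_iter = itertools.cycle(labeled_loader)      # given two DataLoaders
# unlabeled_iter = itertools.cycle(unlabeled_loader)
```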
Further, in the embodiments of the present application, the loss function is expressed as:
$$L_{merge} = L_{labeled} + \lambda L_{unlabeled}$$

wherein $L_{labeled}$ denotes the loss function of the labeled sample set, $L_{unlabeled}$ denotes the loss function of the unlabeled sample set, and the parameter $\lambda$ controls the weight of the unlabeled-sample portion of the loss.
The main aim of the method is to use the unlabeled samples during training, and because of the accumulation of historical parameter gradients, the way the loss function is calculated at each step should be fixed. Each training step of a neural network generally calculates a loss function and then back-propagates to update the parameter gradients. The present application uses unlabeled samples at the same time, so the loss function combines two different loss functions obtained from the labeled batch $B_1$ and the unlabeled batch $B_2$.
Further, in the embodiment of the present application, the loss function of the labeled exemplar set is the same as the loss function in supervised learning, and the loss function of the unlabeled exemplar set is the distance between the first probability distribution and the second probability distribution.
The calculation of the loss function divides into the multi-label and multi-class cases, and the unlabeled-sample portion additionally uses three techniques, confidence-based masking, polarization prediction, and warm-up training (warmup), to further improve the training effect.
Confidence-based masking: unreliable unlabeled samples are discarded using $\beta$, which is adjusted according to data quality and may take 0.3. The output of the model is the prediction probability vector $p_s$ of a sample on the classification task; if the model is not certain about the classification of a sample (the corresponding probability distribution has high entropy, with the values at all positions close to each other), the unperturbed target prediction probability is unreliable. Polarization prediction: the values of $p_s$ are polarized via $\tau$ in the softmax or sigmoid function used to compute $p_s$, pushing them closer to 0/1; this acts like the soft targets in knowledge distillation. Warm-up training: the parameter $\lambda$ controls the weight of the loss function of the unlabeled portion, starting from 0 and becoming 1 after a certain number of steps (e.g., 1000 steps), since at the beginning the unperturbed target probability distributions of the unlabeled samples are themselves unreliable.
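Minimal sketches of the three techniques; the per-class masking condition and the warm-up length are assumptions consistent with the description rather than values fixed by it:

```python
import torch

def confidence_mask(p_target: torch.Tensor, beta: float = 0.3) -> torch.Tensor:
    """Confidence-based masking: 0/1 indicator of sufficiently confident targets."""
    return ((p_target > 1 - beta) | (p_target < beta)).float()

def polarize(p_target: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Polarization (multi-label case): re-apply sigmoid at temperature tau."""
    return torch.sigmoid(torch.logit(p_target, eps=1e-6) / tau)

def lam_schedule(step: int, warmup_steps: int = 1000) -> float:
    """Warm-up: the unlabeled loss weight lambda starts at 0, then becomes 1."""
    return 1.0 if step >= warmup_steps else 0.0
```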
Further, in this embodiment of the application, the first probability distribution is a probability distribution given by the model when no disturbance is added to the unlabeled samples in the fixed training data, and the second probability distribution is a probability distribution given by the model after disturbance is added to the unlabeled samples in the training data.
The loss function of the unlabeled samples is the distance between two probability distributions. The first is the probability distribution given by the model for a fixed batch $B_2$ with no perturbation added (the model is placed in eval state so that randomization operations such as Dropout layers do not take effect); the second is the probability distribution given after perturbation is added to the samples, which should be as close as possible to the unperturbed target probability distribution. The distance between the two distributions can be calculated with measurement functions such as mean squared error or KL divergence.
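Either distance is a one-line call in PyTorch; a sketch, where the KL variant follows `F.kl_div`'s convention that the first argument holds log-probabilities:

```python
import torch
import torch.nn.functional as F

def distribution_distance(p_perturbed: torch.Tensor,
                          p_target: torch.Tensor,
                          metric: str = "mse") -> torch.Tensor:
    """Distance between the perturbed and the unperturbed (target) distributions."""
    if metric == "mse":
        return F.mse_loss(p_perturbed, p_target)
    # KL(target || perturbed); clamp keeps log() finite for near-zero probabilities
    return F.kl_div(p_perturbed.clamp_min(1e-8).log(), p_target,
                    reduction="batchmean")
```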
Further, in this embodiment of the present application, if the classification task is multi-label classification, the loss function of the labeled sample set is binary cross entropy, and the loss function of the unlabeled sample set is expressed as:
$$L_{unlabeled} = \frac{1}{|B_2|} \sum_{s \in B_2} I\left(p_{s,c} > 1-\beta \ \vee\ p_{s,c} < \beta\right) \cdot \left(p'_s - p_{s,\varepsilon}\right)^{2}$$

wherein $B_2$ is the set of unlabeled samples in the training data; $I(\cdot)$ is an indicator function returning, for each position of the vector in brackets, whether the condition holds; $p_{s,c}$ is the prediction for class $c$ at each position of the prediction vector; $\beta$ is a constant threshold for judging whether the prediction given by the model is reliable, through which the model discards unreliable unlabeled samples; the distance between the two probability distributions is computed as a vector dot product; $p_{s,\varepsilon}$ is the predicted probability of sample $s$ after the perturbation is added; and $p'_s$ is the target prediction probability of sample $s$ after polarization prediction.
When the multi-label classification model outputs a prediction, it first computes the raw prediction, maps it to the range 0 to 1 through a sigmoid function, and thereby obtains the probability that the sample has a given type. Polarization acts on the sigmoid function with which the model computes the target prediction probability. The sigmoid function is calculated as:

$$\operatorname{sigmoid}(x) = \frac{1}{1 + e^{-x/\tau}}$$

wherein $x$ denotes the raw prediction computed by the model and $0 < \tau \le 1$ is a temperature constant controlling the degree of polarization: the closer $\tau$ is to 0, the closer the sigmoid output is to 0 or 1 and the stronger the polarization.
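A sketch of this multi-label consistency loss. The masking inequality is an assumption consistent with the symbol definitions above, and the polarization here re-derives logits from the stored probabilities, whereas the description applies τ inside the model's own sigmoid:

```python
import torch

def multilabel_unlabeled_loss(p_perturbed: torch.Tensor,  # p_{s,eps}, shape (|B2|, k)
                              p_target: torch.Tensor,     # unperturbed probabilities
                              beta: float = 0.3,
                              tau: float = 0.5) -> torch.Tensor:
    # I(.): per-class 0/1 indicator of reliable target predictions
    mask = ((p_target > 1 - beta) | (p_target < beta)).float()
    # p'_s: polarized target via a temperature-tau sigmoid
    p_polar = torch.sigmoid(torch.logit(p_target, eps=1e-6) / tau)
    # dot product of the indicator with the per-class squared differences
    per_sample = (mask * (p_polar - p_perturbed) ** 2).sum(dim=-1)
    return per_sample.mean()
```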
Further, in this embodiment of the present application, if the classification task is a multi-class classification, the loss function of the labeled sample set is cross entropy, and the loss function of the unlabeled sample set is expressed as:
$$L_{unlabeled} = \frac{1}{|B_2|} \sum_{s \in B_2} I\left(\max(p_s) > \beta\right) \cdot d\left(p'_s,\ p_{s,\varepsilon}\right)$$

wherein $B_2$ is the set of unlabeled samples in the training data; $I(\cdot)$ is an indicator function returning whether the condition in brackets holds as a 0/1 value; $\beta$ is a constant threshold for judging whether the prediction given by the model is reliable, through which the model discards unreliable unlabeled samples; $p_{s,\varepsilon}$ is the predicted probability of sample $s$ after the perturbation is added; $\max(p_s)$ denotes the maximum element value of the prediction vector; $p'_s$ is the target prediction probability of sample $s$ after polarization prediction; and $d(\cdot,\cdot)$ is a distance between the two probability distributions.
The prediction vector output by a multi-class classification model is the probability distribution of the sample over the types. The raw prediction is computed first, and the predicted probability distribution is obtained through a softmax function; polarization acts on the softmax function with which the model computes the target prediction probability. The softmax function is calculated as:

$$\operatorname{softmax}(x_i) = \frac{e^{x_i/\tau}}{\sum_{j=1}^{k} e^{x_j/\tau}}$$

wherein $x_i$ denotes the model's raw prediction for type $i$ of the sample, $k$ denotes the number of possible classification types, and $0 < \tau \le 1$ is a temperature constant controlling the degree of polarization: the closer $\tau$ is to 0, the closer the softmax output is to 0 or 1 and the stronger the polarization.
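The multi-class counterpart, sketched from raw model outputs; choosing mean squared error for $d(\cdot,\cdot)$ follows the distance options named earlier:

```python
import torch
import torch.nn.functional as F

def multiclass_unlabeled_loss(logits_perturbed: torch.Tensor,  # shape (|B2|, k)
                              logits_target: torch.Tensor,     # same shape
                              beta: float = 0.3,
                              tau: float = 0.5) -> torch.Tensor:
    p_target = F.softmax(logits_target, dim=-1)
    # I(max(p_s) > beta): keep samples whose top class is confident enough
    mask = (p_target.max(dim=-1).values > beta).float()
    # p'_s: polarized target via a temperature-tau softmax
    p_polar = F.softmax(logits_target / tau, dim=-1)
    p_perturbed = F.softmax(logits_perturbed, dim=-1)
    dist = ((p_polar - p_perturbed) ** 2).sum(dim=-1)  # d(p'_s, p_{s,eps})
    return (mask * dist).mean()
```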
Further, in the embodiment of the present application, the preset condition is: an early-stopping mechanism is adopted, with the performance of the supervised learning model evaluated on the verification set after each round of training; if the evaluation metric fails to improve over a set number of consecutive evaluations, the training is ended.
Training a neural network requires sampling many steps, computing back-propagation gradients, and updating parameters until the network finally converges to a local optimum or overfits; therefore, to judge whether the number of training rounds is sufficient, a large maximum number of training steps (e.g., 10000) should be set, or the maximum left unlimited. At the same time, an early-stopping mechanism (early stop) is adopted: the performance of the model is evaluated on the verification set every few steps, and training ends if the evaluation metric fails to improve $t$ consecutive times, where $t$ is a small value (e.g., $t = 8$). This avoids both wasting training time on extra steps after overfitting and stopping with too few steps before reaching a good local optimum.
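An early-stopping loop of this shape, sketched with assumed `train_step` and `evaluate` callables:

```python
import copy

def train_with_early_stop(model, train_step, evaluate,
                          t: int = 8, max_steps: int = 10000,
                          eval_every: int = 100):
    """Stop once the validation metric fails to improve t consecutive times."""
    best_metric, best_state, misses = float("-inf"), None, 0
    for step in range(max_steps):
        train_step(step)
        if (step + 1) % eval_every:
            continue
        metric = evaluate(model)
        if metric > best_metric:           # record the improved parameters
            best_metric = metric
            best_state = copy.deepcopy(model.state_dict())
            misses = 0
        else:
            misses += 1
            if misses >= t:                # t consecutive non-improvements
                break
    if best_state is not None:             # restore the best checkpoint
        model.load_state_dict(best_state)
    return model
```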
The model evaluates its performance on the verification set many times during training; whenever the evaluation metric improves, the current model parameters are recorded. After the final training ends, the optimal parameters of the model are those stored at the evaluation with the best metric, and the finally trained model is output.
The small unlabeled batch $B_2$ additionally sampled during training is a constant multiple of the labeled batch $B_1$ in scale, and $B_2$ only requires additionally computing the target probability vector without perturbation, a forward propagation whose computation time is smaller than that of back-propagating to update gradients. Therefore, training with unlabeled data samples has the same time complexity as the original supervised learning. On several entity classification task datasets, a certain number of labeled samples were randomly retained and the remaining labeled samples were treated as unlabeled. Several supervised learning neural network models were then designed, and supervised learning using only the labeled samples was compared with additionally letting the unlabeled samples participate in training; the performance is found to be better when the unlabeled samples additionally participate in training.
On the premise of neither changing nor restricting the structure of the model M, the method uses the property that sample classification remains consistent: reasonably perturbed input data and the model's own randomized structure do not change the result of sample classification, and distribution fitting is performed with the unperturbed probability vector as the target, so that the supervised learning model can additionally use unlabeled data to improve its evaluation metrics. The perturbation of the input data acts on the text and is accomplished by randomly setting text characters to the number-0 character with small probability; the randomized structure of the model itself is primarily the Dropout layer, which randomly sets input neurons to 0 with a certain probability.
The method can be applied to entity classification tasks. Entities are things that exist objectively and are distinguishable from one another; entity classification is a basic task in the field of artificial intelligence, and determining the classification types of entities can indirectly help text classification, knowledge graph construction, and the like. When an entity classification task is solved in practice, unlike academic research with labeled datasets, a large number of entities have no known classification type, yet related information about the entities can be acquired from various internal and external data sources. This is a typical situation in applied artificial intelligence: related information about samples is relatively plentiful, while labeled samples are hard to obtain. The present application can effectively use the unlabeled samples, alleviating the problem of insufficient data and improving the performance of the classification model.
Fig. 2 is an overall flowchart of a method for improving classification model performance by using unlabeled text data samples according to an embodiment of the present application.
As shown in fig. 2, the method for improving the performance of the classification model by using unlabeled text data samples proceeds as follows: input the labeled sample set $S_{labeled}$, the unlabeled sample set $S_{unlabeled}$, the classification category set $C = \{c_1, \ldots, c_k\}$, and the supervised learning model M; initialize the parameters of the supervised learning neural network model M and determine the values of the perturbation probabilities $p_1$ and $p_2$; sample small labeled and unlabeled batches $B_1$ and $B_2$; calculate the loss function and back-propagate to update the parameter gradients; judge whether the number of training rounds is sufficient, and if not, return to the step of sampling $B_1$ and $B_2$; if so, keep the model parameters with the best evaluation index and output the trained model.
Fig. 3 is an illustration of a supervised learning model M of a method for improving a classification model performance by using unlabeled text data samples according to an embodiment of the present application.
As shown in fig. 3, in the supervised learning model M, the linear layers have many parameters and degrade easily, so a Dropout layer is attached after each of them where possible to improve the generalization capability of the model. Taking the entity classification task as an example, the information an entity may possess includes the entity name, several related text fragments of the entity, and the like. A typical supervised learning model for this task obtains the encoded vector representation of the entity through operations such as text encoding and pooling, and produces the model's predicted classification result through vector concatenation and an MLP layer.
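A sketch in the spirit of Fig. 3, purely illustrative (the encoder choice, sizes, and pooling are assumptions, since the application is not limited to a specific model structure):

```python
import torch
import torch.nn as nn

class EntityClassifier(nn.Module):
    """Encode name and context, pool, concatenate, classify through an MLP,
    with a Dropout layer after the parameter-heavy linear layer."""
    def __init__(self, vocab_size=5000, dim=128, k=10, p2=0.2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.encoder = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.mlp = nn.Sequential(
            nn.Linear(4 * dim, dim), nn.ReLU(),
            nn.Dropout(p2),                 # p2: randomized-layer probability
            nn.Linear(dim, k),
        )

    def forward(self, name_ids, context_ids):
        name, _ = self.encoder(self.embed(name_ids))
        ctx, _ = self.encoder(self.embed(context_ids))
        pooled = torch.cat([name.mean(dim=1), ctx.mean(dim=1)], dim=-1)
        return self.mlp(pooled)             # raw prediction (logits)
```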
Fig. 4 is a schematic structural diagram of an apparatus for improving classification model performance by using unlabeled text data samples according to a second embodiment of the present disclosure.

As shown in fig. 4, the apparatus for improving classification model performance by using unlabeled text data samples includes: an acquisition module 10, an initialization module 20, and a training module 30, wherein,
the acquisition module 10 is configured to acquire a labeled sample set, an unlabeled sample set, a classification category set of the classification task, a verification set, and a supervised learning model, wherein the classification category set of the classification task contains all possible types of samples, the classification model must predict the type of an input sample using this category set, and the supervised learning model is generated by training with labeled samples;

the initialization module 20 is configured to initialize the parameters of the supervised learning model and determine a first perturbation probability and a second perturbation probability, wherein the first perturbation probability is the probability of randomly setting an input text character to the number-0 character, the number-0 character corresponds to a fixed, unchangeable all-zero variable, and the second perturbation probability is the probability used by the randomized layers in the model;

the training module 30 is configured to repeatedly train the supervised learning model with the labeled and unlabeled sample sets, evaluate the model's performance on the verification set after each round of training, record the current model parameters whenever the evaluation metric improves, stop the training when the repeated training reaches the preset condition, and output the finally trained model.
The device for improving the performance of a classification model by using unlabeled text data samples in the embodiment of the application comprises an acquisition module, an initialization module, and a training module. The acquisition module acquires a labeled sample set, an unlabeled sample set, a classification category set of the classification task, a verification set, and a supervised learning model, wherein the classification category set contains all possible types of samples, the classification model must predict the type of an input sample using this category set, and the supervised learning model is generated by training with labeled samples. The initialization module initializes the parameters of the supervised learning model and determines a first perturbation probability and a second perturbation probability, wherein the first perturbation probability is the probability of randomly setting an input text character to the number-0 character, the number-0 character corresponds to a fixed, unchangeable all-zero variable, and the second perturbation probability is the probability used by the randomized layers in the model. The training module repeatedly trains the supervised learning model with the labeled and unlabeled sample sets, evaluates its performance on the verification set after each round of training, records the current model parameters whenever the evaluation metric improves, stops when the repeated training reaches the preset condition, and outputs the finally trained model. This solves the technical problem that existing methods require sufficient labeled samples for training and neglect unlabeled samples, as well as the problem that semi-supervised/unsupervised models designed around unlabeled samples may not have enough unlabeled data to meet their training requirements; it exploits the consistency of sample classification under reasonable perturbation, lets unlabeled samples participate in training to improve model robustness, effectively uses unlabeled samples by setting text characters to the number-0 character with small probability without essentially affecting the expression of text meaning, improves classification performance by adding random perturbation through text characters and the model's randomized structure, requires neither abundant labeled nor unlabeled samples, and is not limited to specific model structures or classification tasks.
During training, the application also samples the unlabeled samples: characters in a sample's text data are randomly set to the number-0 character (corresponding to a fixed, unchangeable all-zero vector), and, combined with the randomization operations contained in the model (Dropout and the like), this yields the prediction probability distribution of the sample under noise perturbation. Since the nature of the sample is unchanged, the probability distribution under noise perturbation is required to be consistent with the normally output probability distribution.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (9)

1. A method for improving the performance of a classification model by using unlabeled text data samples is characterized by comprising the following steps:
acquiring a labeled sample set, an unlabeled sample set, a classification type set of a classification task, a verification set and a supervised learning model, wherein the classification type set of the classification task comprises all possible types of samples, the classification model needs to use the classification type set of the classification task to predict the type of an input sample, and the supervised learning model is generated by training with the labeled samples;
initializing the parameters of the supervised learning model, and determining a first perturbation probability and a second perturbation probability, wherein the first perturbation probability is the probability of randomly setting an input text character to the number-0 character, the number-0 character represents a corresponding fixed and unchanging all-0 vector, and the second perturbation probability represents the probability applied by the randomized layers in the model;
and repeatedly training the supervised learning model by using the labeled sample set and the unlabeled sample set, evaluating the performance of the supervised learning model on the verification set after each round of training, recording the current model parameters whenever the evaluation metric improves, stopping the training when a preset condition is reached, and outputting the finally trained model.
2. The method of boosting classification model performance with unlabeled text data samples according to claim 1, wherein training the supervised learning model comprises the steps of:
randomly sampling a batch of data from the labeled sample set and the unlabeled sample set respectively to serve as training data;
and calculating a loss function by using the training data, and back-propagating to update the parameter gradients.
3. The method of boosting classification model performance with unlabeled text data samples of claim 2, wherein the loss function is expressed as:
L_merge = L_labeled + λ · L_unlabeled

wherein L_labeled represents the loss function of the labeled sample set, L_unlabeled represents the loss function of the unlabeled sample set, and the parameter λ controls the weight of the unlabeled-sample portion of the loss.
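For illustration, a short sketch of one training iteration under claims 2 and 3 follows; it assumes the training_step function from the earlier sketch, shuffling PyTorch data loaders, and a standard optimizer, all of which are assumptions rather than elements of the claims.

```python
def train_one_iteration(model, optimizer, labeled_loader, unlabeled_loader, lam=1.0):
    # Claim 2: randomly sample one batch from each set as the training data
    # (the loaders are assumed to shuffle, so each draw is a random batch).
    x, y = next(iter(labeled_loader))
    u = next(iter(unlabeled_loader))
    # Claim 3: L_merge = L_labeled + lambda * L_unlabeled, computed by the
    # training_step sketch above.
    loss = training_step(model, (x, y), u, lam=lam)
    optimizer.zero_grad()
    loss.backward()      # back-propagate and update the parameter gradients
    optimizer.step()
    return loss.item()
```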
4. The method of claim 3, wherein the loss function of the labeled sample set is the same as the loss function in supervised learning, and the loss function of the unlabeled sample set is the distance between a first probability distribution and a second probability distribution.
5. The method of claim 4, wherein the first probability distribution is the probability distribution given by the model when no perturbation is added to the unlabeled samples in the training data, and the second probability distribution is the probability distribution given by the model when the perturbation is added to the unlabeled samples in the training data.
6. The method as claimed in claim 3, wherein if the classification task is multi-label classification, the loss function of the labeled sample set is binary cross entropy, and the loss function of the unlabeled sample set is expressed as:
L_unlabeled = (1/|B_2|) · Σ_{s∈B_2} I(max(p_{s,c}, 1 − p_{s,c}) > β) · d(p'_s, p_{s,ε})
wherein B_2 is the set of unlabeled samples in the training data; I(·) is an indicator function that returns, for each position of the vector in brackets, whether the condition is true; p_{s,c} is the prediction-vector entry for the class c corresponding to each position; β is a constant threshold used to judge whether the prediction given by the model is reliable, and the model discards unreliable unlabeled samples through β; · is a vector dot product; d(·,·) computes the distance of two probability distributions; p_{s,ε} is the predicted probability of sample s after the perturbation is added; and p'_s is the target prediction probability of sample s after polarization of the prediction,
when the multi-label classification model outputs a prediction result, it first computes the raw prediction result, maps it to the range 0 to 1 through a sigmoid function, and thereby obtains the probability that the sample has each specific type; polarization acts on the sigmoid function the model uses to compute the target prediction probability, and the sigmoid function is computed as:
sigmoid(x) = 1 / (1 + e^(−x/τ))
wherein x represents the raw prediction result computed by the model, and 0 < τ ≤ 1 is a temperature constant that controls the degree of polarization: the closer τ is to 0, the closer the sigmoid output value is to 0 or 1 and the higher the degree of polarization.
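A minimal sketch of the multi-label case of claim 6 follows. The per-class confidence mask (keeping positions whose prediction lies farther than β from 0.5) and the squared distance between the polarized target and the perturbed prediction are assumptions made to render the claim concrete; the temperature and threshold values are illustrative.

```python
import torch

def polarized_sigmoid(x, tau=0.4):
    # sigmoid(x) = 1 / (1 + exp(-x / tau)); 0 < tau <= 1 controls polarization:
    # the closer tau is to 0, the more outputs are pushed toward 0 or 1.
    return torch.sigmoid(x / tau)

def multilabel_unlabeled_loss(logits_clean, logits_noisy, beta=0.9, tau=0.4):
    p_clean = torch.sigmoid(logits_clean.detach())             # per-class probabilities
    p_target = polarized_sigmoid(logits_clean.detach(), tau)   # p'_s after polarization
    p_noisy = torch.sigmoid(logits_noisy)                      # p_{s,eps}, perturbed pass
    # Indicator per vector position: keep only classes whose prediction is
    # confidently near 0 or 1, as thresholded by beta.
    mask = (torch.maximum(p_clean, 1.0 - p_clean) > beta).float()
    # Per-class squared distance between target and perturbed prediction
    # (the concrete distance is an assumption of this sketch).
    dist = (p_target - p_noisy) ** 2
    return (mask * dist).sum() / mask.sum().clamp(min=1.0)
```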
7. The method of claim 3, wherein if the classification task is multi-class classification, the loss function for the set of labeled samples is cross entropy and the loss function for the set of unlabeled samples is expressed as:
L_unlabeled = (1/|B_2|) · Σ_{s∈B_2} I(max(p_s) > β) · d(p'_s, p_{s,ε})
wherein B_2 is the set of unlabeled samples in the training data; I(·) is an indicator function that returns whether the condition in brackets is true; β is a constant threshold used to judge whether the prediction result given by the model is reliable, and the model discards unreliable unlabeled samples through β; p_{s,ε} is the predicted probability of sample s after the perturbation is added; max(p_s) is the maximum element value of the prediction vector; d(·,·) computes the distance of two probability distributions; and p'_s is the target prediction probability of sample s after polarization of the prediction,
the prediction result vector output by the multi-class classification model is the probability distribution of the sample over each type; the raw prediction result is computed first and the predicted probability distribution is obtained through a softmax function; polarization acts on the softmax function the model uses to compute the target prediction probability, and the softmax function is computed as:
softmax(x_i) = e^(x_i/τ) / Σ_{j=1..k} e^(x_j/τ)
wherein x_i represents the raw prediction of the model for the sample on type i, k represents the number of possible classification types of the sample, and 0 < τ ≤ 1 is a temperature constant that controls the degree of polarization: the closer τ is to 0, the closer the softmax output values are to 0 or 1 and the higher the degree of polarization.
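Analogously, a sketch of the multi-class case of claim 7, with the temperature-sharpened softmax as the polarization step. The sample-level mask on max(p_s) follows the claim; the cross-entropy form of the distance and the constants are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def sharpened_softmax(logits, tau=0.4):
    # softmax(x_i) = exp(x_i / tau) / sum_j exp(x_j / tau), with 0 < tau <= 1;
    # smaller tau pushes the distribution toward one-hot (higher polarization).
    return F.softmax(logits / tau, dim=-1)

def multiclass_unlabeled_loss(logits_clean, logits_noisy, beta=0.8, tau=0.4):
    p_clean = F.softmax(logits_clean.detach(), dim=-1)
    p_target = sharpened_softmax(logits_clean.detach(), tau)   # p'_s after polarization
    log_p_noisy = F.log_softmax(logits_noisy, dim=-1)          # log p_{s,eps}
    # Indicator: keep a sample only if max(p_s) exceeds the threshold beta;
    # unreliable unlabeled samples are discarded.
    mask = (p_clean.max(dim=-1).values > beta).float()
    # Cross-entropy between the polarized target and the perturbed prediction
    # (the concrete distance is an assumption of this sketch).
    dist = -(p_target * log_p_noisy).sum(dim=-1)
    return (mask * dist).sum() / mask.sum().clamp(min=1.0)
```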
8. The method of claim 1, wherein the preset condition is: an early-stopping mechanism is adopted, the performance of the supervised learning model is evaluated on the verification set after each round of training, and the training ends if the evaluation metric fails to improve over a set number of consecutive evaluations.
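A compact sketch of such an early-stopping loop follows; the patience value and the train_one_round/evaluate callables are placeholders assumed for illustration.

```python
import copy

def train_with_early_stopping(model, train_one_round, evaluate, patience=5):
    # Early-stopping mechanism: evaluate on the verification set after each
    # round; stop once the metric has failed to improve for `patience`
    # consecutive evaluations, then restore the best recorded parameters.
    best_score, best_state, stale = float("-inf"), None, 0
    while stale < patience:
        train_one_round(model)
        score = evaluate(model)          # evaluation metric on the verification set
        if score > best_score:
            best_score = score
            best_state = copy.deepcopy(model.state_dict())   # record current parameters
            stale = 0
        else:
            stale += 1
    if best_state is not None:
        model.load_state_dict(best_state)                    # output the finally trained model
    return model
```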
9. The device for improving the performance of the classification model by using the unlabeled text data sample is characterized by comprising an acquisition module, an initialization module and a training module, wherein,
the acquisition module is used for acquiring a labeled sample set, an unlabeled sample set, a classification type set of a classification task, a verification set and a supervised learning model, wherein the classification type set of the classification task comprises all possible types of samples, the classification model needs to use the classification type set of the classification task to predict the type of an input sample, and the supervised learning model is generated by using labeled samples for training;
the initialization module is used for initializing the parameters of the supervised learning model and determining a first perturbation probability and a second perturbation probability, wherein the first perturbation probability is the probability of randomly setting an input text character to the number-0 character, the number-0 character represents a corresponding fixed and unchanging all-0 vector, and the second perturbation probability represents the probability applied by the randomized layers in the model;
the training module is used for repeatedly training the supervised learning model by using the labeled sample set and the unlabeled sample set, evaluating the performance of the supervised learning model on the verification set after each round of training, recording the current model parameters whenever the evaluation metric improves, stopping the training when a preset condition is reached, and outputting the finally trained model.
CN202111045781.2A 2021-09-07 2021-09-07 Method and device for improving classification model performance by using label-free text data samples Pending CN113806535A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111045781.2A CN113806535A (en) 2021-09-07 2021-09-07 Method and device for improving classification model performance by using label-free text data samples

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111045781.2A CN113806535A (en) 2021-09-07 2021-09-07 Method and device for improving classification model performance by using label-free text data samples

Publications (1)

Publication Number Publication Date
CN113806535A true CN113806535A (en) 2021-12-17

Family

ID=78894806

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111045781.2A Pending CN113806535A (en) 2021-09-07 2021-09-07 Method and device for improving classification model performance by using label-free text data samples

Country Status (1)

Country Link
CN (1) CN113806535A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023207557A1 (en) * 2022-04-29 2023-11-02 支付宝(杭州)信息技术有限公司 Method and apparatus for evaluating robustness of service prediction model, and computing device
CN117236409A (en) * 2023-11-16 2023-12-15 中电科大数据研究院有限公司 Small model training method, device and system based on large model and storage medium
CN117236409B (en) * 2023-11-16 2024-02-27 中电科大数据研究院有限公司 Small model training method, device and system based on large model and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination