CN115563284B - Deep multi-instance weak supervision text classification method based on semantics

Info

Publication number: CN115563284B (granted); other version: CN115563284A
Application number: CN202211301646.4A
Authority: CN (China)
Original language: Chinese (zh)
Legal status: Active (granted)
Inventors: 刘小洋, 尹娟
Applicant and assignee: Chongqing University of Technology
Application filed by Chongqing University of Technology; priority to CN202211301646.4A; published as CN115563284A; application granted and published as CN115563284B.

Classifications

    • G06F 16/35: Information retrieval of unstructured textual data; clustering; classification
    • G06F 16/9536: Retrieval from the web; querying by web search engines; search customisation based on social or collaborative filtering
    • G06F 40/279: Natural language analysis; recognition of textual entities
    • G06F 40/284: Natural language analysis; lexical analysis, e.g. tokenisation or collocates
    • G06F 40/30: Handling natural language data; semantic analysis
    • G06N 3/02: Computing arrangements based on biological models; neural networks
    • G06N 3/08: Neural networks; learning methods
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides a semantic-based deep multi-instance weakly supervised text classification method, comprising the following steps: S1, organizing a plurality of comment texts under the same piece of social content into a text package and assigning labels to the text packages, thereby obtaining topic-related packages; S2, extracting keywords that represent the topic from the topic-related packages and constructing a topic relevance vector from these keywords; and S3, feeding the topic relevance vector and the word vectors into a dual-branch neural network as vector pairs and predicting each text instance with the dual-branch network, obtaining the category of the text instance and the category of the package. The method can classify text information effectively even when social media text data changes rapidly, is difficult to annotate, and labeled data is severely lacking.

Description

Deep multi-instance weak supervision text classification method based on semantics
Technical Field
The invention relates to the technical field of natural language processing, in particular to a deep multi-instance weak supervision text classification method based on semantics.
Background
With the development of the internet and social media, massive amounts of text data are generated every day, while data analysts usually only care about data related to their own domain or to a specific topic, so domain- or topic-related data must be filtered out of this mass of data. During data crawling, the crawling rules are deliberately set loosely in order to obtain as much data as possible; this guarantees the richness of the data but also introduces a lot of topic-irrelevant data. Before analysis, the data truly related to the topic must be filtered out to ensure the accuracy of the analysis. This task can be viewed as a binary classification problem: is a text related to the topic or not? If a corresponding binary-labeled dataset existed, this would simply be a supervised text classification problem. However, in the internet era many new online expressions appear every year and natural language changes much faster, which means labeled data loses its timeliness quickly. Meanwhile, topics in social media shift with events and content is updated frequently; unless a large amount of labeled data is continuously refreshed, the labeled data can differ greatly from the data produced by new events. The prior art does not provide a text classification method for this scenario, in which social media text data changes rapidly, is difficult to annotate, and labeled data is severely lacking.
Disclosure of Invention
The invention aims to solve at least the above technical problems of the prior art, and in particular creatively provides a semantic-based deep multi-instance weakly supervised text classification method.
In order to achieve the above object, the invention provides a semantic-based deep multi-instance weakly supervised text classification method comprising the following steps:
S1, organizing a plurality of comment texts under the same piece of social content into a text package, and automatically assigning labels to the text packages by exploiting the hierarchical structure and topic relevance of social media data, thereby obtaining topic-related packages;
S2, extracting keywords that represent the topic from the topic-related packages, and constructing a topic relevance vector from these keywords; constructing the topic relevance vector avoids data imbalance and reduces collection and computation costs;
and S3, feeding the topic relevance vector and the word vectors into the dual-branch neural network as vector pairs, and predicting each text instance with the dual-branch network to obtain the category of the text instance and the category of the package.
Further, the step S2 includes the steps of:
s2-1, clustering topic related packages into a plurality of topics through an LDA algorithm, and extracting topic keywords;
S2-2, embedding and representing each keyword in the topic by adopting a fasttext model, and taking a vector average value of the topic strong-correlation keywords as a topic correlation vector;
The keywords of the topic are embedded as word vectors $v_1, v_2, \ldots, v_K$, and the topic relevance vector is thus expressed as:

$$V_T = \frac{1}{K}\sum_{k=1}^{K} v_k$$

where $V_T$ denotes the topic relevance vector and $K$ denotes the total number of strongly topic-related keywords.
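As an illustration only (not part of the patent text), the following sketch shows how such a topic relevance vector could be computed as the mean of keyword embeddings, assuming the `fasttext` Python package and a pre-trained model are available; the model path and keyword list are hypothetical.

```python
# Sketch: average keyword embeddings into a topic relevance vector V_T.
import numpy as np
import fasttext

model = fasttext.load_model("cc.en.300.bin")          # hypothetical pre-trained fasttext model
topic_keywords = ["necklace", "bracelet", "pendant"]   # hypothetical strongly related keywords

keyword_vectors = np.stack([model.get_word_vector(w) for w in topic_keywords])
V_T = keyword_vectors.mean(axis=0)                     # V_T = (1/K) * sum_k v_k
```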
Further, the method further comprises: converting each vector pair into a dense vector and inputting it into the dual-branch neural network.
The dense vector is obtained by taking the element-wise product of the word vector $h_i$ and the topic relevance vector $V_T$, adding it to the word vector, and concatenating the result with the original word vector:

$$\tilde{h}_i = \big[\, h_i,\; h_i \odot V_T + h_i \,\big]$$

where $\tilde{h}_i$ is the superimposed word vector, which is the input of the dual-branch neural network; $[\cdot,\cdot]$ denotes the concatenation of two vectors; $h_i$ denotes the word vector of the $i$-th word; $\odot$ denotes element-wise (bitwise) multiplication; and $V_T$ denotes the topic relevance vector.
The input of the dual-branch neural network can thus be expressed as:

$$x_{ij} = \big[\, \tilde{h}_1, \tilde{h}_2, \ldots, \tilde{h}_L \,\big]$$

where $x_{ij}$ is the $j$-th text instance of the $i$-th package as input to the dual-branch neural network; $\tilde{h}_1, \tilde{h}_2, \ldots, \tilde{h}_L$ denote the first, second, ..., $L$-th superimposed word vectors; $L$ denotes the number of words contained in the text; and $[\cdot, \ldots, \cdot]$ denotes a set of vectors.
Practice shows that adding the element-wise product of a word vector $h_i$ and $V_T$ to the word vector helps the neural network extract features and classify.
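A minimal numpy sketch of this input construction, assuming `word_vectors` holds the L×D word embeddings of one text and `V_T` the topic relevance vector; all names are illustrative.

```python
# Sketch: build the superimposed word vectors [h_i, h_i*V_T + h_i] for one text instance.
import numpy as np

def build_instance_input(word_vectors: np.ndarray, V_T: np.ndarray) -> np.ndarray:
    """word_vectors: (L, D) embeddings of one text; V_T: (D,) topic relevance vector.
    Returns the (L, 2D) network input for this text instance."""
    interaction = word_vectors * V_T            # element-wise product h_i * V_T (broadcast over words)
    superimposed = interaction + word_vectors   # h_i * V_T + h_i
    return np.concatenate([word_vectors, superimposed], axis=1)
```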
Further, the following operations are performed in the dual-branch neural network:
A hidden variable $Z = \{z_{ij}\}$ is introduced to characterize the relationship between text instances and packages, where $z_{ij}$ denotes the contribution of the $j$-th instance of the $i$-th package to package $i$ being positive, with $0 \le z_{ij} \le 1$. If $Z$ obeys the distribution $p(Z)$, the probability that the $i$-th package is a positive package can be expressed as:

$$p(Y_i = 1 \mid X_i) = f_{j\in\{1,\ldots,N\}}\big\{\, p_\theta(y_{ij}=1 \mid x_{ij}, z_{ij}) \cdot [z_{ij} - \gamma] \,\big\} \tag{7}$$

where $X_i$ denotes the $i$-th package; $Y_i$ denotes the label of the $i$-th package; $f$ is the mapping operator from text instances to the package; $N$ denotes the number of text instances in the package; $p_\theta(y_{ij}=1 \mid x_{ij}, z_{ij})$ denotes the probability that instance $x_{ij}$ is predicted as 1; $y_{ij}$ denotes the label of the $j$-th text instance in the $i$-th package; $x_{ij}$ denotes the $j$-th text instance of the $i$-th package; $z_{ij}$ denotes the contribution of the $j$-th instance of the $i$-th package to package $i$ being positive; and $\gamma$ is the average proportion of positive instances in a package.
The classification of the package is thereby linked to the classification of its text instances, so that the classification of each text instance can be learned from the classification of the package.
Further, $f$ is the mean operator. In the problem scenario addressed by the invention, the positive instances contained in a positive package are not sparse; using the maximum or an attention mechanism as the mapping operator easily produces falsely predicted positive packages and reduces accuracy, so the mean operator is adopted.
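For illustration, a small sketch of how the package-level probability of equation (7) could be aggregated with the mean operator from per-instance outputs; `instance_probs` and `contributions` are assumed to come from the two branches, and the names are not from the patent.

```python
# Sketch: package probability per equation (7) with f = mean operator.
import torch

def bag_positive_probability(instance_probs: torch.Tensor,
                             contributions: torch.Tensor,
                             gamma: float) -> torch.Tensor:
    """instance_probs: (n,) values of p_theta(y_ij=1 | x_ij, z_ij) for one package;
    contributions: (n,) values of z_ij in [0, 1]; gamma: average proportion of positive instances."""
    scores = instance_probs * (contributions - gamma)   # p_theta(...) * [z_ij - gamma]
    return scores.mean()                                 # mean operator over the instances
```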
Further, the following operations are also performed in the dual-branch neural network:
In multi-instance text classification, the learning goal is to minimize the cross entropy of the packages:

$$L_i = -\big[\, Y_i' \log p(Y_i \mid X_i) + (1 - Y_i')\log\big(1 - p(Y_i \mid X_i)\big) \,\big] \tag{8}$$

where $L_i$ denotes the cross entropy of the $i$-th package; $p(Y_i \mid X_i)$ denotes the probability that package $X_i$ is predicted as $Y_i$ and is the output of branch one; $X_i$ is the input feature of the $i$-th text package, i.e. the input of branch one; $Y_i$ denotes the predicted value of the $i$-th text package; and $Y_i'$ denotes the label of the $i$-th text package.
For a positive package, $Y_i' = 1$ and $1 - Y_i' = 0$, so $L_i$ is expressed as:

$$L_i = -\log p(Y_i \mid X_i) \tag{9}$$

For a negative package, $Y_i' = 0$, so $L_i$ is expressed as:

$$L_i = -\log\big(1 - p(Y_i \mid X_i)\big) \tag{10}$$

All text instances in a negative package are negative, and when all $p_\theta(y_{ij} \mid x_{ij}, z_{ij})$ and $z_{ij}$ take the negative (zero) value, $L_i$ equals 0 and reaches its minimum.
For a positive package, minimizing equation (9) is equivalent to maximizing the likelihood $p(Y_i \mid X_i)$; substituting equation (7) into $\log p(Y_i \mid X_i)$ gives equation (11), and equation (12) then introduces variational inference:
$$\log p(Y_i \mid X_i) = \log f_{j}\big\{\, p_\theta(y_{ij}=1 \mid x_{ij}, z_{ij}) \cdot [z_{ij} - \gamma] \,\big\} \tag{11}$$

$$\log p(Y_i \mid X_i) = \log \mathbb{E}_{Z\sim q}\!\left[\frac{p(z)}{q(z)}\, p_\theta(y_{ij} \mid x_{ij}, z > \gamma)\right] \tag{12}$$

where $x_{ij}$ denotes the $j$-th text instance of the $i$-th package; $y_{ij}$ denotes the label of the $j$-th text instance in the $i$-th package; $z_{ij}$ denotes the contribution of the $j$-th instance of the $i$-th package to package $i$ being positive; $\gamma$ is the average proportion of positive instances in a package; $p_\theta(y_{ij} \mid x_{ij})$ denotes the probability that instance $x_{ij}$ is predicted as $y_{ij}$; $p(z)$ denotes the $p$ distribution of the contribution $z$; $p_\theta(y_{ij} \mid x_{ij}, z)$ denotes the probability that instance $x_{ij}$ is predicted as $y_{ij}$ given that its contribution is $z$; $p_\theta(y_{ij} \mid x_{ij}, z > \gamma)$ denotes the probability that instance $x_{ij}$ is predicted as $y_{ij}$ given that its contribution $z > \gamma$; $q(z)$ denotes the $q$ distribution of the contribution $z$; and $\mathbb{E}_{Z\sim q}[\cdot]$ denotes the expectation under $Z$ obeying the $q$ distribution.
Based on the idea of variational inference, this patent uses $q_\phi(z \mid x)$ to approximate $q(z)$.
Further, each branch of the neural network is any one of TextCNN, LSTM and Transformer.
Further, the method further comprises: S4, optimizing the network parameters of the dual-branch neural network:
S4-1, the E step takes KL minimization as its target and optimizes the parameter $\phi$; the objective function is:

$$L_E = \mathrm{KL}\big(\, q_\phi(z \mid x)\; \|\; p_\theta(z \mid x, Y) \,\big) \tag{15}$$

where $\mathrm{KL}(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x, Y))$ denotes KL minimization between $q_\phi(z \mid x)$ and $p_\theta(z \mid x, Y)$; $q_\phi(z \mid x)$ denotes the output of branch one of the dual-branch neural network, i.e. the category of the text instance; $p_\theta(z \mid x, Y)$ denotes the output of branch two of the dual-branch neural network, i.e. the link between the text instance and the package; $z$ denotes the contribution; $Y$ broadly refers to the category of the package; $x$ broadly refers to a text instance in the package; and $\theta$ and $\phi$ are the parameters of the two branches.
Equation (15) measures the difference between the distribution $q_\phi(z \mid x)$ and the distribution $p_\theta(z \mid x, Y)$; minimizing $L_E$ gradually draws distribution $q$ towards distribution $p$, thereby narrowing the gap between the lower bound and the evidence.
In the neural network determined by the parameter $\theta$, the true distribution obeyed by $z$ can be approximated by the posterior distribution $p_\theta(y \mid x)$:

$$L_E = \mathrm{KL}\big(\, q_\phi(z \mid x, Y=1)\; \|\; p_\theta(y \mid x) \,\big) \tag{16}$$

where $\mathrm{KL}(q_\phi(z \mid x, Y=1)\,\|\,p_\theta(y \mid x))$ denotes KL minimization between $q_\phi(z \mid x, Y=1)$ and $p_\theta(y \mid x)$; $q_\phi(z \mid x, Y=1)$ denotes the output of branch one of the dual-branch neural network under the condition $Y=1$, i.e. the category of the text instance; and $p_\theta(y \mid x)$ denotes the value calculated by the neural network determined by the parameter $\theta$ with $\theta$ fixed; for a negative package, $p_\theta(y \mid x)$ is 0 for every instance.
Equation (16) is a further evolution of equation (15). A negative package contains only irrelevant texts, i.e. in a negative package the category of every text instance is 0 and the contribution of every instance to the package is 0, so negative packages can be handled as supervised learning. Only positive packages then remain, i.e. $Y = 1$. Under this condition, the true distribution $p_\theta(z \mid x, Y)$ obeyed by $z$ is approximated by $p_\theta(y \mid x)$, which yields equation (16).
Thus:

$$L_E = \mathrm{KL}\big(\, q_\phi(z_{ij} \mid x_{ij}, Y_i = 1)\; \|\; p' \,\big) \tag{17}$$

where $\mathrm{KL}(\cdot\,\|\,p')$ denotes KL minimization against $p'$; $q_\phi(z_{ij} \mid x_{ij}, Y_i=1)$ denotes the output of branch one of the dual-branch neural network under the condition $Y_i=1$, i.e. the category of the text instance; $Y_i = 1$ indicates that the $i$-th package is positive; $x_{ij}$ is the $j$-th text in the $i$-th package; $y_{ij}$ indicates that the $j$-th instance in the $i$-th package makes a positive contribution to the package; and $p' = p_\theta(y \mid x)$ denotes the value calculated by the neural network with the parameter $\theta$ fixed, which is 0 for every instance of a negative package.
$p'$ is substituted for $p_\theta(y \mid x)$ in equation (17); since $\log p_\theta(y_{ij} \mid x_{ij})$ has the same monotonicity as $p_\theta(y \mid x)$, $\log p_\theta(y_{ij} \mid x_{ij})$ is used in place of $p_\theta(y \mid x)$, which speeds up convergence. When the package is negative, i.e. $Y = 0$, all probability values of the distribution are set to 0, which yields equation (17).
S4-2, M step fixed parameters
$\phi$: for the same text, the KL divergence between $q_\phi(z \mid x)$ and $p_\theta(z \mid x, Y)$ then stays unchanged, and the expectation is maximized by optimizing the parameter $\theta$. The expectation of the log-likelihood is expressed as:

$$L_M = \mathbb{E}_{Z\sim q}\big[\log p_\theta(y_{ij} \mid x_{ij}, z > \gamma)\big] \tag{18}$$

where $L_M$ denotes the expectation of the log-likelihood; $\mathbb{E}_{Z\sim q}[\cdot]$ denotes the expectation under $Z$ obeying the $q$ distribution; $p_\theta(y_{ij} \mid x_{ij}, z > \gamma)$ denotes the probability, given $z > \gamma$, that instance $j$ of text package $i$ is predicted as positive text after passing through the $\theta$ branch; $z$ denotes the contribution; and $\gamma$ is a hyperparameter representing the average proportion of positive text instances in all positive packages.
According to equation (7), $L_M$ can be split into two parts bounded by $z = \gamma$: for $z > \gamma$ only $y_{ij} = 1$ is meaningful, and for $z < \gamma$ only $y_{ij} = 0$ is meaningful. The cost function $L_M$ of the M step can therefore be further decomposed as:

$$L_M = \mathbb{E}_{Z\sim q,\, z > \gamma}\big[\log p_\theta(y_{ij}=1 \mid x_{ij})\big] + \mathbb{E}_{Z\sim q,\, z < \gamma}\big[\log p_\theta(y_{ij}=0 \mid x_{ij})\big] \tag{19}$$

where $\gamma$ is a hyperparameter representing the average proportion of positive text in all positive packages; $p_\theta(y_{ij}=1 \mid x_{ij})$ denotes the probability that text instance $j$ in package $i$ is positive text; $p_\theta(y_{ij}=0 \mid x_{ij})$ denotes the probability that text instance $j$ in package $i$ is negative text; $y_{ij}=1$ means that text instance $j$ in package $i$ is positive; and $y_{ij}=0$ means that text instance $j$ in package $i$ is negative. Equation (19) is obtained by splitting equation (18) at the boundary $z = \gamma$.
Equation (19) can be converted into a cross entropy:

$$L_M = y'_{ij}\log p_\theta(y_{ij} \mid x_{ij}) + (1 - y'_{ij})\log\big(1 - p_\theta(y_{ij} \mid x_{ij})\big) \tag{20}$$

In equation (20), $p_\theta(y_{ij} \mid x_{ij})$ has the same meaning as $p_\theta(y_{ij}=1 \mid x_{ij})$ in equation (19), namely the probability that the text is positive, and $1 - p_\theta(y_{ij} \mid x_{ij})$ denotes the probability that the text is negative, consistent with $p_\theta(y_{ij}=0 \mid x_{ij})$. Equation (20) is a discretized version of equation (19), discretized into a cross entropy.
Here $y'_{ij}$ is the pseudo label of $y_{ij}$: in positive packages it is determined by $z$, and in negative packages it is always 0:

$$y'_{ij} = \begin{cases} 1, & Y_i = 1 \ \text{and}\ z_{ij} > \gamma \cdot \mathrm{mean}(z_{i\cdot}) \\ 0, & \text{otherwise} \end{cases} \tag{21}$$

where $\mathrm{mean}(\cdot)$ denotes averaging and $\gamma$ is the average proportion of positive instances in a package.
The parameter optimization of the invention differs from the traditional EM algorithm: because variational inference is introduced, the E step computes the expectation by narrowing the gap between the evidence and the variational lower bound, optimizing the parameter $\phi$ while tightening the lower bound, and the M step maximizes the expectation by optimizing the parameter $\theta$.
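The alternating optimization described in S4 can be sketched as follows; this is an illustrative PyTorch-style outline only (not the patent's reference implementation), assuming both branches output per-instance probabilities, and all module, function and variable names are assumptions.

```python
# Sketch: alternating E-step / M-step optimization of the two branches.
import torch

def bernoulli_kl(q, p, eps=1e-8):
    """KL(q || p) for per-instance Bernoulli probabilities."""
    return (q * torch.log((q + eps) / (p + eps))
            + (1 - q) * torch.log((1 - q + eps) / (1 - p + eps))).mean()

def train_epoch(theta_branch, phi_branch, bags, opt_theta, opt_phi, gamma):
    for x, bag_label in bags:                             # x: all instances of one package
        # E step: fix theta, pull q_phi(z|x) towards the theta branch's instance predictions.
        with torch.no_grad():
            p_target = theta_branch(x)                    # p_theta(y_ij = 1 | x_ij)
            if bag_label == 0:
                p_target = torch.zeros_like(p_target)     # negative package: all instances are 0
        loss_e = bernoulli_kl(phi_branch(x), p_target)
        opt_phi.zero_grad(); loss_e.backward(); opt_phi.step()

        # M step: fix phi, build pseudo labels from z and optimize theta with cross entropy.
        with torch.no_grad():
            z = phi_branch(x)
            pseudo = (z > gamma * z.mean()).float() if bag_label == 1 else torch.zeros_like(z)
        loss_m = torch.nn.functional.binary_cross_entropy(theta_branch(x), pseudo)
        opt_theta.zero_grad(); loss_m.backward(); opt_theta.step()
```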
In summary, due to the adoption of the technical scheme, the invention has the following advantages:
1) By introducing hidden variables and variational inference, the dual-branch deep network is applied to traditional multi-instance learning, which effectively improves the instance-level classification performance in multi-instance text classification tasks.
2) The proposed weakly supervised text classification method SDMI exploits the characteristics of social media text data: the tags created by social media users or the categories into which the platform divides its content are used as weak supervision information, and the model is effectively optimized with this weak supervision, thereby addressing the pain points of social media text data, namely rapid change, difficult annotation and a severe shortage of labeled data.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the invention will become apparent and may be better understood from the following description of embodiments taken in conjunction with the accompanying drawings in which:
fig. 1 is a schematic diagram of a process of extracting topic keywords and calculating topic related vectors by an LDA model according to the present invention.
FIG. 2 is a schematic diagram of an example topic relevance learning model and a contribution learning model of the present invention.
Fig. 3 is a schematic diagram of the learning-speed trends of SDMI and of supervised learning; Fig. 3(a) shows the trend of the prediction accuracy Acc on the test set, and Fig. 3(b) shows the trend of the F1 score.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
Based on the objective facts described in the background, applying purely supervised text classification to the extraction of topic data from social media has high time and labor costs, so unsupervised or weakly supervised classification that does not require labeled data is more practical.
Moreover, although text data in social media is large in volume and changes quickly, it is generated by user behavior, so the data has certain correlations and a specific hierarchical structure, which provides clues for data filtering. For example, in an interest forum, a user usually opens a discussion thread in a forum whose theme is related to the topic, and most of the text content generated in that forum is related to the forum's theme, but there is also much content that is not topic-related. The theme of the whole forum is easy to obtain, but it is difficult to determine which individual pieces of data in the internal discussion are theme-related and which are not. If the data of one forum is regarded as a package, then the topic of the package is known, some of the texts in the package are related to the topic, and the remaining texts are not and are negative; the package therefore contains positive texts and possibly negative texts. This setting matches inexact supervision in weakly supervised learning, and according to the structure of the data the problem can be further framed as multi-instance learning (MIL) under inexact supervision.
Multi-instance learning splits into two branches according to the classification emphasis: classification of the package and classification of the entities within the package. In topic-related text data filtering, the goal is to extract as much topic-related text as possible from the packages using all available clues and to filter out the negative texts, so the emphasis of this problem is how to improve the classification of the text entities.
1. Related art
1.1 Multi-instance learning
Multi-instance learning is typically inexact supervision, i.e. the granularity of the labels is coarser than the granularity of the actual task; it was introduced by Dietterich et al. in the mid-1990s. Dietterich proposed the multi-instance concept, applied it to drug activity prediction, and proposed the APR learning rules. There are three alternative APR algorithms: the noise-tolerant standard algorithm, the "outside-in" algorithm and the "inside-out" algorithm, which essentially find a hyperplane boundary for the positive instances. Early on, MIL, like other machine learning algorithms, was mainly implemented with traditional statistical learning methods. MIL also has two different prediction emphases: the relatively easy prediction of the package itself and the more difficult prediction of the instances inside the package. For example, the EM-DD algorithm proposed by Zhang, Q. et al. combines the EM algorithm with Diverse Density to steer MIL towards predicting the instances in a package. Later, support vector machine (SVM) methods were introduced into MIL; the typical MI-SVM algorithm proposed by S. Andrews et al. treats the MIL problem as a maximum-margin problem and extends the SVM learning method to a mixed-integer quadratic program that can be solved heuristically. Traditional MIL also includes sbMIL and stMIL proposed by Bunescu et al. and MICA proposed by Mangasarian et al.; some of these methods focus on intra-package instance prediction, but they are accurate only for the few positive packages with obvious features.
With the development of deep learning, deep neural networks have also been introduced into MIL. A typical approach inputs one package per batch; after the texts in the package are passed through the neural network to extract features, instance-level prediction probabilities are computed, the probabilities of all texts in the package are combined by an operator into the probability of the package, and the package label is used as supervision to optimize the network. Ilse, M. et al. proposed a gated attention mechanism as the operator from instance-level prediction probabilities to the package-level prediction probability and applied this method to image recognition. Wang, Y., Li et al. compared the effects of five operators (max pooling, average pooling, linear softmax, exponential softmax and attention) in object localization. Shi, X. et al. incorporated attention into the loss on the basis of Ilse, M. et al. The attention mechanism usually focuses on the more salient features, so the above methods work well for packages containing sparse positive instances but perform poorly when there are many positive instances within a package. The optimization process of the invention works at the instance level, so the prediction effect is good whether the positive instances in a package are sparse or dense.
Luo, Z. et al. introduced a dual-branch neural network into MIL and optimized the two networks jointly with the EM algorithm, but their method is applied to action localization, both branches are optimized with cross entropy as the loss function, and this approach has not been ideal for text classification. Li, B. et al. applied dual-branch MIL neural networks to lesion localization in medical images: one branch extracts features of the large image (the package) and the other extracts features of the small images (the instances), and the two are fused before the large image is classified. The invention also introduces a dual-branch neural network into multi-instance learning, but applies it to text classification and introduces variational inference to convert the classification of the package into instance classification prediction and hidden-variable distribution prediction, optimizing the dual-branch network with cross entropy and KL divergence.
1.2 Weakly supervised text classification
Text classification is currently dominated by supervised learning, but as the internet generates more and more data, weakly supervised methods are continuously being tried. Hingmire, S. et al. proposed pre-assigning one or more labels to topics extracted by LDA using prior knowledge of corpus statistics, and then classifying documents according to their topic proportions. In this method the classification process is completely limited by the extraction rules and the range of the training data, and the classification effect is poor. Meng, Y. et al. proposed generating labeled pseudo-documents from seed information, training a neural network model with the pseudo-documents while self-training the model with real data, with the two models sharing network parameters for text classification. This method is similar to that of Hingmire, S. et al.: noise is introduced when pseudo-labeled data is generated by rules, and by the "garbage in, garbage out" principle of machine learning the classification effect is limited. Li, C. et al. construct a topic semantic space by extracting topic-related keywords, then train a model on existing topic-labeled data to judge the relevance between a text and the topic space; when a topic without labeled data appears, only its topic-related keywords need to be extracted and the model predicts the topic relevance of the text. This approach is more advanced than the first two, but it only solves the problem of effectively predicting topics that may appear in the future; unlike the problem the invention intends to solve, it still requires the support of a large amount of training data covering a large number of known topics. The problem to be solved by the invention is to classify the topic relevance of texts when no accurately labeled data is available at all, so the invention proposes to use the tags provided by users on social platforms, or the classification labels the platforms assign to communities, as weak supervision signals, and by introducing the multi-instance idea it solves the problem of accurately classifying each text when no precisely labeled training data exists.
2. The proposed method
The invention adopts deep learning to realize end-to-end multi-instance text classification and performs binary topic-relevance classification on social media text data. In the definition of multi-instance learning, an instance is the smallest individual; in the invention an instance is a single text, a package contains multiple instances, and there are two kinds of packages, positive and negative: a positive package contains at least one positive text, while a negative package cannot contain any positive text. In the invention, several texts that have a certain relationship in social media are grouped into one package (for example, all comments under one video on Bilibili can form a text package, and all replies to one post in a Tieba forum can also form a package). When classifying a topic, content related to the topic is retrieved from social media and all of it is extracted to form several positive packages. When collecting data, the labels given by the social media platform are used as the topic labels of the packages (such as the category of a forum, or the video category tags uploaded by users). According to the definition of multi-instance learning, in order to accomplish weakly supervised instance-level classification, data under the topic is collected as positive packages, and text data from other topics is collected to form negative packages, which serves the purpose of contrastive learning. Since the task aims at binary relevance classification of the text instances in the positive packages, the label of the package is a weak supervision signal, and the binary classification of instances is completed in a weakly supervised manner.
To better introduce topic features, in addition to end-to-end deep-learning multi-instance text classification, the invention also extracts and embeds topic keywords to obtain the key topic features, i.e. the keywords, and introduces these key features into deep learning (DL) by a statistical learning method before learning. In natural language, every topic has a number of unique representative words; because the contexts and semantics of these representative keywords are similar, they are located close to each other in a dense vector space, and the small local region of the vector space containing all the keyword positions can be used to represent the vector space of the topic. The invention exploits this property: the texts in all packages under the same topic are segmented into words, and the strongly topic-related keywords are extracted with a seeded Latent Dirichlet Allocation algorithm (hereinafter guided-LDA) combined with manual selection of topic-related categories, serving as topic representative words to construct the strongly related vector of topic T. The construction process of the topic relevance vector is shown in Fig. 1.
The reference vector of topic T and the dense vector of the vectorized text instance, which lie in the same space, are used as an input pair and fed into a neural network model with two branches for prediction: one branch predicts whether the text instance contributes to the topic relevance of the package, and the other branch predicts whether the text instance is related to the topic.
The forward propagation architecture of the whole SDMI is shown in Fig. 2: all texts of one package form one batch; the texts are vectorized, combined with the topic relevance vector via the element-wise product, and integrated into the input data, which is fed into the two neural network branches respectively. Each neural network branch can be a convolutional neural network or another type of neural network layer; the invention selects convolutional neural networks in the experiments and in practical applications. The outputs of the different layers are fused and flattened into a one-dimensional vector, converted into category predictions by fully connected layers, and then the predictions are converted into predicted probabilities by softmax. The above is the forward computation of the whole network. In actual classification only the forward pass needs to be computed; during training, the forward pass is computed first and then the loss (cost) of the network prediction is computed. The whole training process is divided into an E step and an M step: in the E step the p-branch parameters are fixed and the q branch is optimized with the KL divergence as the cost function; in the M step the q-branch parameters are fixed and the p-branch parameters are optimized with the cross entropy as the cost function. Here KLD is the abbreviation of Kullback-Leibler divergence (KL divergence), denoted by L_E, and CE is the abbreviation of cross-entropy, denoted by L_M.
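As an illustration of the forward architecture just described, the following is a sketch under the assumption that both branches are small TextCNNs; the layer sizes, the use of a sigmoid output for the binary case, and all names are assumptions rather than the patent's reference implementation.

```python
# Sketch: a dual-branch TextCNN; one branch scores the topic relevance of each instance
# (p branch) and the other scores its contribution to the package (q branch).
import torch
import torch.nn as nn

class TextCNNBranch(nn.Module):
    def __init__(self, emb_dim=600, num_filters=64, kernel_sizes=(3, 4, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, num_filters, k, padding=k // 2) for k in kernel_sizes])
        self.fc = nn.Linear(num_filters * len(kernel_sizes), 1)

    def forward(self, x):                        # x: (n_instances, L, emb_dim)
        x = x.transpose(1, 2)                    # -> (n_instances, emb_dim, L) for Conv1d
        feats = [conv(x).max(dim=2).values for conv in self.convs]   # pool over word positions
        fused = torch.cat(feats, dim=1)          # fuse and flatten the outputs of the layers
        return torch.sigmoid(self.fc(fused)).squeeze(1)              # per-instance probability

class SDMIModel(nn.Module):
    def __init__(self, emb_dim=600):
        super().__init__()
        self.p_branch = TextCNNBranch(emb_dim)   # predicts p_theta(y_ij = 1 | x_ij)
        self.q_branch = TextCNNBranch(emb_dim)   # predicts the contribution z_ij

    def forward(self, bag):                      # bag: all instances of one package, as one batch
        return self.p_branch(bag), self.q_branch(bag)
```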
In order to optimize the parameters of the word-embedding vector space and the parameters of the two neural network branches, the invention uses the Kullback-Leibler divergence (KL divergence) and the lower bound (LB) as the loss functions of the two branches respectively, and adjusts the network parameters by minimizing the KL divergence and maximizing the LB.
In order to better realize the binary topic-relevance classification of text instances, keywords representing the topic are first extracted from the topic-related packages and a topic relevance vector is constructed from them; the topic relevance vector and the text instance vector are then fed into the neural network as a vector pair, the dual-branch neural network performs relevance prediction on the instances, and the category of each instance and the category of the package are then obtained from the instance predictions.
2.1 Topic relevance vector construction
A characteristic of the MIL task is that the supervision provided by the labels is very weak. In practice it is found that, when the training dataset is built from topic-related text packages and topic-unrelated text packages, the weak label supervision and the weak controllability of the neural network learning process mean that, without any topic constraint, the instance classifier eventually tends to predict the texts in negative packages as well as the texts in positive packages as topic-related. If the negative packages of the training dataset are large enough, they can cover most non-topic-related text and the problem is alleviated, but too much negative-package data easily causes data imbalance and greatly increases both the collection cost and the computation cost. Therefore, to avoid this problem, the invention adds the construction of topic relevance vectors.
Extracting topic keywords with an LDA algorithm has been proposed before, and the invention adopts this approach as well: all text instances in the positive packages of the training data are clustered into several topics by the LDA algorithm, and then a set of topics related to topic $T$, $\{t_1, \ldots, t_l\}$, and a set of unrelated topics, $\{t'_1, \ldots, t'_m\}$, are screened out. To minimize the effort of manual intervention, $l$ and $m$ should be as small as possible. Given the related topic set, the keywords of topic $T$ are ranked according to the ratio of their weights in the related topics to their weights in the unrelated topics, and the $K$ keywords most related to topic $T$ are obtained:

$$C = \{\, w_1^T, w_2^T, \ldots, w_K^T \,\} \tag{1}$$

where $C$ denotes the set of strongly related keywords of topic $T$ and $w_k^T$ denotes the $k$-th topic keyword.
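A possible implementation sketch of this keyword-selection step using gensim's LDA; the patent itself uses guided-LDA plus manual screening of related topics, so the library choice, thresholds and variable names here are illustrative assumptions.

```python
# Sketch: cluster positive-package texts into topics with LDA, then rank candidate keywords
# by their weight in manually screened related topics versus unrelated topics.
from gensim import corpora, models

def topic_keywords(tokenized_texts, related_ids, unrelated_ids,
                   num_topics=20, top_n=50, K=20):
    dictionary = corpora.Dictionary(tokenized_texts)
    corpus = [dictionary.doc2bow(t) for t in tokenized_texts]
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=num_topics)

    def weight(word, topic_ids):
        total = 0.0
        for tid in topic_ids:
            total += dict(lda.show_topic(tid, topn=top_n)).get(word, 0.0)
        return total

    candidates = {w for tid in related_ids for w, _ in lda.show_topic(tid, topn=top_n)}
    ranked = sorted(candidates,
                    key=lambda w: weight(w, related_ids) / (weight(w, unrelated_ids) + 1e-9),
                    reverse=True)
    return ranked[:K]      # the K keywords most related to topic T (set C in eq. (1))
```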
Table 1 shows the related keywords of the "Jewelry" and "Beauty" topics extracted from the Amazon Review dataset, and Table 2 shows the related keywords of the "Car" and "Sports" topics from the Toutiao News dataset.
TABLE 1. "Jewelry" and "Beauty" topic keywords extracted from the Amazon dataset (the keyword lists are reproduced only as images in the original publication).
TABLE 2. Keywords of the "Car" and "Sports" topics extracted from the Toutiao news dataset (the keyword lists are reproduced only as images in the original publication).
The invention adopts the fasttext model proposed by Joulin, A. et al. to embed each keyword of the topic and takes the average of the word vectors of the strongly related keywords as the topic relevance vector. The keywords of the topic, $w_1^T, \ldots, w_K^T$, are vectorized by word embedding into $v_1, \ldots, v_K$, and the topic relevance vector can be expressed as:

$$V_T = \frac{1}{K}\sum_{k=1}^{K} v_k \tag{2}$$

where $K$ denotes the total number of strongly topic-related keywords and $V_T$ denotes the topic relevance vector.
It is well known that dense word embeddings, in addition to mapping words into a vector space, contain syntactic and semantic information, and hidden correlations between two words can be obtained by simple operations such as subtracting two word vectors, taking their inner product, or computing their Euclidean distance. Mikolov, T. et al. proposed that subtracting two word vectors yields the relationship between the two words; Li, C.L., Zhou et al. proposed using word-vector subtraction and the element-wise product simultaneously to represent the interaction between two words. In this task, practice shows that the word vector of the $i$-th word in the text, namely the text instance vector
$h_i$, when its element-wise product with the topic vector $V_T$ is added to the word vector, better helps the neural network extract features and classify. The embedding of the text instance is therefore performed in the following manner:

$$\tilde{h}_i = \big[\, h_i,\; m_i + h_i \,\big] \tag{3}$$

where $\tilde{h}_i$ is the superimposed word vector, the input of the dual-branch neural network; $[\cdot,\cdot]$ denotes the concatenation of two vectors, so if the original $h_i$ is $D$-dimensional, $\tilde{h}_i$ is $2D$-dimensional; and $m_i$ is the element-wise product of $h_i$ and $V_T$:

$$m_i = h_i \odot V_T \tag{4}$$

where $\odot$ denotes element-wise (bitwise) multiplication.
According to the above calculations, the input of the dual-branch neural network can be expressed as:

$$x_{ij} = \big[\, \tilde{h}_1, \tilde{h}_2, \ldots, \tilde{h}_L \,\big] \tag{5}$$

where $L$ denotes the number of words contained in the text and $[\cdot, \ldots, \cdot]$ denotes a set or matrix of vectors.
3.2 Dual-branch prediction neural network
MIL is weakly supervised learning. The whole task is divided into two stages: classification of the package and classification of the instances. Because every package in the training data has a label, the classification of packages can be computed as supervised learning, whereas the instances have no labels and can only be weakly supervised through the package labels. Different tasks emphasize the two stages differently; Maron, O. et al. state that if the task aims at instance classification and the positive instances in the positive packages are not too sparse, the dependence of the training process on package labels should be reduced as much as possible. In the task addressed by this patent, the classification of instances is the focus and in the vast majority of packages the proportion of positive instances is higher than 50%, so this patent focuses on characterizing each instance.
Assume there are $2N$ text packages: the first $N$ packages, related to topic $T$, are positive packages, denoted $[X_1, X_2, \ldots, X_N]$, and the last $N$ packages are collected from other topics, unrelated to topic $T$, denoted $[X_{N+1}, X_{N+2}, \ldots, X_{2N}]$; each package contains $n$ text instances. The label of the $i$-th package is $Y_i$, and the $j$-th text in the $i$-th package is denoted $x_{ij}$; after preprocessing, the text serves as the input feature value of the neural network. The invention links the category of the package to the category of its text instances, thereby learning the category of each text instance from the category of the package. A hidden variable $Z = \{z_{ij}\}$ is introduced to characterize the relationship between text instances and packages, where $z_{ij}$ denotes the contribution of the $j$-th instance of the $i$-th package to package $i$ being positive, with $0 \le z_{ij} \le 1$. Assuming $Z$ obeys the distribution $p(z)$:

$$z_{ij} \sim p(z) \tag{6}$$

the probability that the $i$-th package $Y_i$ is positive can be expressed as:

$$p(Y_i = 1 \mid X_i) = f_{j\in\{1,\ldots,n\}}\big\{\, p_\theta(y_{ij}=1 \mid x_{ij}, z_{ij}) \cdot [z_{ij} - \gamma] \,\big\} \tag{7}$$

where $p_\theta(y_{ij}=1 \mid x_{ij}, z_{ij})$ denotes the probability that instance $x_{ij}$ is predicted as 1 (i.e. positive), $\gamma$ is the average proportion of positive instances in a package, and $f$ is the mapping operator from text instances to the package. Common choices of $f$ are the maximum, the mean and attention mechanisms; because the positive instances contained in a positive package are not sparse in the problem addressed by this patent, using the maximum or an attention mechanism as the mapping operator easily produces falsely predicted positive packages and reduces accuracy, so the mean operator is adopted.
3.2.1 Variational inference
In the multi-instance text classification to be solved by the invention, the learning goal is to minimize the cross entropy of the packages:

$$L_i = -\big[\, Y_i' \log p(Y_i \mid X_i) + (1 - Y_i')\log\big(1 - p(Y_i \mid X_i)\big) \,\big] \tag{8}$$

where $L_i$ denotes the cross entropy of the $i$-th package; $p(Y_i \mid X_i)$ denotes the probability that package $X_i$ is predicted as $Y_i$ and is the output of branch one; $X_i$ is the input feature of the $i$-th text package, i.e. the input of branch one; $Y_i$ denotes the predicted value of the $i$-th text package; and $Y_i'$ denotes the label of the $i$-th text package.
For a positive package, $Y_i' = 1$ and $1 - Y_i' = 0$, so $L_i$ can be expressed as:

$$L_i = -\log p(Y_i \mid X_i) \tag{9}$$

For a negative package, $Y_i' = 0$, so $L_i$ can be expressed as:

$$L_i = -\log\big(1 - p(Y_i \mid X_i)\big) \tag{10}$$

By definition, all text instances in a negative package are negative, and when all $p_\theta(y_{ij} \mid x_{ij}, z_{ij})$ and $z_{ij}$ take the negative (zero) value, $L_i$ equals 0 and reaches its minimum. Negative packages are therefore learned with supervised logic.
For a positive package, minimizing equation (9) is equivalent to maximizing the likelihood $p(Y_i \mid X_i)$. Substituting equation (7) into $\log p(Y_i \mid X_i)$ gives equation (11), and equation (12) introduces variational inference:

$$\log p(Y_i \mid X_i) = \log f_{j}\big\{\, p_\theta(y_{ij}=1 \mid x_{ij}, z_{ij}) \cdot [z_{ij} - \gamma] \,\big\} \tag{11}$$

$$\log p(Y_i \mid X_i) = \log \mathbb{E}_{Z\sim q}\!\left[\frac{p(z)}{q(z)}\, p_\theta(y_{ij} \mid x_{ij}, z > \gamma)\right] \tag{12}$$

Based on the idea of variational inference, this patent uses $q_\phi(z \mid x)$ to approximate $q(z)$; $q_\phi(z \mid x)$ is represented by one neural network, and $p_\theta(y_{ij} \mid x_{ij})$ is represented by another neural network. The overall method therefore introduces a dual-branch neural network: one branch predicts the class of the text instance and the other links the text instance to the package.
The task thus has two prediction emphases, $p_\theta(y_{ij}=1 \mid x_{ij})$ and $z_{ij}$. This patent accomplishes both predictions with a dual-branch neural network, whose branches can be TextCNN, LSTM, or even the currently popular Transformer. These network structures have been described thoroughly in the literature and are not repeated here; $p_\theta(y_{ij}=1 \mid x_{ij})$ and $q_\phi(z_{ij} \mid x_{ij})$ are used to denote the outputs of the two branches, where $\theta$ and $\phi$ are the parameters of the two branches.
3.3 Network parameter optimization
As can be seen from the description of the network structure in the previous section, for all positive packages the input is $X_i$ and the output is $p_\theta(Y_i \mid X_i)$, which can be supervised by the package labels; the learning goal of the whole network is to maximize the log-likelihood:

$$L = \log p(Y_i \mid X_i) \tag{13}$$
in conjunction with the Jensen inequality, equation (12) may be further derived:
$$\log p(Y_i \mid X_i) \ge \mathbb{E}_{Z\sim q}\!\left[\log\!\left(\frac{p(z)}{q(z)}\, p_\theta(y_{ij} \mid x_{ij}, z > \gamma)\right)\right] \tag{14}$$

According to the definition of the evidence lower bound (ELBO) in statistics, $\log p(Y \mid X)$ can be regarded as the evidence, $\mathbb{E}_{Z\sim q}[\log p_\theta(y_{ij}=1 \mid x_{ij})]$ is the variational lower bound, and the difference between the evidence and the variational lower bound is $\mathrm{KL}(q(z \mid x)\,\|\,p(z \mid x, Y))$.
Maximizing $p(Y_i \mid X_i)$ can be achieved by optimizing the parameters with an EM algorithm. In the traditional EM algorithm, the E step computes the expectation, i.e. the variational lower bound, and the M step searches for the parameters that maximize the expectation, thereby performing parameter estimation. Because this patent introduces variational inference, the E step computes the expectation by narrowing the gap between the evidence and the variational lower bound, optimizing the parameter $\phi$ while tightening the lower bound, and the M step maximizes the expectation by optimizing the parameter $\theta$.
3.3.1 E step: narrowing the difference
The goal of the E step is to narrow the difference between the lower bound and the evidence, bringing the lower bound close to the evidence. From equation (14), the smaller the KL divergence, the closer the lower bound is to the evidence $\log p(Y \mid X)$; the E step therefore takes KL minimization as its target and optimizes the parameter $\phi$ with the objective function:

$$L_E = \mathrm{KL}\big(\, q_\phi(z \mid x)\; \|\; p_\theta(z \mid x, Y) \,\big) \tag{15}$$

The classification of packages and of instances is a binary task in which $z$ and $y_{ij}$ are related; for example, the contribution $z$ of a negative text instance to a package judged positive can be regarded as 0. This patent therefore assumes that, in the neural network determined by the parameter $\theta$, the true distribution obeyed by $z$ can be approximated by the posterior distribution $p_\theta(y \mid x)$:

$$L_E = \mathrm{KL}\big(\, q_\phi(z \mid x, Y=1)\; \|\; p_\theta(y \mid x) \,\big) \tag{16}$$

For the text instances in positive packages, $p_\theta(y=1 \mid x)$ is the value calculated by the neural network determined by the parameter $\theta$ with $\theta$ fixed, and for negative packages $p_\theta(y \mid x)$ should be 0 for every instance; therefore:

$$L_E = \mathrm{KL}\big(\, q_\phi(z_{ij} \mid x_{ij}, Y_i = 1)\; \|\; p_\theta(y_{ij} \mid x_{ij}) \,\big) \tag{17}$$

From the meaning of equation (17), $L_E$ can be understood as requiring that the contribution of a text instance to the package being predicted positive and the relevance of the instance to the topic obey, as far as possible, the same distribution.
3.3.2 M step: maximizing the lower bound to optimize θ
As defined by equation (14), the cost function of the whole problem consists of two parts: the expectation of the log-likelihood of the predicted probabilities, and the KL divergence between the hidden variable and the true distribution. The E step passively pushes the expectation towards the maximum likelihood by minimizing the KL divergence, fixing the parameter $\theta$ and adjusting the parameter $\phi$; the M step fixes the parameter $\phi$, so that for the same text the KL divergence between $q_\phi(z \mid x)$ and $p_\theta(z \mid x, Y)$ is unchanged, and then maximizes the expectation of the log-likelihood by optimizing the parameter $\theta$. The expectation of the log-likelihood, $L_M$, is expressed as:

$$L_M = \mathbb{E}_{Z\sim q}\big[\log p_\theta(y_{ij} \mid x_{ij}, z > \gamma)\big] \tag{18}$$

According to equation (7), $L_M$ can be split into two parts bounded by $z = \gamma$: for $z > \gamma$ only $y_{ij} = 1$ is meaningful, and for $z < \gamma$ only $y_{ij} = 0$ is meaningful. The cost function $L_M$ of the M step can therefore be further decomposed as:

$$L_M = \mathbb{E}_{Z\sim q,\, z > \gamma}\big[\log p_\theta(y_{ij}=1 \mid x_{ij})\big] + \mathbb{E}_{Z\sim q,\, z < \gamma}\big[\log p_\theta(y_{ij}=0 \mid x_{ij})\big] \tag{19}$$

where $\gamma$ is a hyperparameter that measures how large a contribution counts as an effective contribution; in practical use it is set to the average proportion of positive text in all positive packages. The first part represents the log-likelihood when $z > \gamma$ and the second part the log-likelihood when $z < \gamma$.
Equation (19) can be converted into a cross entropy:

$$L_M = y'_{ij}\log p_\theta(y_{ij} \mid x_{ij}) + (1 - y'_{ij})\log\big(1 - p_\theta(y_{ij} \mid x_{ij})\big) \tag{20}$$

where $y'_{ij}$ is the pseudo label of $y_{ij}$: in positive packages it is determined by $z$, and in negative packages it is always 0:

$$y'_{ij} = \begin{cases} 1, & Y_i = 1 \ \text{and}\ z_{ij} > \gamma \cdot \mathrm{mean}(z_{i\cdot}) \\ 0, & \text{otherwise} \end{cases} \tag{21}$$

Here $\gamma$ is the average proportion of positive instances in a package, in the range $(0, 1)$; it is an empirical value determined by the density of positive instances in the dataset, independent of whether a package is positive, but related to the recall of the recoverable instances. $\mathrm{mean}(\cdot)$ denotes averaging; according to equation (19), $\gamma$ is the point at which $y_{ij}$ is cut, the mean of the hidden variable is introduced here, and multiplying by $\gamma$ normalizes the threshold so that it lies in the same numerical range as $z_{ij}$.
The optimization of $q_\phi$ is similar: when none of the instances in a negative package contributes to the topic relevance of the package, the contribution pseudo labels are set to 0, and in positive packages the instances with a low topic-relevance probability cannot have a high contribution, so their contribution pseudo labels are also set to 0.
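A small sketch, in the assumed form suggested by equation (21) and the paragraph above, of how the pseudo labels for the two branches could be built for one package; the thresholding of the contribution labels by γ is an assumption, and all names are illustrative.

```python
# Sketch: pseudo labels for the M step (y'_ij) and for the q branch (contribution labels).
import torch

def build_pseudo_labels(z: torch.Tensor, p_pos: torch.Tensor,
                        bag_is_positive: bool, gamma: float):
    """z: (n,) contributions from the q branch; p_pos: (n,) p_theta(y_ij=1|x_ij) from the p branch."""
    if not bag_is_positive:
        # Negative package: every instance is negative and contributes nothing.
        return torch.zeros_like(z), torch.zeros_like(z)
    y_pseudo = (z > gamma * z.mean()).float()   # eq. (21)-style threshold at gamma * mean(z)
    z_pseudo = (p_pos > gamma).float()          # low topic-relevance probability -> contribution label 0 (assumed rule)
    return y_pseudo, z_pseudo
```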
3 Analysis of experimental results
3.1 Datasets
The invention uses three datasets, AG News, Amazon Reviews and Toutiao News, to compare the effect of SDMI with that of other methods. The topics corresponding to AG News and Toutiao News are news categories; the news texts are published by official outlets, so their format and content are relatively standard. Amazon Reviews corresponds to products in subdivided domains and consists of users' subjective evaluations of the products; the topics it contains are finer-grained, and because the texts are user-generated their format and grammar are more casual, which simulates the user-generated content of social media. AG News and Amazon Reviews are English text and Toutiao News is Chinese data, which verifies the adaptability of SDMI to different languages.
TABLE 3. Description of the experimental datasets (the dataset statistics are reproduced only as an image in the original publication).
AG News: AG News contains 4 categories, Business, Sci_Tech, Sports and World; the training set contains 30000 texts per category and the test set 1900 texts per category. Each category of the training and test sets is processed independently: the texts of that category serve as topic-related text instances and the other 3 categories serve as topic-unrelated text instances. Each package contains 50 texts, and topic-related and unrelated instances are mixed at a randomly chosen ratio between 1:2 and 4:1 to form positive text packages; negative packages contain only negative text instances, and the texts in negative packages do not overlap with the negative texts in positive packages, thereby simulating the structure and topic-relevance state of texts in social media.
Amazon Reviews: Amazon Reviews has several versions; this patent adopts the classic 2013 version, which covers 24 product groups. 4 products with a moderate number of reviews are selected as potential positive topics; the reviews of each product serve as the positively related texts of that topic, and the reviews of other products are negative text instances. Text packages are combined in the same way as for AG News, and the text packages of each topic are split into a training set and a test set at a ratio of 4:1.
Toutiao News: Toutiao News contains 12 topics; 4 topics with a large amount of text are selected as potential positive topics. The texts under each topic are the relevant positive texts of that topic, data is randomly drawn from the other 11 topics as unrelated texts, and packages are combined in the same way as for AG News. The text packages of each topic are split into a training set and a test set at a ratio of 4:1.
After data preparation, as shown in Table 3, the 3 datasets together form 12 topic text-package collections.
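For illustration, a sketch of how text packages could be assembled from a topic-labeled corpus in the way just described (50 texts per package, positive share drawn between 1:2 and 4:1); de-duplication between negative packages and the negative texts of positive packages is omitted, and the function and variable names are not from the patent.

```python
# Sketch: assemble positive and negative text packages from topic-labeled texts.
import random

def build_bags(topic_texts, other_texts, n_bags, bag_size=50):
    """topic_texts: texts of the target topic; other_texts: texts of other topics."""
    positive_bags, negative_bags = [], []
    for _ in range(n_bags):
        ratio = random.uniform(1 / 3, 4 / 5)                  # positive share between 1:2 and 4:1
        n_pos = max(1, round(bag_size * ratio))
        bag = random.sample(topic_texts, n_pos) + random.sample(other_texts, bag_size - n_pos)
        random.shuffle(bag)
        positive_bags.append((bag, 1))
        negative_bags.append((random.sample(other_texts, bag_size), 0))
    return positive_bags, negative_bags
```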
3.2 Experimental procedure and parameter set-up
In the experiments, in order to show the advantages of the proposed SDMI under identical conditions, other unsupervised and weakly supervised methods are introduced for comparison in addition to experiments with the proposed method on the datasets; meanwhile, to evaluate the gap between SDMI and supervised methods that use manually labeled data, supervised text classification is also run with the same neural network structure and parameters.
3.2.1 Introduction to the comparison algorithms
guided-LDA: the guided-LDA adds seed keywords to each topic on the basis of the LDA, and the direction of clustering is partially constrained by the seed keywords. In the experiment, topic keywords extracted from each topic are used as seeds of part of clusters, the part of clusters are used as topic related clusters, and the clusters with empty seed words are used as topic uncorrelated clusters, so that the classification effect is achieved.
MISSVM and SbMIL: the MISSVM algorithm treats the MIL problem as a maximum-margin problem and extends the SVM learning method to solve a mixed-integer quadratic program. The algorithm solves for the maximum margin between positive and negative packets, treats the margin of a packet as the margin of its instances, and predicts the polarity of each instance. The method was originally applied in experiments on the MUSK data set; here the input feature extraction is modified so that it can be applied to text classification. SbMIL, like MISSVM, is built on the SVM algorithm.
Weighted-MIL: a multi-instance regression method that vectorizes each instance in a packet, estimates the weight of each instance vector for each category, computes the weighted average of all instances in the packet, and uses this weighted average to determine the category of the packet through an operator.
Attention base and Gated Attention base: packets are input in batches, the features of the texts in a packet are extracted by a neural network, instance-level prediction probabilities are computed, and the probability values of all texts in a packet are combined by an operator into a packet-level probability, which is used together with the packet label to optimize the network. Attention base uses an attention mechanism as the operator to aggregate text probability values into a packet probability value; Gated Attention base adds a gating mechanism on top of the attention mechanism.
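For reference, a minimal PyTorch sketch of the attention operator used by the Attention base baseline is given below; the layer sizes are illustrative assumptions rather than values fixed by this patent:

```python
import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    """Aggregate instance-level features of one packet into a packet-level probability."""

    def __init__(self, feat_dim=128, attn_dim=64):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, attn_dim),
            nn.Tanh(),
            nn.Linear(attn_dim, 1),
        )
        self.classifier = nn.Linear(feat_dim, 1)

    def forward(self, instance_feats):            # (n_instances, feat_dim)
        scores = self.attention(instance_feats)   # (n_instances, 1)
        weights = torch.softmax(scores, dim=0)    # attention weights over instances
        packet_feat = (weights * instance_feats).sum(dim=0)  # (feat_dim,)
        return torch.sigmoid(self.classifier(packet_feat))   # packet probability
```

In the gated variant, the Tanh branch is typically multiplied element-wise by a parallel Sigmoid branch before the final scoring layer.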
CNN-supervised: a general text convolution classification algorithm with words as the embedding units.
3.2.2 Experimental parameter settings
During the experiments, LDA topic clustering is first performed on the texts in the positive packets. Since the positive-to-negative instance ratio of the combined data of all positive packets is set to 3:2 in this task and the negative instances come from more than 10 topic categories, the number of clustered topics is set to 20. Related topics are determined by manual confirmation; the top-50 keywords of every topic are extracted, the ratio of each keyword's weight within the topic to its weight outside the topic is computed, and the top-20 topic keywords are selected according to this ratio ranking.
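One possible realization of this keyword-selection step, assuming a scikit-learn LDA over space-tokenized text and a hypothetical related_topic_ids list holding the manually confirmed topics, is sketched below:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def topic_keywords(positive_bag_texts, related_topic_ids, n_topics=20, top_n=20):
    """Cluster texts from positive packets and rank candidate topic keywords."""
    vectorizer = CountVectorizer(max_features=20000)
    counts = vectorizer.fit_transform(positive_bag_texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(counts)

    vocab = np.array(vectorizer.get_feature_names_out())
    word_topic = lda.components_                  # (n_topics, vocab_size) pseudo-counts
    keywords = {}
    for t in related_topic_ids:                   # topics confirmed manually as related
        in_topic = word_topic[t]
        out_topic = word_topic.sum(axis=0) - in_topic
        cand = np.argsort(in_topic)[::-1][:50]    # top-50 words of this topic
        ratio = in_topic[cand] / (out_topic[cand] + 1e-9)
        best = cand[np.argsort(ratio)[::-1][:top_n]]
        keywords[t] = vocab[best].tolist()        # top-20 by in-topic / out-of-topic weight ratio
    return keywords
```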
A dictionary is generated for each data set to determine the size of the network's word embedding layer, and the embedding layer of the neural network is initialized with the fastText English and Chinese pre-trained models.
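A sketch of such an initialization, assuming the official fasttext Python package, a pre-trained model file such as cc.en.300.bin, and a hypothetical word2idx dictionary built from the data-set vocabulary:

```python
import numpy as np
import torch
import torch.nn as nn
import fasttext

def build_embedding_layer(word2idx, model_path="cc.en.300.bin"):
    """Initialize the network's word embedding layer from a fastText model."""
    ft = fasttext.load_model(model_path)
    weights = np.zeros((len(word2idx), ft.get_dimension()), dtype=np.float32)
    for word, idx in word2idx.items():
        weights[idx] = ft.get_word_vector(word)   # OOV words fall back to subword vectors
    return nn.Embedding.from_pretrained(torch.from_numpy(weights), freeze=False)
```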
The neural network can adopt small network structures such as TextCNN and BiLSTM as the feature extractor, and the network parameters φ and θ are trained in the manner described in Section 2. In practical applications, pre-trained large language models such as BERT and RoBERTa can be used as initialization parameters and then fine-tuned with a small learning rate during training. The aim of the experiment is to compare SDMI against unsupervised methods, other weakly supervised methods, and purely supervised classification, rather than to explore how different classical neural networks perform on this task; since two branches of the neural network need to be trained, using a large pre-trained model would make the experiments very inefficient, so the completely identical TextCNN network structure is used to build the network in all methods.
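A compact TextCNN feature extractor of the kind that can serve as the shared backbone of both branches is sketched below; the kernel sizes, filter count and two-class output head are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Convolutional feature extractor over (superimposed) word vectors."""

    def __init__(self, embed_dim=300, n_filters=100, kernel_sizes=(3, 4, 5), n_classes=2):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, n_filters, k) for k in kernel_sizes]
        )
        self.fc = nn.Linear(n_filters * len(kernel_sizes), n_classes)

    def forward(self, x):                          # x: (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)                      # -> (batch, embed_dim, seq_len)
        feats = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(feats, dim=1))    # instance-level logits
```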
The hyper-parameter γ measures the proportion of positive instances in a positive packet. In the news data sets used in the experiments, because of the way the packets are constructed, the average proportion of positive instances is higher than that of negative instances, and γ is set to 0.4 according to this proportion. In real-world social media data applications, the value of γ is set according to the average proportion of topic-related text.
During training, both SDMI and the supervised binary classifier use a learning rate of 1e-5, and the maximum number of iterations is set to 200. Over-fitting detection on the test set is used to stop training early: for SDMI, training stops if the test-set loss does not decrease for 20 consecutive evaluations, and for the supervised binary classifier, training stops if the test-set loss does not decrease for 5 consecutive evaluations.
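The early-stopping rule can be sketched as follows; train_one_epoch and evaluate are hypothetical helpers standing in for the actual training and evaluation routines:

```python
import math
import torch

def train_with_early_stopping(model, train_loader, test_loader,
                              max_epochs=200, patience=20, lr=1e-5):
    """Stop training once the test-set loss has not decreased for `patience` epochs."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_loss, stale_epochs = math.inf, 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_loader, optimizer)   # hypothetical helper
        test_loss = evaluate(model, test_loader)          # hypothetical helper
        if test_loss < best_loss:
            best_loss, stale_epochs = test_loss, 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:                  # early stop
                break
    return model
```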
For the experimental settings of the other methods: the traditional multi-instance methods take tf-idf bag-of-words features as input; in the experiment, the CountVectorizer and TfidfTransformer modules of the sklearn library are used to extract the tf-idf features of each data set, which are then fed into the algorithms for training so that each text instance in the test-set text packets can be predicted. The other deep learning methods all adopt the same neural network structure and hyper-parameter settings as SDMI; the weakly supervised methods take the same text packets as input and use the packet labels as supervision to train the model, while CNN-supervised uses the labels of individual texts as supervision for training.
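The tf-idf feature extraction for the traditional multi-instance baselines can be sketched with the sklearn modules mentioned above:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

def tfidf_features(train_texts, test_texts):
    """tf-idf vectors used as input to the traditional multi-instance baselines."""
    vectorizer = CountVectorizer()
    transformer = TfidfTransformer()
    train_tfidf = transformer.fit_transform(vectorizer.fit_transform(train_texts))
    test_tfidf = transformer.transform(vectorizer.transform(test_texts))
    return train_tfidf, test_tfidf
```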
3.3 Experimental results
3.3.1 Performance analysis
(1) Evaluation index
Common model evaluation indices include Accuracy (hereinafter referred to as Acc), Precision, Recall, the F1 value and the like; the invention adopts the Acc and F1 indices to evaluate the effect of the algorithm model.
Acc measures the average prediction accuracy of the model over all test text instances:

Acc = (TP + TN) / (TP + TN + FP + FN)

where TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives, respectively.
Precision represents the proportion of true positive instances among the text instances predicted as positive, corresponding to the precision of the text instances:

Precision = TP / (TP + FP)
Recall represents the proportion of correctly predicted instances among all true positive instances, which corresponds to the recall ratio of the text instances:

Recall = TP / (TP + FN)
The F1 value considers both the Precision and the Recall of positive text instances; F1 is high only when Precision and Recall are both high:

F1 = 2 · Precision · Recall / (Precision + Recall)
Therefore, the invention adopts the Acc and F1 indexes to evaluate the model effect simultaneously.
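For reference, the four indices above reduce to a few lines of Python over binary instance-level predictions:

```python
def evaluation_metrics(y_true, y_pred):
    """Acc, Precision, Recall and F1 over binary instance-level predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    acc = (tp + tn) / max(len(y_true), 1)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return acc, precision, recall, f1
```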
(2) Performance comparative analysis
Experiments on the three different data sets, AG News, Toutiao News and Amazon Reviews, show that SDMI is greatly improved in Acc and F1 compared with unsupervised topic clustering and the other weakly supervised text classification methods, and that its gap with supervised classification, which relies entirely on annotated data, is very small.
TABLE 4 Accuracy and F1 value comparison on AG News
TABLE 5 Accuracy and F1 value comparison on Toutiao News
TABLE 6 Accuracy and F1 value comparison on Amazon Reviews
Tables 4-6 show that the performance of SDMI on the three data sets is clearly improved compared with the other unsupervised and semi-supervised methods. In principle, SDMI converts traditional multi-instance learning into prediction for individual texts through the dual-branch neural network and the EM algorithm, computes the loss function on individual texts and optimizes the network accordingly, and the experimental results confirm that this approach is suitable for instance-level prediction. The other methods, although intended for instance-level prediction, ultimately rely on packet-level optimization of the model, so their classification at the instance level is naturally less well controlled.
Meanwhile, the accuracy of SDMI over the topics in the test sets is only 3.219% lower than that of supervised classification, and its F1 value is only 2.602% lower. This shows that SDMI can, to a large extent, learn instance-level classification from the packet labels and the semantic features of the text.
In addition, the experimental results on the different data sets show that SDMI is suitable for screening topic-related data in different languages and at different topic granularities.
3.3.2 Training speed analysis
Besides prediction accuracy, the invention also compares the learning speed and the trend of the text recall values of the different methods during training. Fig. 3 shows the trends of the test-set prediction accuracy Acc and the recall rate during training for SDMI and the supervised classification method, using the Car topic of Toutiao News as the training and test data.
Fig. 3(a) shows the trend of the test-set prediction accuracy with the number of training iterations. It can be seen that SDMI, as a weakly supervised learning method whose overall architecture contains two deep learning networks with parameters optimized alternately in an E-M manner, indeed learns much more slowly than supervised learning under the same conditions. Fig. 3(b) is the recall-rate curve: the recall rate is more sensitive in the supervised approach, but in terms of trend the recall of SDMI converges more stably.
Although the supervised method does have an advantage in learning speed, in practical engineering applications its complete reliance on manually annotated data is a major drawback for social media text mining: producing a large amount of continuously updated annotated data carries a great cost in time and manpower. Compared with this cost, the slower learning speed during training is almost negligible.
In summary, the proposed SDMI is comparatively analyzed on different languages (Chinese and English), different text types and different topics; the new method is tested and evaluated on multiple topics of the AG News, Toutiao News and Amazon Reviews data sets, achieving the desired effect of weakly supervised text classification.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.

Claims (5)

1. The semantic-based deep multi-instance weak supervision text classification method is characterized by comprising the following steps of:
s1, organizing a plurality of comment texts under the same social content into a text package, and distributing labels to the text package, thereby obtaining a topic related package;
s2, extracting keywords representing topics from the topic related package, and constructing topic related vectors through the keywords;
s3, inputting topic related vectors and word vectors as vector pairs into a double-branch neural network, and predicting a text instance through the double-branch neural network to obtain the category of the text instance and the category of a package;
the following operations are performed in the dual-branch neural network:
introducing a hidden variable Z = {z_ij} to characterize the relationship between text instances and packets, where z_ij represents the contribution of the j-th instance of the i-th packet to packet i being a positive packet, and 0 ≤ z_ij ≤ 1; if Z obeys the distribution p(Z), the probability that the i-th packet is a positive packet can be expressed as:
p(Y_i = 1 | X_i) = f_{j∈{1,...,N}} { p_θ(y_ij = 1 | x_ij, z_ij) · [z_ij − γ] }   (7)
wherein f is a mapping operator from text instances to the packet, and f is taken as the mean operator;
N represents the number of instances in the packet;
p_θ(y_ij = 1 | x_ij, z_ij) represents the probability that instance x_ij is predicted as 1;
the following operations are also performed in the dual-branch neural network:
in multi-instance text classification, the goal of learning is to minimize the cross entropy of the packets:
L_i = −[ Y_i′ log p(Y_i | X_i) + (1 − Y_i′) log(1 − p(Y_i | X_i)) ]   (8)
wherein L_i represents the cross entropy of the i-th packet;
p(Y_i | X_i) represents the probability that X_i is predicted as Y_i and is the output of branch one;
X_i is the input feature of the i-th text packet and the input of branch one;
Y_i represents the predicted value of the i-th text packet;
Y_i′ represents the annotation of the i-th text packet;
for a positive packet, Y_i′ = 1 and 1 − Y_i′ = 0, thus L_i is expressed as:

L_i = −log p(Y_i | X_i)
for a negative packet, Y_i′ = 0, thus L_i is expressed as:

L_i = −log(1 − p(Y_i | X_i))
all text instances in a negative packet are negative instances, and when all p_θ(y_ij | x_ij, z_ij) and z_ij take the values of negative instances, L_i equals 0, reaching its minimum value;
for a positive packet, minimizing L_i is equivalent to maximizing the likelihood value of p(Y_i | X_i); substituting formula (7) into the likelihood gives formula (11), and variational inference is then introduced into formula (11), yielding:

log p(Y_i | X_i) ≥ E_{Z~q}[ log p_θ(y_ij | x_ij, z > γ) ] − KL( q(z) ‖ p(z) )
Wherein x is ij A j-th text table representing the i-th packet;
y ij a label representing the jth text table in the ith package;
z ij representing the contribution of the jth instance of the ith packet to the forward packet contribution of packet i;
gamma is the average proportion of positive examples in the package;
p θ (y ij |x ij ) Representing example x ij Is predicted as y ij Probability of (2);
p (z) represents the p distribution of contribution z;
p θ (y ij |x ij z) represents x ij The contribution degree of (2) is z, example x ij Is predicted as y ij Probability of (2);
p θ (y ij |x ij z > γ) represents x ij Contribution z > γ, and example x ij Is predicted as y ij Probability of (2);
q (z) represents the q distribution of the contribution z;
E Z~q [·]mean value under the condition that Z obeys q distribution is represented.
2. The semantic-based deep multi-instance weakly-supervised text classification method of claim 1, wherein S2 comprises the steps of:
s2-1, clustering topic related packages into a plurality of topics through an LDA algorithm, and extracting topic keywords;
s2-2, embedding and representing each keyword in the topic by adopting a fasttext model, and taking a vector average value of the topic strong-correlation keywords as a topic correlation vector;
the keywords of a topic {w_1, w_2, ..., w_K} are embedded and expressed as {v_1, v_2, ..., v_K}; the topic correlation vector is thus expressed as:

V_T = (1/K) Σ_{k=1}^{K} v_k
wherein V_T represents the topic correlation vector;
K represents the total number of topic strongly-related keywords.
3. The semantic-based deep multi-instance weakly-supervised text classification method of claim 1, further comprising: converting the vector pair into a dense vector and inputting the dense vector into a dual-branch neural network;
the dense vector is obtained by combining the word vector with the topic correlation vector V_T through an inner (bitwise) product and superimposing the result onto the word vector; the formula is as follows:

ṽ_l = [ v_l , v_l × V_T ]

wherein ṽ_l is the superimposed word vector and is the input of the dual-branch neural network;
[·, ·] represents the connection of two vectors;
v_l represents a word vector;
× represents the bitwise multiplication of the matrices;
V_T represents the topic correlation vector;
thus, the input to the dual-branch neural network can be expressed as:

x_ij = { ṽ_1 , ṽ_2 , ... , ṽ_L }

wherein x_ij is the j-th text instance of the i-th packet and is the input to the dual-branch neural network;
ṽ_1 represents the first superimposed word vector, ṽ_2 represents the second superimposed word vector, and ṽ_L represents the L-th superimposed word vector;
L represents the number of words contained in the text;
{·} on the right represents a set of vectors.
4. The semantic-based deep multi-instance weakly supervised text classification method of claim 1, wherein the neural network is any one of TextCNN, LSTM, and Transformer.
5. The semantic-based deep multi-instance weakly-supervised text classification method of claim 1, further comprising: s4, optimizing network parameters of the dual-branch neural network:
S4-1, the E step takes minimizing the KL divergence as its objective and optimizes the parameter φ; the objective function is:

L_E = KL( q_φ(z | x, Y_i = 1) ‖ p_θ(y | x) )

wherein KL(q ‖ p) denotes performing KL minimization between q and p;
q_φ(z | x, Y_i = 1) represents the output of branch one of the dual-branch neural network under the condition Y = 1, which is the category of the text instance;
Y_i = 1 indicates that the i-th packet is positive;
p_θ(y | x) represents the value calculated, with θ fixed, by the neural network determined by the parameter θ; for negative packets, p_θ(y | x) is 0 for every instance;
S4-2, the M step fixes the parameter φ, keeping the KL divergence between q_φ(z | x, Y) and p_θ(z | x, Y) unchanged for the same text; q_φ(z | x, Y) represents the output of branch one of the dual-branch neural network, i.e. the category of the text instance, and p_θ(z | x, Y) represents the output of branch two of the dual-branch neural network, i.e. the link between the text instance and the packet;
the expectation is then maximized by optimizing the parameter θ, and the expectation for the log-likelihood value is expressed as follows
L_M = E_{Z~q}[ log p_θ(y_ij | x_ij, z > γ) ]   (18)
wherein L_M represents the expectation of the log-likelihood value;
E_{Z~q}[·] represents the mean value under the condition that Z obeys the q distribution;
p_θ(y_ij | x_ij, z > γ) represents the probability, computed by the θ branch, that instance j of the i-th text packet is predicted as positive text;
z represents the degree of contribution;
γ is a hyper-parameter;
according to formula (7), L_M can be split into two parts bounded by z = γ: for z > γ only y_ij = 1 is meaningful, and for z < γ only y_ij = 0 is meaningful; therefore, the cost function L_M of the M step can be further decomposed into:

L_M = E_{z>γ}[ log p_θ(y_ij = 1 | x_ij) ] + E_{z≤γ}[ log p_θ(y_ij = 0 | x_ij) ]   (19)
wherein γ is a hyper-parameter;
p_θ(y_ij = 1 | x_ij) represents the probability that text instance j in packet i is positive text;
p_θ(y_ij = 0 | x_ij) represents the probability that text instance j in packet i is negative text;
y_ij = 1 means that text instance j in packet i is positive;
y_ij = 0 means that text instance j in packet i is negative;
equation (19) can be converted into cross entropy
L_M = −[ y′_ij log p_θ(y_ij | x_ij) + (1 − y′_ij) log(1 − p_θ(y_ij | x_ij)) ]   (20)
Wherein y' ij Is y ij In positive packets, it is determined by z, in negative packets, all 0;
Figure FDA0004220770810000052
wherein mean (·) represents averaging;
gamma is the average proportion of positive examples in the packet.
CN202211301646.4A 2022-10-24 2022-10-24 Deep multi-instance weak supervision text classification method based on semantics Active CN115563284B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211301646.4A CN115563284B (en) 2022-10-24 2022-10-24 Deep multi-instance weak supervision text classification method based on semantics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211301646.4A CN115563284B (en) 2022-10-24 2022-10-24 Deep multi-instance weak supervision text classification method based on semantics

Publications (2)

Publication Number Publication Date
CN115563284A CN115563284A (en) 2023-01-03
CN115563284B true CN115563284B (en) 2023-06-23

Family

ID=84767548

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211301646.4A Active CN115563284B (en) 2022-10-24 2022-10-24 Deep multi-instance weak supervision text classification method based on semantics

Country Status (1)

Country Link
CN (1) CN115563284B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108108849A (en) * 2017-12-31 2018-06-01 厦门大学 A kind of microblog emotional Forecasting Methodology based on Weakly supervised multi-modal deep learning
CN114140786A (en) * 2021-12-03 2022-03-04 杭州师范大学 Scene text recognition method based on HRNet coding and double-branch decoding
CN114722835A (en) * 2022-04-26 2022-07-08 河海大学 Text emotion recognition method based on LDA and BERT fusion improved model

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8145677B2 (en) * 2007-03-27 2012-03-27 Faleh Jassem Al-Shameri Automated generation of metadata for mining image and text data
CA2905996C (en) * 2013-03-13 2022-07-19 Guardian Analytics, Inc. Fraud detection and analysis
CN108073677B (en) * 2017-11-02 2021-12-28 中国科学院信息工程研究所 Multi-level text multi-label classification method and system based on artificial intelligence
CN108595632B (en) * 2018-04-24 2022-05-24 福州大学 Hybrid neural network text classification method fusing abstract and main body characteristics
CN109241377B (en) * 2018-08-30 2021-04-23 山西大学 Text document representation method and device based on deep learning topic information enhancement
CN109977413B (en) * 2019-03-29 2023-06-06 南京邮电大学 Emotion analysis method based on improved CNN-LDA
CN111444342B (en) * 2020-03-24 2021-12-10 湖南董因信息技术有限公司 Short text classification method based on multiple weak supervision integration
CN111695466B (en) * 2020-06-01 2023-03-24 西安电子科技大学 Semi-supervised polarization SAR terrain classification method based on feature mixup
US20230306050A1 (en) * 2020-08-05 2023-09-28 Siemens Aktiengesellschaft Decarbonizing BERT with Topics for Efficient Document Classification
CN113140020B (en) * 2021-05-13 2022-10-14 电子科技大学 Method for generating image based on text of countermeasure network generated by accompanying supervision
CN114139641B (en) * 2021-12-02 2024-02-06 中国人民解放军国防科技大学 Multi-modal characterization learning method and system based on local structure transfer
CN114297390B (en) * 2021-12-30 2024-04-02 江南大学 Aspect category identification method and system in long tail distribution scene
CN115114437A (en) * 2022-06-27 2022-09-27 山东师范大学 Gastroscope text classification system based on BERT and double-branch network


Also Published As

Publication number Publication date
CN115563284A (en) 2023-01-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant