CN115563284B - Deep multi-instance weak supervision text classification method based on semantics

Info

Publication number: CN115563284B (granted); other version: CN115563284A
Application number: CN202211301646.4A
Authority: CN (China)
Original language: Chinese (zh)
Legal status: Active (granted)
Inventors: 刘小洋, 尹娟
Applicant and assignee: Chongqing University of Technology
Application filed by Chongqing University of Technology; priority to CN202211301646.4A; published as CN115563284A; application granted and published as CN115563284B.

Classifications

    • G06F 16/35: Information retrieval of unstructured textual data; clustering; classification
    • G06F 16/9536: Retrieval from the web; querying by web search engines; search customisation based on social or collaborative filtering
    • G06F 40/279: Natural language analysis; recognition of textual entities
    • G06F 40/284: Natural language analysis; lexical analysis, e.g. tokenisation or collocates
    • G06F 40/30: Handling natural language data; semantic analysis
    • G06N 3/02: Computing arrangements based on biological models; neural networks
    • G06N 3/08: Neural networks; learning methods
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides a semantic-based deep multi-instance weakly supervised text classification method, comprising the following steps: S1, organizing a plurality of comment texts under the same piece of social content into a text package and assigning labels to the text packages, thereby obtaining topic-related packages; S2, extracting keywords that represent the topic from the topic-related packages and constructing a topic relevance vector from these keywords; and S3, feeding the topic relevance vector and the word vectors into a dual-branch neural network as vector pairs and predicting each text instance with the dual-branch network, obtaining the category of the text instance and the category of the package. The method can classify text information effectively even when social media text data changes rapidly, is difficult to annotate, and labeled data is severely lacking.

Description

Deep multi-instance weak supervision text classification method based on semantics
Technical Field
The invention relates to the technical field of natural language processing, in particular to a deep multi-instance weak supervision text classification method based on semantics.
Background
With the development of the internet and social media, massive amounts of text data are generated every day, while data analysts usually only care about data related to their own domain or to a specific topic, so domain- or topic-related data must be filtered out of this mass of data. During data crawling, the crawling rules are deliberately set loosely in order to obtain as much data as possible; this guarantees the richness of the data but also introduces a lot of topic-irrelevant data. Before analysis, the data truly related to the topic must be filtered out to ensure the accuracy of the analysis. This task can be viewed as a binary classification problem: is a text related to the topic or not? If a corresponding binary-labeled dataset existed, this would simply be a supervised text classification problem. However, in the internet era many new online expressions appear every year and natural language changes much faster, which means labeled data loses its timeliness quickly. Meanwhile, topics in social media shift with events and content is updated frequently; unless a large amount of labeled data is continuously refreshed, the labeled data can differ greatly from the data produced by new events. The prior art does not provide a text classification method for this scenario, in which social media text data changes rapidly, is difficult to annotate, and labeled data is severely lacking.
Disclosure of Invention
The invention aims to solve at least the above technical problems of the prior art, and in particular creatively provides a semantic-based deep multi-instance weakly supervised text classification method.
In order to achieve the above object, the invention provides a semantic-based deep multi-instance weakly supervised text classification method comprising the following steps:
S1, organizing a plurality of comment texts under the same piece of social content into a text package, and automatically assigning labels to the text packages by exploiting the hierarchical structure and topic relevance of social media data, thereby obtaining topic-related packages;
S2, extracting keywords that represent the topic from the topic-related packages, and constructing a topic relevance vector from these keywords; constructing the topic relevance vector avoids data imbalance and reduces collection and computation costs;
and S3, feeding the topic relevance vector and the word vectors into the dual-branch neural network as vector pairs, and predicting each text instance with the dual-branch network to obtain the category of the text instance and the category of the package.
Further, the step S2 includes the steps of:
s2-1, clustering topic related packages into a plurality of topics through an LDA algorithm, and extracting topic keywords;
S2-2, embedding and representing each keyword in the topic by adopting a fasttext model, and taking a vector average value of the topic strong-correlation keywords as a topic correlation vector;
The keywords of the topic are embedded as word vectors $v_1, v_2, \ldots, v_K$, and the topic relevance vector is thus expressed as:

$$V_T = \frac{1}{K}\sum_{k=1}^{K} v_k$$

where $V_T$ denotes the topic relevance vector and $K$ denotes the total number of strongly topic-related keywords.
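As an illustration only (not part of the patent text), the following sketch shows how such a topic relevance vector could be computed as the mean of keyword embeddings, assuming the `fasttext` Python package and a pre-trained model are available; the model path and keyword list are hypothetical.

```python
# Sketch: average keyword embeddings into a topic relevance vector V_T.
import numpy as np
import fasttext

model = fasttext.load_model("cc.en.300.bin")          # hypothetical pre-trained fasttext model
topic_keywords = ["necklace", "bracelet", "pendant"]   # hypothetical strongly related keywords

keyword_vectors = np.stack([model.get_word_vector(w) for w in topic_keywords])
V_T = keyword_vectors.mean(axis=0)                     # V_T = (1/K) * sum_k v_k
```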
Further, the method further comprises: converting each vector pair into a dense vector and inputting it into the dual-branch neural network.
The dense vector is obtained by taking the element-wise product of the word vector $h_i$ and the topic relevance vector $V_T$, adding it to the word vector, and concatenating the result with the original word vector:

$$\tilde{h}_i = \big[\, h_i,\; h_i \odot V_T + h_i \,\big]$$

where $\tilde{h}_i$ is the superimposed word vector, which is the input of the dual-branch neural network; $[\cdot,\cdot]$ denotes the concatenation of two vectors; $h_i$ denotes the word vector of the $i$-th word; $\odot$ denotes element-wise (bitwise) multiplication; and $V_T$ denotes the topic relevance vector.
The input of the dual-branch neural network can thus be expressed as:

$$x_{ij} = \big[\, \tilde{h}_1, \tilde{h}_2, \ldots, \tilde{h}_L \,\big]$$

where $x_{ij}$ is the $j$-th text instance of the $i$-th package as input to the dual-branch neural network; $\tilde{h}_1, \tilde{h}_2, \ldots, \tilde{h}_L$ denote the first, second, ..., $L$-th superimposed word vectors; $L$ denotes the number of words contained in the text; and $[\cdot, \ldots, \cdot]$ denotes a set of vectors.
Practice shows that adding the element-wise product of a word vector $h_i$ and $V_T$ to the word vector helps the neural network extract features and classify.
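A minimal numpy sketch of this input construction, assuming `word_vectors` holds the L×D word embeddings of one text and `V_T` the topic relevance vector; all names are illustrative.

```python
# Sketch: build the superimposed word vectors [h_i, h_i*V_T + h_i] for one text instance.
import numpy as np

def build_instance_input(word_vectors: np.ndarray, V_T: np.ndarray) -> np.ndarray:
    """word_vectors: (L, D) embeddings of one text; V_T: (D,) topic relevance vector.
    Returns the (L, 2D) network input for this text instance."""
    interaction = word_vectors * V_T            # element-wise product h_i * V_T (broadcast over words)
    superimposed = interaction + word_vectors   # h_i * V_T + h_i
    return np.concatenate([word_vectors, superimposed], axis=1)
```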
Further, the following operations are performed in the dual-branch neural network:
A hidden variable $Z = \{z_{ij}\}$ is introduced to characterize the relationship between text instances and packages, where $z_{ij}$ denotes the contribution of the $j$-th instance of the $i$-th package to package $i$ being positive, with $0 \le z_{ij} \le 1$. If $Z$ obeys the distribution $p(Z)$, the probability that the $i$-th package is a positive package can be expressed as:

$$p(Y_i = 1 \mid X_i) = f_{j\in\{1,\ldots,N\}}\big\{\, p_\theta(y_{ij}=1 \mid x_{ij}, z_{ij}) \cdot [z_{ij} - \gamma] \,\big\} \tag{7}$$

where $X_i$ denotes the $i$-th package; $Y_i$ denotes the label of the $i$-th package; $f$ is the mapping operator from text instances to the package; $N$ denotes the number of text instances in the package; $p_\theta(y_{ij}=1 \mid x_{ij}, z_{ij})$ denotes the probability that instance $x_{ij}$ is predicted as 1; $y_{ij}$ denotes the label of the $j$-th text instance in the $i$-th package; $x_{ij}$ denotes the $j$-th text instance of the $i$-th package; $z_{ij}$ denotes the contribution of the $j$-th instance of the $i$-th package to package $i$ being positive; and $\gamma$ is the average proportion of positive instances in a package.
The classification of the package is thereby linked to the classification of its text instances, so that the classification of each text instance can be learned from the classification of the package.
Further, $f$ is the mean operator. In the problem scenario addressed by the invention, the positive instances contained in a positive package are not sparse; using the maximum or an attention mechanism as the mapping operator easily produces falsely predicted positive packages and reduces accuracy, so the mean operator is adopted.
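For illustration, a small sketch of how the package-level probability of equation (7) could be aggregated with the mean operator from per-instance outputs; `instance_probs` and `contributions` are assumed to come from the two branches, and the names are not from the patent.

```python
# Sketch: package probability per equation (7) with f = mean operator.
import torch

def bag_positive_probability(instance_probs: torch.Tensor,
                             contributions: torch.Tensor,
                             gamma: float) -> torch.Tensor:
    """instance_probs: (n,) values of p_theta(y_ij=1 | x_ij, z_ij) for one package;
    contributions: (n,) values of z_ij in [0, 1]; gamma: average proportion of positive instances."""
    scores = instance_probs * (contributions - gamma)   # p_theta(...) * [z_ij - gamma]
    return scores.mean()                                 # mean operator over the instances
```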
Further, the following operations are also performed in the dual-branch neural network:
In multi-instance text classification, the learning goal is to minimize the cross entropy of the packages:

$$L_i = -\big[\, Y_i' \log p(Y_i \mid X_i) + (1 - Y_i')\log\big(1 - p(Y_i \mid X_i)\big) \,\big] \tag{8}$$

where $L_i$ denotes the cross entropy of the $i$-th package; $p(Y_i \mid X_i)$ denotes the probability that package $X_i$ is predicted as $Y_i$ and is the output of branch one; $X_i$ is the input feature of the $i$-th text package, i.e. the input of branch one; $Y_i$ denotes the predicted value of the $i$-th text package; and $Y_i'$ denotes the label of the $i$-th text package.
For a positive package, $Y_i' = 1$ and $1 - Y_i' = 0$, so $L_i$ is expressed as:

$$L_i = -\log p(Y_i \mid X_i) \tag{9}$$

For a negative package, $Y_i' = 0$, so $L_i$ is expressed as:

$$L_i = -\log\big(1 - p(Y_i \mid X_i)\big) \tag{10}$$

All text instances in a negative package are negative, and when all $p_\theta(y_{ij} \mid x_{ij}, z_{ij})$ and $z_{ij}$ take the negative (zero) value, $L_i$ equals 0 and reaches its minimum.
For a positive package, minimizing equation (9) is equivalent to maximizing the likelihood $p(Y_i \mid X_i)$; substituting equation (7) into $\log p(Y_i \mid X_i)$ gives equation (11), and equation (12) then introduces variational inference:
$$\log p(Y_i \mid X_i) = \log f_{j}\big\{\, p_\theta(y_{ij}=1 \mid x_{ij}, z_{ij}) \cdot [z_{ij} - \gamma] \,\big\} \tag{11}$$

$$\log p(Y_i \mid X_i) = \log \mathbb{E}_{Z\sim q}\!\left[\frac{p(z)}{q(z)}\, p_\theta(y_{ij} \mid x_{ij}, z > \gamma)\right] \tag{12}$$

where $x_{ij}$ denotes the $j$-th text instance of the $i$-th package; $y_{ij}$ denotes the label of the $j$-th text instance in the $i$-th package; $z_{ij}$ denotes the contribution of the $j$-th instance of the $i$-th package to package $i$ being positive; $\gamma$ is the average proportion of positive instances in a package; $p_\theta(y_{ij} \mid x_{ij})$ denotes the probability that instance $x_{ij}$ is predicted as $y_{ij}$; $p(z)$ denotes the $p$ distribution of the contribution $z$; $p_\theta(y_{ij} \mid x_{ij}, z)$ denotes the probability that instance $x_{ij}$ is predicted as $y_{ij}$ given that its contribution is $z$; $p_\theta(y_{ij} \mid x_{ij}, z > \gamma)$ denotes the probability that instance $x_{ij}$ is predicted as $y_{ij}$ given that its contribution $z > \gamma$; $q(z)$ denotes the $q$ distribution of the contribution $z$; and $\mathbb{E}_{Z\sim q}[\cdot]$ denotes the expectation under $Z$ obeying the $q$ distribution.
Based on the idea of variational inference, this patent uses $q_\phi(z \mid x)$ to approximate $q(z)$.
Further, each branch of the neural network is any one of TextCNN, LSTM and Transformer.
Further, the method further comprises: S4, optimizing the network parameters of the dual-branch neural network:
S4-1, the E step takes KL minimization as its target and optimizes the parameter $\phi$; the objective function is:

$$L_E = \mathrm{KL}\big(\, q_\phi(z \mid x)\; \|\; p_\theta(z \mid x, Y) \,\big) \tag{15}$$

where $\mathrm{KL}(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x, Y))$ denotes KL minimization between $q_\phi(z \mid x)$ and $p_\theta(z \mid x, Y)$; $q_\phi(z \mid x)$ denotes the output of branch one of the dual-branch neural network, i.e. the category of the text instance; $p_\theta(z \mid x, Y)$ denotes the output of branch two of the dual-branch neural network, i.e. the link between the text instance and the package; $z$ denotes the contribution; $Y$ broadly refers to the category of the package; $x$ broadly refers to a text instance in the package; and $\theta$ and $\phi$ are the parameters of the two branches.
Equation (15) measures the difference between the distribution $q_\phi(z \mid x)$ and the distribution $p_\theta(z \mid x, Y)$; minimizing $L_E$ gradually draws distribution $q$ towards distribution $p$, thereby narrowing the gap between the lower bound and the evidence.
In the neural network determined by the parameter $\theta$, the true distribution obeyed by $z$ can be approximated by the posterior distribution $p_\theta(y \mid x)$:

$$L_E = \mathrm{KL}\big(\, q_\phi(z \mid x, Y=1)\; \|\; p_\theta(y \mid x) \,\big) \tag{16}$$

where $\mathrm{KL}(q_\phi(z \mid x, Y=1)\,\|\,p_\theta(y \mid x))$ denotes KL minimization between $q_\phi(z \mid x, Y=1)$ and $p_\theta(y \mid x)$; $q_\phi(z \mid x, Y=1)$ denotes the output of branch one of the dual-branch neural network under the condition $Y=1$, i.e. the category of the text instance; and $p_\theta(y \mid x)$ denotes the value calculated by the neural network determined by the parameter $\theta$ with $\theta$ fixed; for a negative package, $p_\theta(y \mid x)$ is 0 for every instance.
Equation (16) is a further evolution of equation (15). A negative package contains only irrelevant texts, i.e. in a negative package the category of every text instance is 0 and the contribution of every instance to the package is 0, so negative packages can be handled as supervised learning. Only positive packages then remain, i.e. $Y = 1$. Under this condition, the true distribution $p_\theta(z \mid x, Y)$ obeyed by $z$ is approximated by $p_\theta(y \mid x)$, which yields equation (16).
Thus:

$$L_E = \mathrm{KL}\big(\, q_\phi(z_{ij} \mid x_{ij}, Y_i = 1)\; \|\; p' \,\big) \tag{17}$$

where $\mathrm{KL}(\cdot\,\|\,p')$ denotes KL minimization against $p'$; $q_\phi(z_{ij} \mid x_{ij}, Y_i=1)$ denotes the output of branch one of the dual-branch neural network under the condition $Y_i=1$, i.e. the category of the text instance; $Y_i = 1$ indicates that the $i$-th package is positive; $x_{ij}$ is the $j$-th text in the $i$-th package; $y_{ij}$ indicates that the $j$-th instance in the $i$-th package makes a positive contribution to the package; and $p' = p_\theta(y \mid x)$ denotes the value calculated by the neural network with the parameter $\theta$ fixed, which is 0 for every instance of a negative package.
$p'$ is substituted for $p_\theta(y \mid x)$ in equation (17); since $\log p_\theta(y_{ij} \mid x_{ij})$ has the same monotonicity as $p_\theta(y \mid x)$, $\log p_\theta(y_{ij} \mid x_{ij})$ is used in place of $p_\theta(y \mid x)$, which speeds up convergence. When the package is negative, i.e. $Y = 0$, all probability values of the distribution are set to 0, which yields equation (17).
S4-2, M step fixed parameters
$\phi$: for the same text, the KL divergence between $q_\phi(z \mid x)$ and $p_\theta(z \mid x, Y)$ then stays unchanged, and the expectation is maximized by optimizing the parameter $\theta$. The expectation of the log-likelihood is expressed as:

$$L_M = \mathbb{E}_{Z\sim q}\big[\log p_\theta(y_{ij} \mid x_{ij}, z > \gamma)\big] \tag{18}$$

where $L_M$ denotes the expectation of the log-likelihood; $\mathbb{E}_{Z\sim q}[\cdot]$ denotes the expectation under $Z$ obeying the $q$ distribution; $p_\theta(y_{ij} \mid x_{ij}, z > \gamma)$ denotes the probability, given $z > \gamma$, that instance $j$ of text package $i$ is predicted as positive text after passing through the $\theta$ branch; $z$ denotes the contribution; and $\gamma$ is a hyperparameter representing the average proportion of positive text instances in all positive packages.
According to equation (7), $L_M$ can be split into two parts bounded by $z = \gamma$: for $z > \gamma$ only $y_{ij} = 1$ is meaningful, and for $z < \gamma$ only $y_{ij} = 0$ is meaningful. The cost function $L_M$ of the M step can therefore be further decomposed as:

$$L_M = \mathbb{E}_{Z\sim q,\, z > \gamma}\big[\log p_\theta(y_{ij}=1 \mid x_{ij})\big] + \mathbb{E}_{Z\sim q,\, z < \gamma}\big[\log p_\theta(y_{ij}=0 \mid x_{ij})\big] \tag{19}$$

where $\gamma$ is a hyperparameter representing the average proportion of positive text in all positive packages; $p_\theta(y_{ij}=1 \mid x_{ij})$ denotes the probability that text instance $j$ in package $i$ is positive text; $p_\theta(y_{ij}=0 \mid x_{ij})$ denotes the probability that text instance $j$ in package $i$ is negative text; $y_{ij}=1$ means that text instance $j$ in package $i$ is positive; and $y_{ij}=0$ means that text instance $j$ in package $i$ is negative. Equation (19) is obtained by splitting equation (18) at the boundary $z = \gamma$.
Equation (19) can be converted into a cross entropy:

$$L_M = y'_{ij}\log p_\theta(y_{ij} \mid x_{ij}) + (1 - y'_{ij})\log\big(1 - p_\theta(y_{ij} \mid x_{ij})\big) \tag{20}$$

In equation (20), $p_\theta(y_{ij} \mid x_{ij})$ has the same meaning as $p_\theta(y_{ij}=1 \mid x_{ij})$ in equation (19), namely the probability that the text is positive, and $1 - p_\theta(y_{ij} \mid x_{ij})$ denotes the probability that the text is negative, consistent with $p_\theta(y_{ij}=0 \mid x_{ij})$. Equation (20) is a discretized version of equation (19), discretized into a cross entropy.
Here $y'_{ij}$ is the pseudo label of $y_{ij}$: in positive packages it is determined by $z$, and in negative packages it is always 0:

$$y'_{ij} = \begin{cases} 1, & Y_i = 1 \ \text{and}\ z_{ij} > \gamma \cdot \mathrm{mean}(z_{i\cdot}) \\ 0, & \text{otherwise} \end{cases} \tag{21}$$

where $\mathrm{mean}(\cdot)$ denotes averaging and $\gamma$ is the average proportion of positive instances in a package.
The parameter optimization of the invention differs from the traditional EM algorithm: because variational inference is introduced, the E step computes the expectation by narrowing the gap between the evidence and the variational lower bound, optimizing the parameter $\phi$ while tightening the lower bound, and the M step maximizes the expectation by optimizing the parameter $\theta$.
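The alternating optimization described in S4 can be sketched as follows; this is an illustrative PyTorch-style outline only (not the patent's reference implementation), assuming both branches output per-instance probabilities, and all module, function and variable names are assumptions.

```python
# Sketch: alternating E-step / M-step optimization of the two branches.
import torch

def bernoulli_kl(q, p, eps=1e-8):
    """KL(q || p) for per-instance Bernoulli probabilities."""
    return (q * torch.log((q + eps) / (p + eps))
            + (1 - q) * torch.log((1 - q + eps) / (1 - p + eps))).mean()

def train_epoch(theta_branch, phi_branch, bags, opt_theta, opt_phi, gamma):
    for x, bag_label in bags:                             # x: all instances of one package
        # E step: fix theta, pull q_phi(z|x) towards the theta branch's instance predictions.
        with torch.no_grad():
            p_target = theta_branch(x)                    # p_theta(y_ij = 1 | x_ij)
            if bag_label == 0:
                p_target = torch.zeros_like(p_target)     # negative package: all instances are 0
        loss_e = bernoulli_kl(phi_branch(x), p_target)
        opt_phi.zero_grad(); loss_e.backward(); opt_phi.step()

        # M step: fix phi, build pseudo labels from z and optimize theta with cross entropy.
        with torch.no_grad():
            z = phi_branch(x)
            pseudo = (z > gamma * z.mean()).float() if bag_label == 1 else torch.zeros_like(z)
        loss_m = torch.nn.functional.binary_cross_entropy(theta_branch(x), pseudo)
        opt_theta.zero_grad(); loss_m.backward(); opt_theta.step()
```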
In summary, due to the adoption of the technical scheme, the invention has the following advantages:
1) By introducing hidden variables and variational inference, the dual-branch deep network is applied to traditional multi-instance learning, which effectively improves the instance-level classification performance in multi-instance text classification tasks.
2) The proposed weakly supervised text classification method SDMI exploits the characteristics of social media text data: the tags created by social media users or the categories into which the platform divides its content are used as weak supervision information, and the model is effectively optimized with this weak supervision, thereby addressing the pain points of social media text data, namely rapid change, difficult annotation and a severe shortage of labeled data.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the invention will become apparent and may be better understood from the following description of embodiments taken in conjunction with the accompanying drawings in which:
fig. 1 is a schematic diagram of a process of extracting topic keywords and calculating topic related vectors by an LDA model according to the present invention.
FIG. 2 is a schematic diagram of an example topic relevance learning model and a contribution learning model of the present invention.
Fig. 3 is a schematic diagram of the learning-speed trends of SDMI and of supervised learning; Fig. 3(a) shows the trend of the prediction accuracy Acc on the test set, and Fig. 3(b) shows the trend of the F1 score.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
Based on the objective facts described in the background, applying purely supervised text classification to the extraction of topic data from social media has high time and labor costs, so unsupervised or weakly supervised classification that does not require labeled data is more practical.
Moreover, although text data in social media is large in volume and changes quickly, it is generated by user behavior, so the data has certain correlations and a specific hierarchical structure, which provides clues for data filtering. For example, in an interest forum, a user usually opens a discussion thread in a forum whose theme is related to the topic, and most of the text content generated in that forum is related to the forum's theme, but there is also much content that is not topic-related. The theme of the whole forum is easy to obtain, but it is difficult to determine which individual pieces of data in the internal discussion are theme-related and which are not. If the data of one forum is regarded as a package, then the topic of the package is known, some of the texts in the package are related to the topic, and the remaining texts are not and are negative; the package therefore contains positive texts and possibly negative texts. This setting matches inexact supervision in weakly supervised learning, and according to the structure of the data the problem can be further framed as multi-instance learning (MIL) under inexact supervision.
Multi-instance learning splits into two branches according to the classification emphasis: classification of the package and classification of the entities within the package. In topic-related text data filtering, the goal is to extract as much topic-related text as possible from the packages using all available clues and to filter out the negative texts, so the emphasis of this problem is how to improve the classification of the text entities.
1. Related art
1.1 Multi-instance learning
Multi-instance learning is typically inexact supervision, i.e. the granularity of the labels is coarser than the granularity of the actual task; it was introduced by Dietterich et al. in the mid-1990s. Dietterich proposed the multi-instance concept, applied it to drug activity prediction, and proposed the APR learning rules. There are three alternative APR algorithms: the noise-tolerant standard algorithm, the "outside-in" algorithm and the "inside-out" algorithm, which essentially find a hyperplane boundary for the positive instances. Early on, MIL, like other machine learning algorithms, was mainly implemented with traditional statistical learning methods. MIL also has two different prediction emphases: the relatively easy prediction of the package itself and the more difficult prediction of the instances inside the package. For example, the EM-DD algorithm proposed by Zhang, Q. et al. combines the EM algorithm with Diverse Density to steer MIL towards predicting the instances in a package. Later, support vector machine (SVM) methods were introduced into MIL; the typical MI-SVM algorithm proposed by S. Andrews et al. treats the MIL problem as a maximum-margin problem and extends the SVM learning method to a mixed-integer quadratic program that can be solved heuristically. Traditional MIL also includes sbMIL and stMIL proposed by Bunescu et al. and MICA proposed by Mangasarian et al.; some of these methods focus on intra-package instance prediction, but they are accurate only for the few positive packages with obvious features.
With the development of deep learning, deep neural networks have also been introduced into MIL. A typical approach inputs one package per batch; after the texts in the package are passed through the neural network to extract features, instance-level prediction probabilities are computed, the probabilities of all texts in the package are combined by an operator into the probability of the package, and the package label is used as supervision to optimize the network. Ilse, M. et al. proposed a gated attention mechanism as the operator from instance-level prediction probabilities to the package-level prediction probability and applied this method to image recognition. Wang, Y., Li et al. compared the effects of five operators (max pooling, average pooling, linear softmax, exponential softmax and attention) in object localization. Shi, X. et al. incorporated attention into the loss on the basis of Ilse, M. et al. The attention mechanism usually focuses on the more salient features, so the above methods work well for packages containing sparse positive instances but perform poorly when there are many positive instances within a package. The optimization process of the invention works at the instance level, so the prediction effect is good whether the positive instances in a package are sparse or dense.
Luo, Z. et al. introduced a dual-branch neural network into MIL and optimized the two networks jointly with the EM algorithm, but their method is applied to action localization, both branches are optimized with cross entropy as the loss function, and this approach has not been ideal for text classification. Li, B. et al. applied dual-branch MIL neural networks to lesion localization in medical images: one branch extracts features of the large image (the package) and the other extracts features of the small images (the instances), and the two are fused before the large image is classified. The invention also introduces a dual-branch neural network into multi-instance learning, but applies it to text classification and introduces variational inference to convert the classification of the package into instance classification prediction and hidden-variable distribution prediction, optimizing the dual-branch network with cross entropy and KL divergence.
1.2 Weakly supervised text classification
Text classification is currently dominated by supervised learning, but as the internet generates more and more data, weakly supervised methods are continuously being tried. Hingmire, S. et al. proposed pre-assigning one or more labels to topics extracted by LDA using prior knowledge of corpus statistics, and then classifying documents according to their topic proportions. In this method the classification process is completely limited by the extraction rules and the range of the training data, and the classification effect is poor. Meng, Y. et al. proposed generating labeled pseudo-documents from seed information, training a neural network model with the pseudo-documents while self-training the model with real data, with the two models sharing network parameters for text classification. This method is similar to that of Hingmire, S. et al.: noise is introduced when pseudo-labeled data is generated by rules, and by the "garbage in, garbage out" principle of machine learning the classification effect is limited. Li, C. et al. construct a topic semantic space by extracting topic-related keywords, then train a model on existing topic-labeled data to judge the relevance between a text and the topic space; when a topic without labeled data appears, only its topic-related keywords need to be extracted and the model predicts the topic relevance of the text. This approach is more advanced than the first two, but it only solves the problem of effectively predicting topics that may appear in the future; unlike the problem the invention intends to solve, it still requires the support of a large amount of training data covering a large number of known topics. The problem to be solved by the invention is to classify the topic relevance of texts when no accurately labeled data is available at all, so the invention proposes to use the tags provided by users on social platforms, or the classification labels the platforms assign to communities, as weak supervision signals, and by introducing the multi-instance idea it solves the problem of accurately classifying each text when no precisely labeled training data exists.
2. The proposed method
The invention adopts deep learning to realize end-to-end multi-instance text classification and performs binary topic-relevance classification on social media text data. In the definition of multi-instance learning, an instance is the smallest individual; in the invention an instance is a single text, a package contains multiple instances, and there are two kinds of packages, positive and negative: a positive package contains at least one positive text, while a negative package cannot contain any positive text. In the invention, several texts that have a certain relationship in social media are grouped into one package (for example, all comments under one video on Bilibili can form a text package, and all replies to one post in a Tieba forum can also form a package). When classifying a topic, content related to the topic is retrieved from social media and all of it is extracted to form several positive packages. When collecting data, the labels given by the social media platform are used as the topic labels of the packages (such as the category of a forum, or the video category tags uploaded by users). According to the definition of multi-instance learning, in order to accomplish weakly supervised instance-level classification, data under the topic is collected as positive packages, and text data from other topics is collected to form negative packages, which serves the purpose of contrastive learning. Since the task aims at binary relevance classification of the text instances in the positive packages, the label of the package is a weak supervision signal, and the binary classification of instances is completed in a weakly supervised manner.
To better introduce topic features, in addition to end-to-end deep-learning multi-instance text classification, the invention also extracts and embeds topic keywords to obtain the key topic features, i.e. the keywords, and introduces these key features into deep learning (DL) by a statistical learning method before learning. In natural language, every topic has a number of unique representative words; because the contexts and semantics of these representative keywords are similar, they are located close to each other in a dense vector space, and the small local region of the vector space containing all the keyword positions can be used to represent the vector space of the topic. The invention exploits this property: the texts in all packages under the same topic are segmented into words, and the strongly topic-related keywords are extracted with a seeded Latent Dirichlet Allocation algorithm (hereinafter guided-LDA) combined with manual selection of topic-related categories, serving as topic representative words to construct the strongly related vector of topic T. The construction process of the topic relevance vector is shown in Fig. 1.
The reference vector of topic T and the dense vector of the vectorized text instance, which lie in the same space, are used as an input pair and fed into a neural network model with two branches for prediction: one branch predicts whether the text instance contributes to the topic relevance of the package, and the other branch predicts whether the text instance is related to the topic.
The forward propagation architecture of the whole SDMI is shown in Fig. 2: all texts of one package form one batch; the texts are vectorized, combined with the topic relevance vector via the element-wise product, and integrated into the input data, which is fed into the two neural network branches respectively. Each neural network branch can be a convolutional neural network or another type of neural network layer; the invention selects convolutional neural networks in the experiments and in practical applications. The outputs of the different layers are fused and flattened into a one-dimensional vector, converted into category predictions by fully connected layers, and then the predictions are converted into predicted probabilities by softmax. The above is the forward computation of the whole network. In actual classification only the forward pass needs to be computed; during training, the forward pass is computed first and then the loss (cost) of the network prediction is computed. The whole training process is divided into an E step and an M step: in the E step the p-branch parameters are fixed and the q branch is optimized with the KL divergence as the cost function; in the M step the q-branch parameters are fixed and the p-branch parameters are optimized with the cross entropy as the cost function. Here KLD is the abbreviation of Kullback-Leibler divergence (KL divergence), denoted by L_E, and CE is the abbreviation of cross-entropy, denoted by L_M.
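As an illustration of the forward architecture just described, the following is a sketch under the assumption that both branches are small TextCNNs; the layer sizes, the use of a sigmoid output for the binary case, and all names are assumptions rather than the patent's reference implementation.

```python
# Sketch: a dual-branch TextCNN; one branch scores the topic relevance of each instance
# (p branch) and the other scores its contribution to the package (q branch).
import torch
import torch.nn as nn

class TextCNNBranch(nn.Module):
    def __init__(self, emb_dim=600, num_filters=64, kernel_sizes=(3, 4, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, num_filters, k, padding=k // 2) for k in kernel_sizes])
        self.fc = nn.Linear(num_filters * len(kernel_sizes), 1)

    def forward(self, x):                        # x: (n_instances, L, emb_dim)
        x = x.transpose(1, 2)                    # -> (n_instances, emb_dim, L) for Conv1d
        feats = [conv(x).max(dim=2).values for conv in self.convs]   # pool over word positions
        fused = torch.cat(feats, dim=1)          # fuse and flatten the outputs of the layers
        return torch.sigmoid(self.fc(fused)).squeeze(1)              # per-instance probability

class SDMIModel(nn.Module):
    def __init__(self, emb_dim=600):
        super().__init__()
        self.p_branch = TextCNNBranch(emb_dim)   # predicts p_theta(y_ij = 1 | x_ij)
        self.q_branch = TextCNNBranch(emb_dim)   # predicts the contribution z_ij

    def forward(self, bag):                      # bag: all instances of one package, as one batch
        return self.p_branch(bag), self.q_branch(bag)
```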
In order to optimize the parameters of the word-embedding vector space and the parameters of the two neural network branches, the invention uses the Kullback-Leibler divergence (KL divergence) and the lower bound (LB) as the loss functions of the two branches respectively, and adjusts the network parameters by minimizing the KL divergence and maximizing the LB.
In order to better realize the binary topic-relevance classification of text instances, keywords representing the topic are first extracted from the topic-related packages and a topic relevance vector is constructed from them; the topic relevance vector and the text instance vector are then fed into the neural network as a vector pair, the dual-branch neural network performs relevance prediction on the instances, and the category of each instance and the category of the package are then obtained from the instance predictions.
2.1 Topic relevance vector construction
A characteristic of the MIL task is that the supervision provided by the labels is very weak. In practice it is found that, when the training dataset is built from topic-related text packages and topic-unrelated text packages, the weak label supervision and the weak controllability of the neural network learning process mean that, without any topic constraint, the instance classifier eventually tends to predict the texts in negative packages as well as the texts in positive packages as topic-related. If the negative packages of the training dataset are large enough, they can cover most non-topic-related text and the problem is alleviated, but too much negative-package data easily causes data imbalance and greatly increases both the collection cost and the computation cost. Therefore, to avoid this problem, the invention adds the construction of topic relevance vectors.
Extracting topic keywords with an LDA algorithm has been proposed before, and the invention adopts this approach as well: all text instances in the positive packages of the training data are clustered into several topics by the LDA algorithm, and then a set of topics related to topic $T$, $\{t_1, \ldots, t_l\}$, and a set of unrelated topics, $\{t'_1, \ldots, t'_m\}$, are screened out. To minimize the effort of manual intervention, $l$ and $m$ should be as small as possible. Given the related topic set, the keywords of topic $T$ are ranked according to the ratio of their weights in the related topics to their weights in the unrelated topics, and the $K$ keywords most related to topic $T$ are obtained:

$$C = \{\, w_1^T, w_2^T, \ldots, w_K^T \,\} \tag{1}$$

where $C$ denotes the set of strongly related keywords of topic $T$ and $w_k^T$ denotes the $k$-th topic keyword.
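A possible implementation sketch of this keyword-selection step using gensim's LDA; the patent itself uses guided-LDA plus manual screening of related topics, so the library choice, thresholds and variable names here are illustrative assumptions.

```python
# Sketch: cluster positive-package texts into topics with LDA, then rank candidate keywords
# by their weight in manually screened related topics versus unrelated topics.
from gensim import corpora, models

def topic_keywords(tokenized_texts, related_ids, unrelated_ids,
                   num_topics=20, top_n=50, K=20):
    dictionary = corpora.Dictionary(tokenized_texts)
    corpus = [dictionary.doc2bow(t) for t in tokenized_texts]
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=num_topics)

    def weight(word, topic_ids):
        total = 0.0
        for tid in topic_ids:
            total += dict(lda.show_topic(tid, topn=top_n)).get(word, 0.0)
        return total

    candidates = {w for tid in related_ids for w, _ in lda.show_topic(tid, topn=top_n)}
    ranked = sorted(candidates,
                    key=lambda w: weight(w, related_ids) / (weight(w, unrelated_ids) + 1e-9),
                    reverse=True)
    return ranked[:K]      # the K keywords most related to topic T (set C in eq. (1))
```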
Table 1 shows the related keywords of the "Jewelry" and "Beauty" topics extracted from the Amazon Review dataset, and Table 2 shows the related keywords of the "Car" and "Sports" topics from the Toutiao News dataset.
TABLE 1. "Jewelry" and "Beauty" topic keywords extracted from the Amazon dataset (the keyword lists are reproduced only as images in the original publication).
TABLE 2. Keywords of the "Car" and "Sports" topics extracted from the Toutiao news dataset (the keyword lists are reproduced only as images in the original publication).
The invention adopts the fasttext model proposed by Joulin, A. et al. to embed each keyword of the topic and takes the average of the word vectors of the strongly related keywords as the topic relevance vector. The keywords of the topic, $w_1^T, \ldots, w_K^T$, are vectorized by word embedding into $v_1, \ldots, v_K$, and the topic relevance vector can be expressed as:

$$V_T = \frac{1}{K}\sum_{k=1}^{K} v_k \tag{2}$$

where $K$ denotes the total number of strongly topic-related keywords and $V_T$ denotes the topic relevance vector.
It is well known that dense word embeddings, in addition to mapping words into a vector space, contain syntactic and semantic information, and hidden correlations between two words can be obtained by simple operations such as subtracting two word vectors, taking their inner product, or computing their Euclidean distance. Mikolov, T. et al. proposed that subtracting two word vectors yields the relationship between the two words; Li, C.L., Zhou et al. proposed using word-vector subtraction and the element-wise product simultaneously to represent the interaction between two words. In this task, practice shows that the word vector of the $i$-th word in the text, namely the text instance vector
$h_i$, when its element-wise product with the topic vector $V_T$ is added to the word vector, better helps the neural network extract features and classify. The embedding of the text instance is therefore performed in the following manner:

$$\tilde{h}_i = \big[\, h_i,\; m_i + h_i \,\big] \tag{3}$$

where $\tilde{h}_i$ is the superimposed word vector, the input of the dual-branch neural network; $[\cdot,\cdot]$ denotes the concatenation of two vectors, so if the original $h_i$ is $D$-dimensional, $\tilde{h}_i$ is $2D$-dimensional; and $m_i$ is the element-wise product of $h_i$ and $V_T$:

$$m_i = h_i \odot V_T \tag{4}$$

where $\odot$ denotes element-wise (bitwise) multiplication.
According to the above calculations, the input of the dual-branch neural network can be expressed as:

$$x_{ij} = \big[\, \tilde{h}_1, \tilde{h}_2, \ldots, \tilde{h}_L \,\big] \tag{5}$$

where $L$ denotes the number of words contained in the text and $[\cdot, \ldots, \cdot]$ denotes a set or matrix of vectors.
3.2 Dual-branch prediction neural network
MIL is weakly supervised learning. The whole task is divided into two stages: classification of the package and classification of the instances. Because every package in the training data has a label, the classification of packages can be computed as supervised learning, whereas the instances have no labels and can only be weakly supervised through the package labels. Different tasks emphasize the two stages differently; Maron, O. et al. state that if the task aims at instance classification and the positive instances in the positive packages are not too sparse, the dependence of the training process on package labels should be reduced as much as possible. In the task addressed by this patent, the classification of instances is the focus and in the vast majority of packages the proportion of positive instances is higher than 50%, so this patent focuses on characterizing each instance.
Assume there are $2N$ text packages: the first $N$ packages, related to topic $T$, are positive packages, denoted $[X_1, X_2, \ldots, X_N]$, and the last $N$ packages are collected from other topics, unrelated to topic $T$, denoted $[X_{N+1}, X_{N+2}, \ldots, X_{2N}]$; each package contains $n$ text instances. The label of the $i$-th package is $Y_i$, and the $j$-th text in the $i$-th package is denoted $x_{ij}$; after preprocessing, the text serves as the input feature value of the neural network. The invention links the category of the package to the category of its text instances, thereby learning the category of each text instance from the category of the package. A hidden variable $Z = \{z_{ij}\}$ is introduced to characterize the relationship between text instances and packages, where $z_{ij}$ denotes the contribution of the $j$-th instance of the $i$-th package to package $i$ being positive, with $0 \le z_{ij} \le 1$. Assuming $Z$ obeys the distribution $p(z)$:

$$z_{ij} \sim p(z) \tag{6}$$

the probability that the $i$-th package $Y_i$ is positive can be expressed as:

$$p(Y_i = 1 \mid X_i) = f_{j\in\{1,\ldots,n\}}\big\{\, p_\theta(y_{ij}=1 \mid x_{ij}, z_{ij}) \cdot [z_{ij} - \gamma] \,\big\} \tag{7}$$

where $p_\theta(y_{ij}=1 \mid x_{ij}, z_{ij})$ denotes the probability that instance $x_{ij}$ is predicted as 1 (i.e. positive), $\gamma$ is the average proportion of positive instances in a package, and $f$ is the mapping operator from text instances to the package. Common choices of $f$ are the maximum, the mean and attention mechanisms; because the positive instances contained in a positive package are not sparse in the problem addressed by this patent, using the maximum or an attention mechanism as the mapping operator easily produces falsely predicted positive packages and reduces accuracy, so the mean operator is adopted.
3.2.1 Variational inference
In the multi-instance text classification to be solved by the invention, the learning goal is to minimize the cross entropy of the packages:

$$L_i = -\big[\, Y_i' \log p(Y_i \mid X_i) + (1 - Y_i')\log\big(1 - p(Y_i \mid X_i)\big) \,\big] \tag{8}$$

where $L_i$ denotes the cross entropy of the $i$-th package; $p(Y_i \mid X_i)$ denotes the probability that package $X_i$ is predicted as $Y_i$ and is the output of branch one; $X_i$ is the input feature of the $i$-th text package, i.e. the input of branch one; $Y_i$ denotes the predicted value of the $i$-th text package; and $Y_i'$ denotes the label of the $i$-th text package.
For a positive package, $Y_i' = 1$ and $1 - Y_i' = 0$, so $L_i$ can be expressed as:

$$L_i = -\log p(Y_i \mid X_i) \tag{9}$$

For a negative package, $Y_i' = 0$, so $L_i$ can be expressed as:

$$L_i = -\log\big(1 - p(Y_i \mid X_i)\big) \tag{10}$$

By definition, all text instances in a negative package are negative, and when all $p_\theta(y_{ij} \mid x_{ij}, z_{ij})$ and $z_{ij}$ take the negative (zero) value, $L_i$ equals 0 and reaches its minimum. Negative packages are therefore learned with supervised logic.
For a positive package, minimizing equation (9) is equivalent to maximizing the likelihood $p(Y_i \mid X_i)$. Substituting equation (7) into $\log p(Y_i \mid X_i)$ gives equation (11), and equation (12) introduces variational inference:

$$\log p(Y_i \mid X_i) = \log f_{j}\big\{\, p_\theta(y_{ij}=1 \mid x_{ij}, z_{ij}) \cdot [z_{ij} - \gamma] \,\big\} \tag{11}$$

$$\log p(Y_i \mid X_i) = \log \mathbb{E}_{Z\sim q}\!\left[\frac{p(z)}{q(z)}\, p_\theta(y_{ij} \mid x_{ij}, z > \gamma)\right] \tag{12}$$

Based on the idea of variational inference, this patent uses $q_\phi(z \mid x)$ to approximate $q(z)$; $q_\phi(z \mid x)$ is represented by one neural network, and $p_\theta(y_{ij} \mid x_{ij})$ is represented by another neural network. The overall method therefore introduces a dual-branch neural network: one branch predicts the class of the text instance and the other links the text instance to the package.
The task thus has two prediction emphases, $p_\theta(y_{ij}=1 \mid x_{ij})$ and $z_{ij}$. This patent accomplishes both predictions with a dual-branch neural network, whose branches can be TextCNN, LSTM, or even the currently popular Transformer. These network structures have been described thoroughly in the literature and are not repeated here; $p_\theta(y_{ij}=1 \mid x_{ij})$ and $q_\phi(z_{ij} \mid x_{ij})$ are used to denote the outputs of the two branches, where $\theta$ and $\phi$ are the parameters of the two branches.
3.3 Network parameter optimization
As can be seen from the description of the network structure in the previous section, for all positive packages the input is $X_i$ and the output is $p_\theta(Y_i \mid X_i)$, which can be supervised by the package labels; the learning goal of the whole network is to maximize the log-likelihood:

$$L = \log p(Y_i \mid X_i) \tag{13}$$
in conjunction with the Jensen inequality, equation (12) may be further derived:
$$\log p(Y_i \mid X_i) \ge \mathbb{E}_{Z\sim q}\!\left[\log\!\left(\frac{p(z)}{q(z)}\, p_\theta(y_{ij} \mid x_{ij}, z > \gamma)\right)\right] \tag{14}$$

According to the definition of the evidence lower bound (ELBO) in statistics, $\log p(Y \mid X)$ can be regarded as the evidence, $\mathbb{E}_{Z\sim q}[\log p_\theta(y_{ij}=1 \mid x_{ij})]$ is the variational lower bound, and the difference between the evidence and the variational lower bound is $\mathrm{KL}(q(z \mid x)\,\|\,p(z \mid x, Y))$.
Maximizing $p(Y_i \mid X_i)$ can be achieved by optimizing the parameters with an EM algorithm. In the traditional EM algorithm, the E step computes the expectation, i.e. the variational lower bound, and the M step searches for the parameters that maximize the expectation, thereby performing parameter estimation. Because this patent introduces variational inference, the E step computes the expectation by narrowing the gap between the evidence and the variational lower bound, optimizing the parameter $\phi$ while tightening the lower bound, and the M step maximizes the expectation by optimizing the parameter $\theta$.
3.3.1 E step: narrowing the difference
The goal of the E step is to narrow the difference between the lower bound and the evidence, bringing the lower bound close to the evidence. From equation (14), the smaller the KL divergence, the closer the lower bound is to the evidence $\log p(Y \mid X)$; the E step therefore takes KL minimization as its target and optimizes the parameter $\phi$ with the objective function:

$$L_E = \mathrm{KL}\big(\, q_\phi(z \mid x)\; \|\; p_\theta(z \mid x, Y) \,\big) \tag{15}$$

The classification of packages and of instances is a binary task in which $z$ and $y_{ij}$ are related; for example, the contribution $z$ of a negative text instance to a package judged positive can be regarded as 0. This patent therefore assumes that, in the neural network determined by the parameter $\theta$, the true distribution obeyed by $z$ can be approximated by the posterior distribution $p_\theta(y \mid x)$:

$$L_E = \mathrm{KL}\big(\, q_\phi(z \mid x, Y=1)\; \|\; p_\theta(y \mid x) \,\big) \tag{16}$$

For the text instances in positive packages, $p_\theta(y=1 \mid x)$ is the value calculated by the neural network determined by the parameter $\theta$ with $\theta$ fixed, and for negative packages $p_\theta(y \mid x)$ should be 0 for every instance; therefore:

$$L_E = \mathrm{KL}\big(\, q_\phi(z_{ij} \mid x_{ij}, Y_i = 1)\; \|\; p_\theta(y_{ij} \mid x_{ij}) \,\big) \tag{17}$$

From the meaning of equation (17), $L_E$ can be understood as requiring that the contribution of a text instance to the package being predicted positive and the relevance of the instance to the topic obey, as far as possible, the same distribution.
3.3.2 M step: maximizing the lower bound to optimize θ
As defined by equation (14), the cost function of the whole problem consists of two parts: the expectation of the log-likelihood of the predicted probabilities, and the KL divergence between the hidden variable and the true distribution. The E step passively pushes the expectation towards the maximum likelihood by minimizing the KL divergence, fixing the parameter $\theta$ and adjusting the parameter $\phi$; the M step fixes the parameter $\phi$, so that for the same text the KL divergence between $q_\phi(z \mid x)$ and $p_\theta(z \mid x, Y)$ is unchanged, and then maximizes the expectation of the log-likelihood by optimizing the parameter $\theta$. The expectation of the log-likelihood, $L_M$, is expressed as:

$$L_M = \mathbb{E}_{Z\sim q}\big[\log p_\theta(y_{ij} \mid x_{ij}, z > \gamma)\big] \tag{18}$$

According to equation (7), $L_M$ can be split into two parts bounded by $z = \gamma$: for $z > \gamma$ only $y_{ij} = 1$ is meaningful, and for $z < \gamma$ only $y_{ij} = 0$ is meaningful. The cost function $L_M$ of the M step can therefore be further decomposed as:

$$L_M = \mathbb{E}_{Z\sim q,\, z > \gamma}\big[\log p_\theta(y_{ij}=1 \mid x_{ij})\big] + \mathbb{E}_{Z\sim q,\, z < \gamma}\big[\log p_\theta(y_{ij}=0 \mid x_{ij})\big] \tag{19}$$

where $\gamma$ is a hyperparameter that measures how large a contribution counts as an effective contribution; in practical use it is set to the average proportion of positive text in all positive packages. The first part represents the log-likelihood when $z > \gamma$ and the second part the log-likelihood when $z < \gamma$.
Equation (19) can be converted into a cross entropy:

$$L_M = y'_{ij}\log p_\theta(y_{ij} \mid x_{ij}) + (1 - y'_{ij})\log\big(1 - p_\theta(y_{ij} \mid x_{ij})\big) \tag{20}$$

where $y'_{ij}$ is the pseudo label of $y_{ij}$: in positive packages it is determined by $z$, and in negative packages it is always 0:

$$y'_{ij} = \begin{cases} 1, & Y_i = 1 \ \text{and}\ z_{ij} > \gamma \cdot \mathrm{mean}(z_{i\cdot}) \\ 0, & \text{otherwise} \end{cases} \tag{21}$$

Here $\gamma$ is the average proportion of positive instances in a package, in the range $(0, 1)$; it is an empirical value determined by the density of positive instances in the dataset, independent of whether a package is positive, but related to the recall of the recoverable instances. $\mathrm{mean}(\cdot)$ denotes averaging; according to equation (19), $\gamma$ is the point at which $y_{ij}$ is cut, the mean of the hidden variable is introduced here, and multiplying by $\gamma$ normalizes the threshold so that it lies in the same numerical range as $z_{ij}$.
The optimization of $q_\phi$ is similar: when none of the instances in a negative package contributes to the topic relevance of the package, the contribution pseudo labels are set to 0, and in positive packages the instances with a low topic-relevance probability cannot have a high contribution, so their contribution pseudo labels are also set to 0.
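A small sketch, in the assumed form suggested by equation (21) and the paragraph above, of how the pseudo labels for the two branches could be built for one package; the thresholding of the contribution labels by γ is an assumption, and all names are illustrative.

```python
# Sketch: pseudo labels for the M step (y'_ij) and for the q branch (contribution labels).
import torch

def build_pseudo_labels(z: torch.Tensor, p_pos: torch.Tensor,
                        bag_is_positive: bool, gamma: float):
    """z: (n,) contributions from the q branch; p_pos: (n,) p_theta(y_ij=1|x_ij) from the p branch."""
    if not bag_is_positive:
        # Negative package: every instance is negative and contributes nothing.
        return torch.zeros_like(z), torch.zeros_like(z)
    y_pseudo = (z > gamma * z.mean()).float()   # eq. (21)-style threshold at gamma * mean(z)
    z_pseudo = (p_pos > gamma).float()          # low topic-relevance probability -> contribution label 0 (assumed rule)
    return y_pseudo, z_pseudo
```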
3 Analysis of experimental results
3.1 Datasets
The invention uses three datasets, AG News, Amazon Reviews and Toutiao News, to compare the effect of SDMI with that of other methods. The topics corresponding to AG News and Toutiao News are news categories; the news texts are published by official outlets, so their format and content are relatively standard. Amazon Reviews corresponds to products in subdivided domains and consists of users' subjective evaluations of the products; the topics it contains are finer-grained, and because the texts are user-generated their format and grammar are more casual, which simulates the user-generated content of social media. AG News and Amazon Reviews are English text and Toutiao News is Chinese data, which verifies the adaptability of SDMI to different languages.
TABLE 3. Description of the experimental datasets (the dataset statistics are reproduced only as an image in the original publication).
AG News: AG News contains 4 categories, Business, Sci_Tech, Sports and World; the training set contains 30000 texts per category and the test set 1900 texts per category. Each category of the training and test sets is processed independently: the texts of that category serve as topic-related text instances and the other 3 categories serve as topic-unrelated text instances. Each package contains 50 texts, and topic-related and unrelated instances are mixed at a randomly chosen ratio between 1:2 and 4:1 to form positive text packages; negative packages contain only negative text instances, and the texts in negative packages do not overlap with the negative texts in positive packages, thereby simulating the structure and topic-relevance state of texts in social media.
Amazon Reviews: Amazon Reviews has several versions; this patent adopts the classic 2013 version, which covers 24 product groups. 4 products with a moderate number of reviews are selected as potential positive topics; the reviews of each product serve as the positively related texts of that topic, and the reviews of other products are negative text instances. Text packages are combined in the same way as for AG News, and the text packages of each topic are split into a training set and a test set at a ratio of 4:1.
Toutiao News: Toutiao News contains 12 topics; 4 topics with a large amount of text are selected as potential positive topics. The texts under each topic are the relevant positive texts of that topic, data is randomly drawn from the other 11 topics as unrelated texts, and packages are combined in the same way as for AG News. The text packages of each topic are split into a training set and a test set at a ratio of 4:1.
After data preparation, as shown in Table 3, the 3 datasets together form 12 topic text-package collections.
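For illustration, a sketch of how text packages could be assembled from a topic-labeled corpus in the way just described (50 texts per package, positive share drawn between 1:2 and 4:1); de-duplication between negative packages and the negative texts of positive packages is omitted, and the function and variable names are not from the patent.

```python
# Sketch: assemble positive and negative text packages from topic-labeled texts.
import random

def build_bags(topic_texts, other_texts, n_bags, bag_size=50):
    """topic_texts: texts of the target topic; other_texts: texts of other topics."""
    positive_bags, negative_bags = [], []
    for _ in range(n_bags):
        ratio = random.uniform(1 / 3, 4 / 5)                  # positive share between 1:2 and 4:1
        n_pos = max(1, round(bag_size * ratio))
        bag = random.sample(topic_texts, n_pos) + random.sample(other_texts, bag_size - n_pos)
        random.shuffle(bag)
        positive_bags.append((bag, 1))
        negative_bags.append((random.sample(other_texts, bag_size), 0))
    return positive_bags, negative_bags
```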
3.2 Experimental procedure and parameter set-up
In the experiments, in order to show the advantages of the proposed SDMI under identical conditions, other unsupervised and weakly supervised methods are introduced for comparison in addition to experiments with the proposed method on the datasets; meanwhile, to evaluate the gap between SDMI and supervised methods that use manually labeled data, supervised text classification is also run with the same neural network structure and parameters.
3.2.1 Introduction to the comparison algorithms
guided-LDA: the guided-LDA adds seed keywords to each topic on the basis of the LDA, and the direction of clustering is partially constrained by the seed keywords. In the experiment, topic keywords extracted from each topic are used as seeds of part of clusters, the part of clusters are used as topic related clusters, and the clusters with empty seed words are used as topic uncorrelated clusters, so that the classification effect is achieved.
MISSVM and SbMIL: the MISSVM algorithm treats the MIL problem as a maximum-margin problem and extends the SVM learning method to solve a mixed-integer quadratic program. The algorithm solves for the maximum margin between positive and negative packets, treats the margin of a packet as the margin of its instances, and predicts the polarity of each instance. The method was originally applied in experiments on the MUSK data set; here the input feature extraction is modified so that it can be applied to text classification. SbMIL, like MISSVM, is built on the SVM algorithm.
Weighted-MIL: a multi-instance regression method that vectorizes each instance in a packet, estimates the weight of each instance vector for each category, computes the weighted average of all instances in the packet, and uses this weighted average to determine the category of the packet through an operator.
Attention base and Gated Attention base: packets are input in batches, the features of the texts in a packet are extracted by a neural network, instance-level prediction probabilities are computed, and the probability values of all texts in a packet are combined by an operator into a packet-level probability, which is used together with the packet label to optimize the network. Attention base uses an attention mechanism as the operator to aggregate text probability values into a packet probability value; Gated Attention base adds a gating mechanism on top of the attention mechanism.
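For reference, a minimal PyTorch sketch of the attention operator used by the Attention base baseline is given below; the layer sizes are illustrative assumptions rather than values fixed by this patent:

```python
import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    """Aggregate instance-level features of one packet into a packet-level probability."""

    def __init__(self, feat_dim=128, attn_dim=64):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, attn_dim),
            nn.Tanh(),
            nn.Linear(attn_dim, 1),
        )
        self.classifier = nn.Linear(feat_dim, 1)

    def forward(self, instance_feats):            # (n_instances, feat_dim)
        scores = self.attention(instance_feats)   # (n_instances, 1)
        weights = torch.softmax(scores, dim=0)    # attention weights over instances
        packet_feat = (weights * instance_feats).sum(dim=0)  # (feat_dim,)
        return torch.sigmoid(self.classifier(packet_feat))   # packet probability
```

In the gated variant, the Tanh branch is typically multiplied element-wise by a parallel Sigmoid branch before the final scoring layer.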
CNN-supervised: a general text convolution classification algorithm with words as the embedding units.
3.2.2 Experimental parameter settings
During the experiments, LDA topic clustering is first performed on the texts in the positive packets. Since the positive-to-negative instance ratio of the combined data of all positive packets is set to 3:2 in this task and the negative instances come from more than 10 topic categories, the number of clustered topics is set to 20. Related topics are determined by manual confirmation; the top-50 keywords of every topic are extracted, the ratio of each keyword's weight within the topic to its weight outside the topic is computed, and the top-20 topic keywords are selected according to this ratio ranking.
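One possible realization of this keyword-selection step, assuming a scikit-learn LDA over space-tokenized text and a hypothetical related_topic_ids list holding the manually confirmed topics, is sketched below:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def topic_keywords(positive_bag_texts, related_topic_ids, n_topics=20, top_n=20):
    """Cluster texts from positive packets and rank candidate topic keywords."""
    vectorizer = CountVectorizer(max_features=20000)
    counts = vectorizer.fit_transform(positive_bag_texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(counts)

    vocab = np.array(vectorizer.get_feature_names_out())
    word_topic = lda.components_                  # (n_topics, vocab_size) pseudo-counts
    keywords = {}
    for t in related_topic_ids:                   # topics confirmed manually as related
        in_topic = word_topic[t]
        out_topic = word_topic.sum(axis=0) - in_topic
        cand = np.argsort(in_topic)[::-1][:50]    # top-50 words of this topic
        ratio = in_topic[cand] / (out_topic[cand] + 1e-9)
        best = cand[np.argsort(ratio)[::-1][:top_n]]
        keywords[t] = vocab[best].tolist()        # top-20 by in-topic / out-of-topic weight ratio
    return keywords
```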
A dictionary is generated for each data set to determine the size of the network's word embedding layer, and the embedding layer of the neural network is initialized with the fastText English and Chinese pre-trained models.
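A sketch of such an initialization, assuming the official fasttext Python package, a pre-trained model file such as cc.en.300.bin, and a hypothetical word2idx dictionary built from the data-set vocabulary:

```python
import numpy as np
import torch
import torch.nn as nn
import fasttext

def build_embedding_layer(word2idx, model_path="cc.en.300.bin"):
    """Initialize the network's word embedding layer from a fastText model."""
    ft = fasttext.load_model(model_path)
    weights = np.zeros((len(word2idx), ft.get_dimension()), dtype=np.float32)
    for word, idx in word2idx.items():
        weights[idx] = ft.get_word_vector(word)   # OOV words fall back to subword vectors
    return nn.Embedding.from_pretrained(torch.from_numpy(weights), freeze=False)
```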
The neural network can adopt small network structures such as TextCNN and BiLSTM as the feature extractor, and the network parameters φ and θ are trained in the manner described in Section 2. In practical applications, pre-trained large language models such as BERT and RoBERTa can be used as initialization parameters and then fine-tuned with a small learning rate during training. The aim of the experiment is to compare SDMI against unsupervised methods, other weakly supervised methods, and purely supervised classification, rather than to explore how different classical neural networks perform on this task; since two branches of the neural network need to be trained, using a large pre-trained model would make the experiments very inefficient, so the completely identical TextCNN network structure is used to build the network in all methods.
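A compact TextCNN feature extractor of the kind that can serve as the shared backbone of both branches is sketched below; the kernel sizes, filter count and two-class output head are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Convolutional feature extractor over (superimposed) word vectors."""

    def __init__(self, embed_dim=300, n_filters=100, kernel_sizes=(3, 4, 5), n_classes=2):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, n_filters, k) for k in kernel_sizes]
        )
        self.fc = nn.Linear(n_filters * len(kernel_sizes), n_classes)

    def forward(self, x):                          # x: (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)                      # -> (batch, embed_dim, seq_len)
        feats = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(feats, dim=1))    # instance-level logits
```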
The hyper-parameter γ measures the proportion of positive instances in a positive packet. In the news data sets used in the experiments, because of the way the packets are constructed, the average proportion of positive instances is higher than that of negative instances, and γ is set to 0.4 according to this proportion. In real-world social media data applications, the value of γ is set according to the average proportion of topic-related text.
During training, both SDMI and the supervised binary classifier use a learning rate of 1e-5, and the maximum number of iterations is set to 200. Over-fitting detection on the test set is used to stop training early: for SDMI, training stops if the test-set loss does not decrease for 20 consecutive evaluations, and for the supervised binary classifier, training stops if the test-set loss does not decrease for 5 consecutive evaluations.
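The early-stopping rule can be sketched as follows; train_one_epoch and evaluate are hypothetical helpers standing in for the actual training and evaluation routines:

```python
import math
import torch

def train_with_early_stopping(model, train_loader, test_loader,
                              max_epochs=200, patience=20, lr=1e-5):
    """Stop training once the test-set loss has not decreased for `patience` epochs."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_loss, stale_epochs = math.inf, 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_loader, optimizer)   # hypothetical helper
        test_loss = evaluate(model, test_loader)          # hypothetical helper
        if test_loss < best_loss:
            best_loss, stale_epochs = test_loss, 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:                  # early stop
                break
    return model
```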
For the experimental settings of the other methods: the traditional multi-instance methods take tf-idf bag-of-words features as input; in the experiment, the CountVectorizer and TfidfTransformer modules of the sklearn library are used to extract the tf-idf features of each data set, which are then fed into the algorithms for training so that each text instance in the test-set text packets can be predicted. The other deep learning methods all adopt the same neural network structure and hyper-parameter settings as SDMI; the weakly supervised methods take the same text packets as input and use the packet labels as supervision to train the model, while CNN-supervised uses the labels of individual texts as supervision for training.
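The tf-idf feature extraction for the traditional multi-instance baselines can be sketched with the sklearn modules mentioned above:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

def tfidf_features(train_texts, test_texts):
    """tf-idf vectors used as input to the traditional multi-instance baselines."""
    vectorizer = CountVectorizer()
    transformer = TfidfTransformer()
    train_tfidf = transformer.fit_transform(vectorizer.fit_transform(train_texts))
    test_tfidf = transformer.transform(vectorizer.transform(test_texts))
    return train_tfidf, test_tfidf
```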
3.3 Experimental results
3.3.1 Performance analysis
(1) Evaluation index
Common model evaluation indices include Accuracy (hereinafter referred to as Acc), Precision, Recall, the F1 value and the like; the invention adopts the Acc and F1 indices to evaluate the effect of the algorithm model.
Acc measures the average prediction accuracy of the model over all test text instances:

Acc = (TP + TN) / (TP + TN + FP + FN)

where TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives, respectively.
Precision represents the proportion of true positive instances among the text instances predicted as positive, corresponding to the precision of the text instances:

Precision = TP / (TP + FP)
Recall represents the proportion of correctly predicted instances among all true positive instances, which corresponds to the recall ratio of the text instances:

Recall = TP / (TP + FN)
The F1 value considers both the Precision and the Recall of positive text instances; F1 is high only when Precision and Recall are both high:

F1 = 2 · Precision · Recall / (Precision + Recall)
Therefore, the invention adopts the Acc and F1 indexes to evaluate the model effect simultaneously.
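For reference, the four indices above reduce to a few lines of Python over binary instance-level predictions:

```python
def evaluation_metrics(y_true, y_pred):
    """Acc, Precision, Recall and F1 over binary instance-level predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    acc = (tp + tn) / max(len(y_true), 1)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return acc, precision, recall, f1
```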
(2) Performance comparative analysis
Experiments on the three different data sets, AG News, Toutiao News and Amazon Reviews, show that SDMI is greatly improved in Acc and F1 compared with unsupervised topic clustering and the other weakly supervised text classification methods, and that its gap with supervised classification, which relies entirely on annotated data, is very small.
TABLE 4 Accuracy and F1 value comparison on AG News
TABLE 5 Accuracy and F1 value comparison on Toutiao News
TABLE 6 Accuracy and F1 value comparison on Amazon Reviews
Tables 4-6 show that the performance of SDMI on the three data sets is clearly improved compared with the other unsupervised and semi-supervised methods. In principle, SDMI converts traditional multi-instance learning into prediction for individual texts through the dual-branch neural network and the EM algorithm, computes the loss function on individual texts and optimizes the network accordingly, and the experimental results confirm that this approach is suitable for instance-level prediction. The other methods, although intended for instance-level prediction, ultimately rely on packet-level optimization of the model, so their classification at the instance level is naturally less well controlled.
Meanwhile, the accuracy of SDMI over the topics in the test sets is only 3.219% lower than that of supervised classification, and its F1 value is only 2.602% lower. This shows that SDMI can, to a large extent, learn instance-level classification from the packet labels and the semantic features of the text.
In addition, the experimental results on the different data sets show that SDMI is suitable for screening topic-related data in different languages and at different topic granularities.
3.3.2 Training speed analysis
Besides prediction accuracy, the invention also compares the learning speed and the trend of the text recall values of the different methods during training. Fig. 3 shows the trends of the test-set prediction accuracy Acc and the recall rate during training for SDMI and the supervised classification method, using the Car topic of Toutiao News as the training and test data.
Fig. 3(a) shows the trend of the test-set prediction accuracy with the number of training iterations. It can be seen that SDMI, as a weakly supervised learning method whose overall architecture contains two deep learning networks with parameters optimized alternately in an E-M manner, indeed learns much more slowly than supervised learning under the same conditions. Fig. 3(b) is the recall-rate curve: the recall rate is more sensitive in the supervised approach, but in terms of trend the recall of SDMI converges more stably.
Although the supervised method does have an advantage in learning speed, in practical engineering applications its complete reliance on manually annotated data is a major drawback for social media text mining: producing a large amount of continuously updated annotated data carries a great cost in time and manpower. Compared with this cost, the slower learning speed during training is almost negligible.
In summary, the proposed SDMI is comparatively analyzed on different languages (Chinese and English), different text types and different topics; the new method is tested and evaluated on multiple topics of the AG News, Toutiao News and Amazon Reviews data sets, achieving the desired effect of weakly supervised text classification.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.

Claims (5)

1. The semantic-based deep multi-instance weak supervision text classification method is characterized by comprising the following steps of:
s1, organizing a plurality of comment texts under the same social content into a text package, and distributing labels to the text package, thereby obtaining a topic related package;
s2, extracting keywords representing topics from the topic related package, and constructing topic related vectors through the keywords;
s3, inputting topic related vectors and word vectors as vector pairs into a double-branch neural network, and predicting a text instance through the double-branch neural network to obtain the category of the text instance and the category of a package;
the following operations are performed in the dual-branch neural network:
introducing a hidden variable Z = {z_ij} to characterize the relationship between text instances and packets, where z_ij represents the contribution of the j-th instance of the i-th packet to packet i being a positive packet, and 0 ≤ z_ij ≤ 1; if Z obeys the distribution p(Z), the probability that the i-th packet is a positive packet can be expressed as:
p(Y_i = 1 | X_i) = f_{j∈{1,...,N}} { p_θ(y_ij = 1 | x_ij, z_ij) · [z_ij − γ] }   (7)
wherein f is a mapping operator from text instances to the packet, and f is taken as the mean operator;
N represents the number of instances in the packet;
p_θ(y_ij = 1 | x_ij, z_ij) represents the probability that instance x_ij is predicted as 1;
the following operations are also performed in the dual-branch neural network:
in multi-instance text classification, the goal of learning is to minimize the cross entropy of the packets:
L_i = −[ Y_i′ log p(Y_i | X_i) + (1 − Y_i′) log(1 − p(Y_i | X_i)) ]   (8)
wherein L_i represents the cross entropy of the i-th packet;
p(Y_i | X_i) represents the probability that X_i is predicted as Y_i and is the output of branch one;
X_i is the input feature of the i-th text packet and the input of branch one;
Y_i represents the predicted value of the i-th text packet;
Y_i′ represents the annotation of the i-th text packet;
for a positive packet, Y_i′ = 1 and 1 − Y_i′ = 0, thus L_i is expressed as:

L_i = −log p(Y_i | X_i)
for a negative packet, Y_i′ = 0, thus L_i is expressed as:

L_i = −log(1 − p(Y_i | X_i))
all text instances in a negative packet are negative instances, and when all p_θ(y_ij | x_ij, z_ij) and z_ij take the values of negative instances, L_i equals 0, reaching its minimum value;
for a positive packet, minimizing L_i is equivalent to maximizing the likelihood value of p(Y_i | X_i); substituting formula (7) into the likelihood gives formula (11), and variational inference is then introduced into formula (11), yielding:

log p(Y_i | X_i) ≥ E_{Z~q}[ log p_θ(y_ij | x_ij, z > γ) ] − KL( q(z) ‖ p(z) )
Wherein x is ij A j-th text table representing the i-th packet;
y ij a label representing the jth text table in the ith package;
z ij representing the contribution of the jth instance of the ith packet to the forward packet contribution of packet i;
gamma is the average proportion of positive examples in the package;
p θ (y ij |x ij ) Representing example x ij Is predicted as y ij Probability of (2);
p (z) represents the p distribution of contribution z;
p θ (y ij |x ij z) represents x ij The contribution degree of (2) is z, example x ij Is predicted as y ij Probability of (2);
p θ (y ij |x ij z > γ) represents x ij Contribution z > γ, and example x ij Is predicted as y ij Probability of (2);
q (z) represents the q distribution of the contribution z;
E Z~q [·]mean value under the condition that Z obeys q distribution is represented.
2. The semantic-based deep multi-instance weakly-supervised text classification method of claim 1, wherein S2 comprises the steps of:
s2-1, clustering topic related packages into a plurality of topics through an LDA algorithm, and extracting topic keywords;
s2-2, embedding and representing each keyword in the topic by adopting a fasttext model, and taking a vector average value of the topic strong-correlation keywords as a topic correlation vector;
the keywords of a topic {w_1, w_2, ..., w_K} are embedded and expressed as {v_1, v_2, ..., v_K}; the topic correlation vector is thus expressed as:

V_T = (1/K) Σ_{k=1}^{K} v_k
wherein V_T represents the topic correlation vector;
K represents the total number of topic strongly-related keywords.
3. The semantic-based deep multi-instance weakly-supervised text classification method of claim 1, further comprising: converting the vector pair into a dense vector and inputting the dense vector into a dual-branch neural network;
the dense vector is obtained by combining the word vector with the topic correlation vector V_T through an inner (bitwise) product and superimposing the result onto the word vector; the formula is as follows:

ṽ_l = [ v_l , v_l × V_T ]

wherein ṽ_l is the superimposed word vector and is the input of the dual-branch neural network;
[·, ·] represents the connection of two vectors;
v_l represents a word vector;
× represents the bitwise multiplication of the matrices;
V_T represents the topic correlation vector;
thus, the input to the dual-branch neural network can be expressed as:

x_ij = { ṽ_1 , ṽ_2 , ... , ṽ_L }

wherein x_ij is the j-th text instance of the i-th packet and is the input to the dual-branch neural network;
ṽ_1 represents the first superimposed word vector, ṽ_2 represents the second superimposed word vector, and ṽ_L represents the L-th superimposed word vector;
L represents the number of words contained in the text;
{·} on the right represents a set of vectors.
4. The semantic-based deep multi-instance weakly supervised text classification method of claim 1, wherein the neural network is any one of TextCNN, LSTM, and Transformer.
5. The semantic-based deep multi-instance weakly-supervised text classification method of claim 1, further comprising: s4, optimizing network parameters of the dual-branch neural network:
S4-1, the E step takes minimizing the KL divergence as its objective and optimizes the parameter φ; the objective function is:

L_E = KL( q_φ(z | x, Y_i = 1) ‖ p_θ(y | x) )

wherein KL(q ‖ p) denotes performing KL minimization between q and p;
q_φ(z | x, Y_i = 1) represents the output of branch one of the dual-branch neural network under the condition Y = 1, which is the category of the text instance;
Y_i = 1 indicates that the i-th packet is positive;
p_θ(y | x) represents the value calculated, with θ fixed, by the neural network determined by the parameter θ; for negative packets, p_θ(y | x) is 0 for every instance;
S4-2, the M step fixes the parameter φ, keeping the KL divergence between q_φ(z | x, Y) and p_θ(z | x, Y) unchanged for the same text; q_φ(z | x, Y) represents the output of branch one of the dual-branch neural network, i.e. the category of the text instance, and p_θ(z | x, Y) represents the output of branch two of the dual-branch neural network, i.e. the link between the text instance and the packet;
the expectation is then maximized by optimizing the parameter θ, and the expectation for the log-likelihood value is expressed as follows
L_M = E_{Z~q}[ log p_θ(y_ij | x_ij, z > γ) ]   (18)
wherein L_M represents the expectation of the log-likelihood value;
E_{Z~q}[·] represents the mean value under the condition that Z obeys the q distribution;
p_θ(y_ij | x_ij, z > γ) represents the probability, computed by the θ branch, that instance j of the i-th text packet is predicted as positive text;
z represents the degree of contribution;
γ is a hyper-parameter;
according to formula (7), L_M can be split into two parts bounded by z = γ: for z > γ only y_ij = 1 is meaningful, and for z < γ only y_ij = 0 is meaningful; therefore, the cost function L_M of the M step can be further decomposed into:

L_M = E_{z>γ}[ log p_θ(y_ij = 1 | x_ij) ] + E_{z≤γ}[ log p_θ(y_ij = 0 | x_ij) ]   (19)
wherein γ is a hyper-parameter;
p_θ(y_ij = 1 | x_ij) represents the probability that text instance j in packet i is positive text;
p_θ(y_ij = 0 | x_ij) represents the probability that text instance j in packet i is negative text;
y_ij = 1 means that text instance j in packet i is positive;
y_ij = 0 means that text instance j in packet i is negative;
equation (19) can be converted into cross entropy
L_M = −[ y′_ij log p_θ(y_ij | x_ij) + (1 − y′_ij) log(1 − p_θ(y_ij | x_ij)) ]   (20)
Wherein y' ij Is y ij In positive packets, it is determined by z, in negative packets, all 0;
Figure FDA0004220770810000052
wherein mean (·) represents averaging;
gamma is the average proportion of positive examples in the packet.
CN202211301646.4A 2022-10-24 2022-10-24 Deep multi-instance weak supervision text classification method based on semantics Active CN115563284B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211301646.4A CN115563284B (en) 2022-10-24 2022-10-24 Deep multi-instance weak supervision text classification method based on semantics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211301646.4A CN115563284B (en) 2022-10-24 2022-10-24 Deep multi-instance weak supervision text classification method based on semantics

Publications (2)

Publication Number Publication Date
CN115563284A CN115563284A (en) 2023-01-03
CN115563284B true CN115563284B (en) 2023-06-23

Family

ID=84767548

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211301646.4A Active CN115563284B (en) 2022-10-24 2022-10-24 Deep multi-instance weak supervision text classification method based on semantics

Country Status (1)

Country Link
CN (1) CN115563284B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108108849A (en) * 2017-12-31 2018-06-01 厦门大学 A kind of microblog emotional Forecasting Methodology based on Weakly supervised multi-modal deep learning
CN114140786A (en) * 2021-12-03 2022-03-04 杭州师范大学 Scene text recognition method based on HRNet coding and double-branch decoding
CN114722835A (en) * 2022-04-26 2022-07-08 河海大学 Text emotion recognition method based on LDA and BERT fusion improved model

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8145677B2 (en) * 2007-03-27 2012-03-27 Faleh Jassem Al-Shameri Automated generation of metadata for mining image and text data
CA2905996C (en) * 2013-03-13 2022-07-19 Guardian Analytics, Inc. Fraud detection and analysis
CN108073677B (en) * 2017-11-02 2021-12-28 中国科学院信息工程研究所 Multi-level text multi-label classification method and system based on artificial intelligence
CN108595632B (en) * 2018-04-24 2022-05-24 福州大学 Hybrid neural network text classification method fusing abstract and main body characteristics
CN109241377B (en) * 2018-08-30 2021-04-23 山西大学 Text document representation method and device based on deep learning topic information enhancement
CN109977413B (en) * 2019-03-29 2023-06-06 南京邮电大学 Emotion analysis method based on improved CNN-LDA
CN111444342B (en) * 2020-03-24 2021-12-10 湖南董因信息技术有限公司 Short text classification method based on multiple weak supervision integration
CN111695466B (en) * 2020-06-01 2023-03-24 西安电子科技大学 Semi-supervised polarization SAR terrain classification method based on feature mixup
US20230306050A1 (en) * 2020-08-05 2023-09-28 Siemens Aktiengesellschaft Decarbonizing BERT with Topics for Efficient Document Classification
CN113140020B (en) * 2021-05-13 2022-10-14 电子科技大学 Method for generating image based on text of countermeasure network generated by accompanying supervision
CN114139641B (en) * 2021-12-02 2024-02-06 中国人民解放军国防科技大学 Multi-modal characterization learning method and system based on local structure transfer
CN114297390B (en) * 2021-12-30 2024-04-02 江南大学 Aspect category identification method and system in long tail distribution scene
CN115114437A (en) * 2022-06-27 2022-09-27 山东师范大学 Gastroscope text classification system based on BERT and double-branch network


Also Published As

Publication number Publication date
CN115563284A (en) 2023-01-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant