CN111159335A - Short text classification method based on pyramid pooling and LDA topic model - Google Patents

Short text classification method based on pyramid pooling and LDA topic model

Info

Publication number
CN111159335A
Authority
CN
China
Prior art keywords
text
vector
topic
model
pyramid pooling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911276404.2A
Other languages
Chinese (zh)
Other versions
CN111159335B (en)
Inventor
陈雍君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 7 Research Institute
Original Assignee
CETC 7 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 7 Research Institute
Priority to CN201911276404.2A
Publication of CN111159335A
Application granted
Publication of CN111159335B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G06F 16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a short text classification method based on pyramid pooling and an LDA topic model, comprising the following steps: constructing a text vector matrix; converting the variable-length vectors of different texts into a uniform fixed-size vector representation through a pyramid pooling model; extracting a topic probability vector from each text vector with an LDA topic model; concatenating the pooled vector with the topic probability vector of the text, computing the similarity between texts with the cosine similarity formula, and classifying texts against a similarity threshold; and completing the classification of the short texts. The method considers not only the spatial distribution of the words but also their frequency relations, thereby avoiding feature loss and effectively improving the accuracy of short text classification.

Description

Short text classification method based on pyramid pooling and LDA topic model
Technical Field
The invention relates to the technical field of neural networks, in particular to a short text classification method based on pyramid pooling and an LDA topic model.
Background
In journal editing, one task is to distribute manuscript review assignments to experts in different fields according to the topics of the authors' abstracts; that is, short texts need to be classified. Before classification, the similarity between short texts must first be obtained so that the texts can be grouped.
Short texts are very short, and the number of documents is extremely large, so representing them as vectors leads to a high-dimensional sparsity problem. To address this, many researchers classify short texts by combining topic models with word vectors; others take the words of the short text and expand its semantic information with an external lexicon. These methods, however, have several problems:
(1) Neither the associations among the words of a short text nor the context between words is considered. That is, if two texts contain the same words but in a randomly shuffled order, the earlier methods produce the same classification result for both. Chinese words, however, are ambiguous, and different arrangements and combinations of the same words can express different meanings, so the context and associations between words must be considered in short text classification.
(2) The inconsistency in word count and word frequency between different texts is not considered. For example, suppose text A has 15 words, some of which appear multiple times, while text B also has 15 mostly identical words but none appearing twice or more. Traditional short text classification combining a topic model with word vectors yields the same result for both, because LDA (Latent Dirichlet Allocation) fundamentally ignores how often a given word appears in a given short text and determines the topic only from which words appear.
An example is given below:
text A: the network layout technology of 5G is realized by adopting the artificial intelligence field and is combined with the development … … of the related industries of mobile communication
Text B: the development of the artificial intelligence field adopts the artificial intelligence technology and combines the speed of the 5G mobile communication network to carry out related industry layout. Techniques … … through artificial intelligence
In the example above, most of the word vectors of text A and text B are the same: both contain "artificial intelligence", "network", "mobile communication", "industry", "layout" and so on. If short text classification is performed with a topic model and word vectors, the two texts are likely to be judged similar and, combined with the topic probability vectors, classified into the same category. Clearly, however, artificial intelligence is mentioned many times in text B, which is evidently an article in the field of artificial intelligence, whereas text A mentions artificial intelligence only as a means of 5G network layout and is evidently an article in the field of mobile communication. When assigning review tasks, the computer is therefore likely to make a wrong decision and distribute the two articles to experts in the same field.
(3) The number of words in different texts differs, so after text vectors are obtained their dimensions differ (naturally, since the word counts differ). A common remedy is repeated pooling and downsampling to make the text feature representation increasingly abstract, typically by feeding it into a CNN (convolutional neural network) to extract high-level features. The drawback of this conventional approach is that the resolution of the text features drops quickly, and the meaning expressed by some key words may be drowned out.
Disclosure of Invention
The invention provides a short text classification method based on pyramid pooling and an LDA topic model, aiming to solve the low classification accuracy of existing short text classification techniques, which consider neither the contextual and associative relations among words nor the inconsistency in word count and frequency between texts. The method considers both the spatial distribution of words and their frequency relations, avoids feature loss, and effectively improves the accuracy of short text classification.
In order to achieve the above purpose, the technical scheme is as follows: a short text classification method based on pyramid pooling and an LDA topic model, comprising the following steps:
S1: constructing a text vector matrix;
S2: fixing the vectors of different texts into a uniform vector representation through a pyramid pooling model;
S3: using an LDA topic model to extract a topic probability vector from the text vector of step S1;
S4: concatenating the vector produced by the pyramid pooling model in step S2 with the topic probability vector of the text obtained in step S3, computing the similarity between texts with the cosine similarity formula, and performing text classification with a similarity threshold;
S5: completing the classification of the short texts.
Preferably, in step S1, the text vector matrix is constructed by representing the words of the text as vectors arranged in their original order; word segmentation yields n words, giving V(w) = {v1, v2, ..., vn}, where n denotes the number of words in the text and w indexes the words in order. Assuming each word vector has dimension h, arranging the word vectors in order produces a word vector matrix of size (w, h).
Further, in step S2, based on the word vector matrix constructed in step S1, the vectors of different texts are fixed into a unified vector representation by a pyramid pooling model as follows.
For a word vector matrix of arbitrary size, assume its size is (w, h) and it must be converted into a fixed-size h1-dimensional vector; the pyramid pooling model proceeds as follows:
S201: divide the word vector matrix (w, h) into K1 × K1 sub-matrices, each of size (w/K1, h/K1);
S202: divide the word vector matrix (w, h) into K2 × K2 sub-matrices, each of size (w/K2, h/K2);
S203: divide the word vector matrix (w, h) into K3 × K3 sub-matrices, each of size (w/K3, h/K3);
S204: fuse the sub-matrices obtained in steps S201, S202 and S203 into an h1-dimensional representation, where K1, K2 and K3 are mutually distinct positive integers and h1 = K1 × K1 + K2 × K2 + K3 × K3.
In step S3, after obtaining the text vector matrix extracted in step S1, the text is trained with an LDA topic model to obtain a text topic vector; the topic vector reveals the meaning implied by the text, so the deep features of the text can be extracted directly. The topic probability vector obtained for any text is Z = {p1, p2, …, pn}, where n equals h1.
In step S4, specifically, the vector produced by the pyramid pooling model in step S2 is concatenated with the topic probability vector of the text obtained in step S3; the cosine similarity between the text to be classified and each short text under every topic is then computed, the similarity scores are ranked, and the topic of the text with the highest similarity is selected as the final topic, completing the classification of the short texts.
The invention has the following beneficial effects:
1. The method solves the problem of inconsistent lengths of text input vectors; the pyramid pooling model effectively extracts both global and local features and prevents feature loss.
2. The invention considers not only word-level semantics but also text-level semantic information, which it captures by constructing a text vector matrix; the pyramid pooling model preserves text-level semantic information to the greatest extent while unifying the vector size. The topic probability vector extracted by the LDA topic model mines the implicit meaning of the text at a deep level, so the deep features of the text can be extracted directly.
Drawings
Fig. 1 is a flowchart of the steps of the short text classification method described in embodiment 1.
Fig. 2 is a schematic diagram of embodiment 1 in which each word vector has 100 dimensions.
Fig. 3 is a schematic process diagram of the pyramid pooling model of embodiment 1.
Detailed Description
The invention is described in detail below with reference to the drawings and the detailed description.
Example 1
As shown in Fig. 1, a short text classification method based on pyramid pooling and an LDA topic model comprises the following steps:
s1: constructing a text vector matrix;
the dimensions of the word vectors for each text are not the same because the number of words in each text is different.
In order to classify texts, the traditional technical means is to obtain text vectors by using an averaging method, but the method easily loses the characteristics of texts, and the method does not consider the frequency of text words, namely the contribution degree of the words.
In order to solve the problem of feature loss caused by an averaging method on a word vector matrix, words of a text are arranged in sequence by adopting vector representation in a similar picture pixel mode. For example, with the development of artificial intelligence, the intelligent manufacturing has qualitative changes in the links of design, production, management and the like, and the design period is shortened by combining the means of artificial intelligence; the production cost is reduced and the production efficiency is improved by adopting a man-machine cooperation mode; the operation cost is greatly reduced by means of real-time monitoring and detection in multiple production links, the qualification rate of products is improved, and the production efficiency of the manufacturing industry is improved.
Word segmentation yields n words v1, v2, …, vn, i.e., V(w) = {v1, v2, ..., vn}, where n denotes the number of words in the text and w indexes the words in order.
The word vectors are then arranged in order; assuming each word vector has 100 dimensions, the final word vector matrix of the short text can be represented as shown in Fig. 2.
Since the word vectors are arranged in order, a word vector matrix of size (w, h) is obtained, where w is the number of word vectors and h is the dimensionality of each word vector; h is 100 in this embodiment, but in practice it should be set according to the actual situation.
S2: based on the word vector matrix constructed in step S1, the following pyramid pooling model is used to fix the vectors of different texts into a unified vector representation:
for an arbitrarily sized word vector matrix, assume its size is (w, h); if the vector needs to be converted into a h 1-dimensional vector with fixed size; namely, inputting the following data at the input layer of the pyramid pooling model: a word vector matrix of arbitrary size, assuming size (w, h); in the present embodiment, in the output layer: 21 neurons, namely, a word vector matrix with any size is input, and 21 submatrices are extracted.
As shown in Fig. 3, the feature extraction of the pyramid pooling model proceeds as follows:
S201: divide the word vector matrix (w, h) into K1 × K1 sub-matrices; in this embodiment K1 = 4, so each sub-matrix has size (w/4, h/4);
S202: divide the word vector matrix (w, h) into K2 × K2 sub-matrices; in this embodiment K2 = 2, so each sub-matrix has size (w/2, h/2);
S203: divide the word vector matrix (w, h) into K3 × K3 sub-matrices; in this embodiment K3 = 1, i.e., the whole word vector matrix (w, h) is taken as a single sub-matrix of size (w, h);
S204: fuse the sub-matrices obtained in steps S201, S202 and S203 to obtain h1 = 4 × 4 + 2 × 2 + 1 × 1 = 21 features. The features need not be fixed at 21 dimensions: designers can set K1, K2 and K3 according to their requirements to obtain features of other dimensions, as in the sketch below.
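As a concrete illustration of steps S201 to S204, below is a minimal Python sketch of the pyramid pooling, assuming each sub-matrix is reduced to a single value by max pooling, as in standard spatial pyramid pooling; the patent does not state the pooling operator, so max pooling is an assumption.

import numpy as np

def pyramid_pool(matrix, levels=(4, 2, 1)):
    # Fix a (w, h) matrix into a 4*4 + 2*2 + 1*1 = 21-dimensional vector.
    w, h = matrix.shape
    features = []
    for k in levels:
        # Split rows and columns into k roughly equal bins
        # (assumes w >= k; shorter texts would need padding).
        row_edges = np.linspace(0, w, k + 1, dtype=int)
        col_edges = np.linspace(0, h, k + 1, dtype=int)
        for i in range(k):
            for j in range(k):
                cell = matrix[row_edges[i]:row_edges[i + 1],
                              col_edges[j]:col_edges[j + 1]]
                features.append(cell.max())  # assumed pooling operator
    return np.array(features)  # shape (21,) for levels (4, 2, 1)

Whatever the input size (w, h), the output length depends only on K1, K2 and K3, which is what allows texts of different lengths to share one fixed-size representation.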
S3: using an LDA topic model to extract a topic probability vector from the text vector of step S1; specifically:
After obtaining the text vector matrix extracted in step S1, the text is trained with an LDA topic model to obtain a text topic vector; the topic vector reveals the meaning implied by the text, so the deep features of the text can be extracted directly. The topic probability vector obtained for any text is Z = {p1, p2, …, pn}, where n equals h1; n is 21 in this embodiment.
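A minimal sketch of step S3 using the gensim library (an implementation choice assumed here; the patent does not name one), training a 21-topic LDA model on a tokenized corpus and reading out the full topic probability vector of one document:

from gensim import corpora
from gensim.models import LdaModel

def topic_vector(tokenized_docs, doc_index, num_topics=21):
    # Train LDA and return Z = {p1, ..., pn} for one text, n = num_topics.
    dictionary = corpora.Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = LdaModel(corpus, num_topics=num_topics, id2word=dictionary)
    # minimum_probability=0.0 keeps all topics, even near-zero ones,
    # so the vector always has exactly num_topics entries.
    dist = lda.get_document_topics(corpus[doc_index], minimum_probability=0.0)
    return [p for _, p in sorted(dist)]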
S4: concatenating the vector produced by the pyramid pooling model in step S2 with the topic probability vector of the text obtained in step S3, computing the similarity between texts with the cosine similarity formula, and performing text classification with a similarity threshold; specifically:
By vector concatenation, the vector obtained through the pyramid pooling model in step S2 is joined with the topic probability vector of the text obtained in step S3; the similarity between the text to be classified and each short text under every topic is then computed by the cosine similarity method, with the following formula:
sim(a, b) = (a · b) / (|a| × |b|)
where a represents the vector produced by the pyramid pooling model and b represents the topic probability vector of the text obtained in step S3.
The similarity scores are ranked, and the topic of the text with the highest similarity is selected as the final topic.
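Putting step S4 together, here is a minimal sketch that concatenates the two representations and classifies by the highest cosine similarity; candidates is a hypothetical list of (topic_label, reference_vector) pairs built from already-classified short texts.

import numpy as np

def cosine_sim(a, b):
    # sim(a, b) = (a . b) / (|a| * |b|)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify(pooled, topic_vec, candidates):
    query = np.concatenate([pooled, topic_vec])  # step S4: vector splicing
    scores = [(label, cosine_sim(query, ref)) for label, ref in candidates]
    scores.sort(key=lambda s: s[1], reverse=True)  # rank similarity scores
    return scores[0][0]  # topic of the text with the highest similarity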
S5: the classification of the short texts is complete.
The short text classification method described in the above embodiment has the following effects:
1. The pyramid pooling model can handle short texts of different lengths, and pyramid pooling takes contextual relations within the text into account, so the contextual relations of short texts, and the associations between words, are reflected to a certain extent.
2. The word vector matrix is constructed according to the order of the words; it considers not only the spatial distribution of the words but also their frequency relations, avoiding the problem of feature loss.
3. This embodiment combines a pyramid pooling model with an LDA topic model to provide a universal, highly extensible short text classification method. The feature sparsity problem caused by text length need not be considered, and the accuracy of text classification can be improved to a great extent.
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (5)

1. A short text classification method based on pyramid pooling and an LDA topic model, characterized in that the method comprises the following steps:
S1: constructing a text vector matrix;
S2: fixing the vectors of different texts into a uniform vector representation through a pyramid pooling model;
S3: using an LDA topic model to extract a topic probability vector from the text vector of step S1;
S4: concatenating the vector produced by the pyramid pooling model in step S2 with the topic probability vector of the text obtained in step S3, computing the similarity between texts with the cosine similarity formula, and performing text classification with a similarity threshold;
S5: completing the classification of the short texts.
2. The pyramid pooling and LDA topic model-based short text classification method of claim 1, wherein: in step S1, the text vector matrix is constructed by representing the words of the text as vectors arranged in their original order; word segmentation yields n words, giving V(w) = {v1, v2, ..., vn}, where n denotes the number of words in the text and w indexes the words in order; and, assuming each word vector has dimension h, arranging the word vectors in order produces a word vector matrix of size (w, h).
3. The pyramid pooling and LDA topic model-based short text classification method of claim 2, wherein: in step S2, based on the word vector matrix constructed in step S1, the vectors of different texts are fixed into a unified vector representation by a pyramid pooling model:
for a word vector matrix of arbitrary size, assume its size is (w, h) and it must be converted into a fixed-size h1-dimensional vector; the pyramid pooling model proceeds as follows:
S201: divide the word vector matrix (w, h) into K1 × K1 sub-matrices, each of size (w/K1, h/K1);
S202: divide the word vector matrix (w, h) into K2 × K2 sub-matrices, each of size (w/K2, h/K2);
S203: divide the word vector matrix (w, h) into K3 × K3 sub-matrices, each of size (w/K3, h/K3);
S204: fuse the sub-matrices obtained in steps S201, S202 and S203 into an h1-dimensional representation, where K1, K2 and K3 are mutually distinct positive integers and h1 = K1 × K1 + K2 × K2 + K3 × K3.
4. The pyramid pooling and LDA topic model-based short text classification method of claim 3, wherein: in step S3, after obtaining the text vector matrix extracted in step S1, the text is trained with an LDA topic model to obtain a text topic vector; the topic vector reveals the meaning implied by the text, so that the deep features of the text are extracted directly; the topic probability vector obtained for any text is Z = {p1, p2, …, pn}, where n equals h1.
5. The pyramid pooling and LDA topic model-based short text classification method of claim 4, wherein: in step S4, the vector produced by the pyramid pooling model in step S2 is concatenated with the topic probability vector of the text obtained in step S3; the cosine similarity between the text to be classified and each short text under every topic is then computed, the similarity scores are ranked, and the topic of the text with the highest similarity is selected as the final topic, completing the classification of the short texts.
CN201911276404.2A (priority 2019-12-12, filed 2019-12-12): Short text classification method based on pyramid pooling and LDA topic model. Status: Active. Granted as CN111159335B.

Priority Applications (1)

Application number: CN201911276404.2A (granted as CN111159335B); Priority date: 2019-12-12; Filing date: 2019-12-12; Title: Short text classification method based on pyramid pooling and LDA topic model

Applications Claiming Priority (1)

Application number: CN201911276404.2A (granted as CN111159335B); Priority date: 2019-12-12; Filing date: 2019-12-12; Title: Short text classification method based on pyramid pooling and LDA topic model

Publications (2)

Publication number and publication date:
CN111159335A, 2020-05-15
CN111159335B, 2023-06-23

Family

ID=70556839

Family Applications (1)

Application number: CN201911276404.2A (Active, granted as CN111159335B); Priority date: 2019-12-12; Filing date: 2019-12-12; Title: Short text classification method based on pyramid pooling and LDA topic model

Country Status (1)

CN: CN111159335B

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779987A (en) * 2021-08-23 2021-12-10 科大国创云网科技有限公司 Event co-reference disambiguation method and system based on self-attention enhanced semantics

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160070731A1 (en) * 2014-09-10 2016-03-10 Adobe Systems Incorporated Analytics based on scalable hierarchical categorization of web content
WO2018125580A1 (en) * 2016-12-30 2018-07-05 Konica Minolta Laboratory U.S.A., Inc. Gland segmentation with deeply-supervised multi-level deconvolution networks
CN108536870A (en) * 2018-04-26 2018-09-14 南京大学 A kind of text sentiment classification method of fusion affective characteristics and semantic feature
CN109376226A (en) * 2018-11-08 2019-02-22 合肥工业大学 Complain disaggregated model, construction method, system, classification method and the system of text
CN109739951A (en) * 2018-12-25 2019-05-10 广东工业大学 A kind of text feature based on LDA topic model
CN110378484A (en) * 2019-04-28 2019-10-25 清华大学 A kind of empty spatial convolution pyramid pond context learning method based on attention mechanism

Also Published As

CN111159335B, granted 2023-06-23

Similar Documents

Publication Publication Date Title
CN110866117B (en) Short text classification method based on semantic enhancement and multi-level label embedding
CN110825845B (en) Hierarchical text classification method based on character and self-attention mechanism and Chinese text classification method
CN107705066B (en) Information input method and electronic equipment during commodity warehousing
CN112966522A (en) Image classification method and device, electronic equipment and storage medium
CN109284406B (en) Intention identification method based on difference cyclic neural network
CN107015963A (en) Natural language semantic parsing system and method based on deep neural network
CN112487190B (en) Method for extracting relationships between entities from text based on self-supervision and clustering technology
CN102289522A (en) Method of intelligently classifying texts
CN104199857A (en) Tax document hierarchical classification method based on multi-tag classification
CN109034248B (en) Deep learning-based classification method for noise-containing label images
CN112417153B (en) Text classification method, apparatus, terminal device and readable storage medium
CN108154156B (en) Image set classification method and device based on neural topic model
CN110019653B (en) Social content representation method and system fusing text and tag network
CN113515632A (en) Text classification method based on graph path knowledge extraction
CN115982403A (en) Multi-mode hash retrieval method and device
CN113901289A (en) Unsupervised learning-based recommendation method and system
CN113590827B (en) Scientific research project text classification device and method based on multiple angles
CN111159335A (en) Short text classification method based on pyramid pooling and LDA topic model
CN104123336A (en) Deep Boltzmann machine model and short text subject classification system and method
CN109543038A (en) A kind of sentiment analysis method applied to text data
He et al. Classification of metro facilities with deep neural networks
Sun et al. A hybrid approach to news recommendation based on knowledge graph and long short-term user preferences
CN105354264B (en) A kind of quick adding method of theme label based on local sensitivity Hash
CN112257448A (en) Multitask named entity identification method, system, medium and terminal
CN117150148A (en) Social network public opinion situation monitoring method based on pre-training model

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant