CN111159335B - Short text classification method based on pyramid pooling and LDA topic model - Google Patents

Short text classification method based on pyramid pooling and LDA topic model

Info

Publication number
CN111159335B
Authority
CN
China
Prior art keywords
text
vector
topic
vectors
model
Prior art date
Legal status
Active
Application number
CN201911276404.2A
Other languages
Chinese (zh)
Other versions
CN111159335A (en)
Inventor
陈雍君
Current Assignee
CETC 7 Research Institute
Original Assignee
CETC 7 Research Institute
Priority date
Filing date
Publication date
Application filed by CETC 7 Research Institute
Priority to CN201911276404.2A
Publication of CN111159335A
Application granted
Publication of CN111159335B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/35: Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a short text classification method based on pyramid pooling and an LDA topic model, which comprises the following steps: constructing a text vector matrix; fixing the vectors of different texts into a uniform vector representation through a pyramid pooling model; extracting text topic probability vectors from the text vectors with an LDA topic model to obtain the topic probability vector of each text; concatenating the pyramid-pooled vector with the obtained topic probability vector of the text, computing the similarity between texts with the cosine similarity formula, and performing text classification by combining with a similarity threshold; and completing the classification of the short texts. The method considers both the spatial distribution and the frequency of words, avoids the problem of feature loss, and effectively improves the accuracy of short text classification.

Description

Short text classification method based on pyramid pooling and LDA topic model
Technical Field
The invention relates to the technical field of neural networks, in particular to a short text classification method based on pyramid pooling and an LDA topic model.
Background
In journal editing, manuscripts must be distributed for review to experts in different fields according to the subject of the author's abstract; that is, short texts need to be classified. Before short texts can be classified, a method is first needed to measure the similarity between them.
It is well known that short texts are very short and extremely numerous, so representing them as vectors raises the problem of high-dimensional sparsity. To address this problem, many researchers have combined topic models with word vectors for short text classification; others have used external lexicons to expand the semantic information carried by the words of a short text. However, these methods have several problems:
(1) Neither the associations between the words of a short text nor the context between words is considered. That is, if two texts contain the same words but in a random arrangement, the methods of earlier researchers produce the same classification result for both. However, Chinese words are ambiguous, and different arrangements and combinations of the same words may express different meanings, so the contextual and associative relationships between the words of a text do need to be considered when classifying short texts.
(2) The differing word counts and word frequencies of different texts are not considered. For example, text A has 15 words, some of which appear several times; text B also has 15 words, mostly the same ones, but none of them appears twice or more. If the traditional combination of a topic model and word vectors is used to classify these short texts, the classification results are the same, because LDA (Latent Dirichlet Allocation) does not consider at all how often a word occurs within a short text and defines the topic only from the words themselves.
An example is given below:
Text A: the 5G network layout technology is realized by adopting the field of artificial intelligence, combined with the development of industries related to mobile communication ……
Text B: the development of the field of artificial intelligence adopts artificial intelligence technology and, combined with the speed of 5G mobile communication networks, carries out the layout of related industries. Through artificial intelligence techniques ……
In the example above, the word vectors of text A and text B are mostly the same; both contain artificial intelligence, network, mobile communication, industry, layout and so on. If a topic model and word vectors are used to classify these short texts, their word vectors are likely to be similar, and combined with the topic probability vectors the two texts are likely to be placed in the same class. Clearly, however, artificial intelligence techniques are mentioned many times in text B, which is therefore an article in the field of artificial intelligence, while text A merely uses artificial intelligence to lay out a 5G network, i.e., it is an article in the field of mobile communication. Thus, in the manuscript-review task the computer is likely to make an erroneous decision and assign both articles to experts in the same field.
(3) After text vectors are obtained, different texts have different numbers of words, so the dimensions of their text vectors differ (since the word counts differ, the vector dimensions naturally differ). Successive pooling and downsampling are then usually applied to make the text feature representations more abstract; the common practice is to feed them into a CNN (convolutional neural network) to extract high-level features. One drawback of doing so is that the resolution of the text features drops quickly, so the meaning expressed by some key words is likely to be drowned out.
Disclosure of Invention
The invention aims to solve the problems that current short text classification techniques consider neither the contextual and associative relationships between the words of a text nor the differing word counts and word frequencies of different texts, which leads to low classification accuracy. The invention therefore provides a short text classification method based on pyramid pooling and an LDA topic model, which considers both the spatial distribution and the frequency of words, avoids the problem of feature loss, and effectively improves the accuracy of short text classification.
In order to achieve the above purpose, the following technical scheme is adopted: a short text classification method based on pyramid pooling and an LDA topic model, comprising the following steps:
S1: constructing a text vector matrix;
S2: fixing the vectors of different texts into a uniform vector representation through a pyramid pooling model;
S3: extracting text topic probability vectors from the text vectors of step S1 using an LDA topic model, to obtain the topic probability vector of the text;
S4: concatenating the vector obtained through the pyramid pooling model in step S2 with the topic probability vector of the text obtained in step S3, computing the similarity between texts with the cosine similarity formula, and performing text classification by combining with a similarity threshold;
S5: completing the classification of the short texts.
Preferably, in step S1, the text vector matrix is constructed by representing the words of the text as vectors and arranging them in order. Word vector division yields n words, V(w) = {v1, v2, ..., vn}, where n represents the number of words in the text and w is the number of words arranged in word order.
Since the word vectors are arranged in order, and assuming each word vector has h dimensions, a word vector matrix of size (w, h) is obtained.
Further, in step S2, based on the word vector matrix constructed in step S1, the following pyramid pooling model is used to fix the vectors of different texts into a unified vector representation:
For a word vector matrix of arbitrary size, assume its size is (w, h); the matrix needs to be converted into a fixed-size h1-dimensional vector; the specific processing of the pyramid pooling model is as follows:
S201: divide the word vector matrix (w, h) of arbitrary size into K1 × K1 submatrices, i.e., each submatrix has size (w/K1, h/K1);
S202: divide the word vector matrix (w, h) of arbitrary size into K2 × K2 submatrices, where each submatrix has size (w/K2, h/K2);
S203: divide the word vector matrix (w, h) into K3 × K3 submatrices, where each submatrix has size (w/K3, h/K3);
S204: fuse the submatrices obtained in steps S201, S202 and S203 to obtain h1 values, where K1, K2 and K3 are distinct positive integers and h1 = K1×K1 + K2×K2 + K3×K3.
Further, in step S3, after the text vector matrix has been extracted in step S1, the text is trained with an LDA topic model to obtain the text topic vector; the topic vector reveals the implied meaning of the text, and in this way the deep features of the text are extracted directly. The topic probability vector of any text obtained in this step is z = {p1, p2, ..., pn}, where n equals h1.
Further, step S4 specifically comprises: concatenating the vector obtained through the pyramid pooling model in step S2 with the topic probability vector of the text obtained in step S3, computing the similarity between the text to be classified and any short text under any topic with the cosine similarity method, ranking the similarity values, and selecting the topic of the text with the maximum similarity as the final topic, thereby completing the classification of the short text.
The beneficial effects of the invention are as follows:
1. The method can solve the problem of inconsistent lengths of text input vectors; the pyramid pooling model not only effectively extracts global and local features, but also prevents the problem of feature loss.
2. The invention considers semantics at both the word level and the text level. Text-level semantic information is handled by constructing the text vector matrix, and the pyramid pooling model preserves text-level semantic information to the maximum extent while unifying the vector size; the topic probability vector of the text extracted by the LDA topic model can mine the implied meaning of the text in depth and extract its deep features directly.
Drawings
FIG. 1 is a flow chart of the steps of the short text classification method described in example 1.
FIG. 2 is a schematic diagram of embodiment 1 with 100 dimensions for each word vector.
FIG. 3 is a schematic illustration of the process of the pyramid pooling model of example 1.
Detailed Description
The invention is described in detail below with reference to the drawings and the detailed description.
Example 1
As shown in fig. 1, a short text classification method based on pyramid pooling and LDA topic model, the short text classification method comprises the following steps:
s1: constructing a text vector matrix;
Because the number of words differs from text to text, the dimension of the word vector matrix also differs from text to text.
To classify texts, the traditional approach is to obtain a text vector by averaging, but this easily loses the features of the text and ignores the frequency of the words, i.e., the contribution of each word.
To avoid the feature loss that averaging causes on the word vector matrix, the words of a text are represented as vectors and arranged in order, in a manner analogous to picture pixels. For example: with the development of artificial intelligence, intelligent manufacturing has changed in its design, production, management and other links; combining artificial intelligence means shortens the design cycle; adopting human-machine cooperation reduces production cost and improves production efficiency; and real-time monitoring and inspection across multiple production links greatly reduces operating cost, improves the product qualification rate, and raises the production efficiency of the manufacturing industry.
Word vector division yields n words v1, v2, ..., vn, expressed as V(w) = {v1, v2, ..., vn}, where n represents the number of words in the text and w is the number of words arranged in word order.
Assuming each word vector has 100 dimensions and the vectors are ordered in sequence, the word vector matrix of the short text can then be represented as shown in FIG. 2.
Since the word vectors are arranged in order, a word vector matrix of size (w, h) is obtained, where w is the number of words arranged in word order and h is the dimension of each word vector itself; this embodiment sets h = 100, although in practice h should be set according to the actual situation.
S2: based on the word vector matrix constructed in step S1, the pyramid pooling model is adopted to fix the vectors of different texts into a uniform vector representation:
For a word vector matrix of arbitrary size, assume its size is (w, h); the matrix needs to be converted into a fixed-size h1-dimensional vector. That is, the input layer of the pyramid pooling model receives a word vector matrix of arbitrary size (w, h), and in this embodiment the output layer is set to 21 neurons: whatever the size of the input word vector matrix, 21 submatrix features are extracted.
As shown in fig. 3, the feature extraction of the pyramid pooling model is specifically as follows:
S201: divide the word vector matrix (w, h) of arbitrary size into K1 × K1 submatrices; in this embodiment K1 = 4, i.e., each submatrix has size (w/4, h/4);
S202: divide the word vector matrix (w, h) of arbitrary size into K2 × K2 submatrices; in this embodiment K2 = 2, so each submatrix has size (w/2, h/2);
S203: divide the word vector matrix (w, h) into K3 × K3 submatrices; in this embodiment K3 = 1, i.e., the word vector matrix (w, h) itself is taken as one submatrix of size (w, h);
S204: fuse the submatrices obtained in steps S201, S202 and S203 to obtain h1 features, where h1 = 4×4 + 2×2 + 1×1 = 21. The 21-dimensional feature is not fixed; the designer can set K1, K2 and K3 as required to obtain features of different dimensions.
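The pooling in steps S201 to S204 can be sketched as follows. The patent states only that the submatrices are "fused"; this sketch assumes max pooling within each grid cell, as in standard spatial pyramid pooling, so each K × K level contributes K × K values and the levels (4, 2, 1) yield the 21-dimensional output of this embodiment. It also assumes w and h are at least as large as the largest K.

```python
import numpy as np

def pyramid_pool(matrix, levels=(4, 2, 1)):
    """Pool a (w, h) word vector matrix into a fixed-length vector.

    With levels (4, 2, 1) the output has 4*4 + 2*2 + 1*1 = 21 dimensions,
    matching the 21 output neurons of this embodiment.
    """
    pooled = []
    for k in levels:
        # np.array_split tolerates w or h not being exact multiples of k
        for row_block in np.array_split(matrix, k, axis=0):
            for cell in np.array_split(row_block, k, axis=1):
                pooled.append(cell.max())  # assumed fusion: max over each cell
    return np.asarray(pooled)              # length h1 = sum(k * k for k in levels)
```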
S3: extract text topic probability vectors from the text vectors of step S1 using an LDA topic model to obtain the topic probability vector of the text. Specifically:
After the text vector matrix has been extracted in step S1, the text is trained with the LDA topic model to obtain the text topic vector; the topic vector reveals the implied meaning of the text, and in this way the deep features of the text are extracted directly. The topic probability vector of any text obtained in this step is z = {p1, p2, ..., pn}, where n equals h1; this embodiment takes n = 21.
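A minimal sketch of step S3, using gensim's LdaModel as one possible LDA implementation (the patent does not prescribe a library); num_topics is set to 21 so that the topic probability vector z has the same length h1 as the pooled vector, as this embodiment requires.

```python
from gensim import corpora, models

def topic_probability_vectors(texts, n_topics=21):
    """texts: a list of token lists; returns one topic probability vector per text."""
    dictionary = corpora.Dictionary(texts)
    bows = [dictionary.doc2bow(t) for t in texts]
    lda = models.LdaModel(corpus=bows, id2word=dictionary, num_topics=n_topics)
    # minimum_probability=0.0 keeps every topic, so each vector has length n_topics
    return [[p for _, p in lda.get_document_topics(b, minimum_probability=0.0)]
            for b in bows]
```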
S4: concatenate the vector obtained through the pyramid pooling model in step S2 with the topic probability vector of the text obtained in step S3, compute the similarity between texts with the cosine similarity formula, and perform text classification by combining with a similarity threshold. Specifically:
The vector obtained through the pyramid pooling model in step S2 is concatenated with the topic probability vector of the text obtained in step S3, and the similarity between the text to be classified and any short text under any topic is then computed with the cosine similarity method, the specific formula being:
$$\operatorname{sim}(a, b) = \frac{a \cdot b}{\|a\|\,\|b\|}$$
where a denotes the vector output by the pyramid pooling model and b denotes the topic probability vector of the text obtained in step S3.
By ranking the similarity values, the topic of the text with the maximum similarity is selected as the final topic for classification.
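A minimal sketch of step S4 under the same assumptions: the pooled vector and the topic probability vector are concatenated, and a text is assigned the topic of the reference text most similar to it under cosine similarity. All names are illustrative.

```python
import numpy as np

def fuse(pooled_vec, topic_vec):
    """Concatenate the pyramid-pooled vector with the topic probability vector."""
    return np.concatenate([pooled_vec, topic_vec])  # 21 + 21 = 42 dimensions here

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify(query_vec, reference_vecs, reference_topics):
    """Assign the topic of the reference text with the highest similarity."""
    sims = [cosine_similarity(query_vec, r) for r in reference_vecs]
    return reference_topics[int(np.argmax(sims))]
```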
S5: the classification of the short texts is completed.
The short text classification method described by the above embodiment has the following effects:
1. This embodiment adopts the pyramid pooling model and can therefore handle short texts of different lengths; because pyramid pooling already takes the relationships within the text into account, it reflects to a certain extent the contextual relationships of a short text and the associations between its words.
2. A word vector matrix is constructed according to the order of the words; this matrix considers both the spatial distribution and the frequency of the words, avoiding the problem of feature loss.
3. This embodiment fuses the pyramid pooling model with the LDA topic model to provide a universal and highly extensible short text classification method. The method removes the need to consider text length, and thus the feature sparsity it causes, and can improve the accuracy of text classification to a great extent.
It is to be understood that the above examples of the present invention are provided by way of illustration only and not as a limitation of its embodiments. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention is intended to be protected by the following claims.

Claims (3)

1. A short text classification method based on pyramid pooling and an LDA topic model, characterized in that the short text classification method comprises the following steps:
S1: constructing a text vector matrix;
S2: fixing the vectors of different texts into a uniform vector representation through a pyramid pooling model;
S3: extracting text topic probability vectors from the text vectors of step S1 using an LDA topic model, to obtain the topic probability vector of the text;
S4: concatenating the vector obtained through the pyramid pooling model in step S2 with the topic probability vector of the text obtained in step S3, computing the similarity between texts with the cosine similarity formula, and performing text classification by combining with a similarity threshold;
S5: completing the classification of the short texts;
wherein step S1 constructs the text vector matrix by representing the words of the text as vectors arranged in order; word vector division yields n words,
V(w) = {v1, v2, ..., vn},
where n represents the number of words in the text and w is the number of words arranged in word order;
and, assuming each word vector has h dimensions, since the word vectors are arranged in order, a word vector matrix of size (w, h) is obtained;
and step S2, based on the word vector matrix constructed in step S1, fixes the vectors of different texts into a uniform vector representation by adopting a pyramid pooling model:
for a word vector matrix of arbitrary size, assume its size is (w, h); the matrix needs to be converted into a fixed-size h1-dimensional vector; the specific processing of the pyramid pooling model is as follows:
S201: divide the word vector matrix (w, h) of arbitrary size into K1 × K1 submatrices, i.e., each submatrix has size (w/K1, h/K1);
S202: divide the word vector matrix (w, h) of arbitrary size into K2 × K2 submatrices, where each submatrix has size (w/K2, h/K2);
S203: divide the word vector matrix (w, h) into K3 × K3 submatrices, where each submatrix has size (w/K3, h/K3);
S204: fuse the submatrices obtained in steps S201, S202 and S203 to obtain h1 values, where K1, K2 and K3 are distinct positive integers and h1 = K1×K1 + K2×K2 + K3×K3.
2. The short text classification method based on pyramid pooling and an LDA topic model according to claim 1, characterized in that: in step S3, after the text vector matrix has been extracted in step S1, the text is trained with an LDA topic model to obtain the text topic vector; the topic vector reveals the implied meaning of the text, and in this way the deep features of the text are extracted directly; the topic probability vector of any text obtained in this step is z = {p1, p2, ..., pn}, where n equals h1.
3. The short text classification method based on pyramid pooling and an LDA topic model according to claim 2, characterized in that: step S4 specifically comprises concatenating the vector obtained through the pyramid pooling model in step S2 with the topic probability vector of the text obtained in step S3, computing the similarity between the text to be classified and any short text under any topic with the cosine similarity method, ranking the similarity values, and selecting the topic of the text with the maximum similarity as the final topic for division, thereby completing the classification of the short texts.
CN201911276404.2A 2019-12-12 2019-12-12 Short text classification method based on pyramid pooling and LDA topic model Active CN111159335B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911276404.2A CN111159335B (en) 2019-12-12 2019-12-12 Short text classification method based on pyramid pooling and LDA topic model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911276404.2A CN111159335B (en) 2019-12-12 2019-12-12 Short text classification method based on pyramid pooling and LDA topic model

Publications (2)

Publication Number Publication Date
CN111159335A CN111159335A (en) 2020-05-15
CN111159335B (en) 2023-06-23

Family

ID=70556839

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911276404.2A Active CN111159335B (en) 2019-12-12 2019-12-12 Short text classification method based on pyramid pooling and LDA topic model

Country Status (1)

Country Link
CN (1) CN111159335B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779987A (en) * 2021-08-23 2021-12-10 科大国创云网科技有限公司 Event co-reference disambiguation method and system based on self-attention enhanced semantics

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10417301B2 (en) * 2014-09-10 2019-09-17 Adobe Inc. Analytics based on scalable hierarchical categorization of web content
US20190205758A1 (en) * 2016-12-30 2019-07-04 Konica Minolta Laboratory U.S.A., Inc. Gland segmentation with deeply-supervised multi-level deconvolution networks
CN108536870B (en) * 2018-04-26 2022-06-07 南京大学 Text emotion classification method fusing emotional features and semantic features
CN109376226A (en) * 2018-11-08 2019-02-22 合肥工业大学 Complain disaggregated model, construction method, system, classification method and the system of text
CN109739951A (en) * 2018-12-25 2019-05-10 广东工业大学 A kind of text feature based on LDA topic model
CN110378484A (en) * 2019-04-28 2019-10-25 清华大学 A kind of empty spatial convolution pyramid pond context learning method based on attention mechanism

Also Published As

Publication number Publication date
CN111159335A (en) 2020-05-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant