CN111159335A - Short text classification method based on pyramid pooling and LDA topic model - Google Patents

Short text classification method based on pyramid pooling and LDA topic model

Info

Publication number
CN111159335A
Authority
CN
China
Prior art keywords
text
vector
topic
model
pyramid pooling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911276404.2A
Other languages
Chinese (zh)
Other versions
CN111159335B (en)
Inventor
陈雍君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 7 Research Institute
Original Assignee
CETC 7 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 7 Research Institute
Priority to CN201911276404.2A
Publication of CN111159335A
Application granted
Publication of CN111159335B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G06F 16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a short text classification method based on pyramid pooling and an LDA topic model, comprising the following steps: constructing a text vector matrix; converting the variable-length vectors of different texts into a uniform fixed-size vector representation through a pyramid pooling model; extracting a topic probability vector from each text vector with an LDA topic model; concatenating the pooled vector with the topic probability vector of the text, computing the similarity between texts with the cosine similarity formula, and classifying texts against a similarity threshold; and completing the classification of the short texts. The method considers not only the spatial distribution of the words but also their frequency relations, thereby avoiding feature loss and effectively improving the accuracy of short text classification.

Description

Short text classification method based on pyramid pooling and LDA topic model
Technical Field
The invention relates to the technical field of neural networks, in particular to a short text classification method based on pyramid pooling and an LDA topic model.
Background
In journal editing, one task is to distribute manuscript review assignments to experts in different fields according to the topics of the authors' abstracts; that is, short texts need to be classified. Before classification, the similarity between short texts must first be obtained so that the texts can be grouped.
Short texts are very short, and the number of documents is extremely large, so representing them as vectors leads to a high-dimensional sparsity problem. To address this, many researchers classify short texts by combining topic models with word vectors; others take the words of the short text and expand its semantic information with an external lexicon. These methods, however, have several problems:
(1) Neither the associations among the words of a short text nor the context between words is considered. That is, if two texts contain the same words but in a randomly shuffled order, the earlier methods produce the same classification result for both. Chinese words, however, are ambiguous, and different arrangements and combinations of the same words can express different meanings, so the context and associations between words must be considered in short text classification.
(2) The inconsistency in word count and word frequency between different texts is not considered. For example, suppose text A has 15 words, some of which appear multiple times, while text B also has 15 mostly identical words but none appearing twice or more. Traditional short text classification combining a topic model with word vectors yields the same result for both, because LDA (Latent Dirichlet Allocation) fundamentally ignores how often a given word appears in a given short text and determines the topic only from which words appear.
An example is given below:
text A: the network layout technology of 5G is realized by adopting the artificial intelligence field and is combined with the development … … of the related industries of mobile communication
Text B: the development of the artificial intelligence field adopts the artificial intelligence technology and combines the speed of the 5G mobile communication network to carry out related industry layout. Techniques … … through artificial intelligence
In the example above, most of the word vectors of text A and text B are the same: both contain "artificial intelligence", "network", "mobile communication", "industry", "layout" and so on. If short text classification is performed with a topic model and word vectors, the two texts are likely to be judged similar and, combined with the topic probability vectors, classified into the same category. Clearly, however, artificial intelligence is mentioned many times in text B, which is evidently an article in the field of artificial intelligence, whereas text A mentions artificial intelligence only as a means of 5G network layout and is evidently an article in the field of mobile communication. When assigning review tasks, the computer is therefore likely to make a wrong decision and distribute the two articles to experts in the same field.
(3) The number of words in different texts differs, so after text vectors are obtained their dimensions differ (naturally, since the word counts differ). A common remedy is repeated pooling and downsampling to make the text feature representation increasingly abstract, typically by feeding it into a CNN (convolutional neural network) to extract high-level features. The drawback of this conventional approach is that the resolution of the text features drops quickly, and the meaning expressed by some key words may be drowned out.
Disclosure of Invention
The invention provides a short text classification method based on pyramid pooling and an LDA topic model, aiming to solve the low classification accuracy of existing short text classification techniques, which consider neither the contextual and associative relations among words nor the inconsistency in word count and frequency between texts. The method considers both the spatial distribution of words and their frequency relations, avoids feature loss, and effectively improves the accuracy of short text classification.
In order to achieve the above purpose, the technical scheme is as follows: a short text classification method based on pyramid pooling and an LDA topic model, comprising the following steps:
S1: constructing a text vector matrix;
S2: fixing the vectors of different texts into a uniform vector representation through a pyramid pooling model;
S3: using an LDA topic model to extract a topic probability vector from the text vector of step S1;
S4: concatenating the vector produced by the pyramid pooling model in step S2 with the topic probability vector of the text obtained in step S3, computing the similarity between texts with the cosine similarity formula, and performing text classification with a similarity threshold;
S5: completing the classification of the short texts.
Preferably, in step S1, the text vector matrix is constructed by representing the words of the text as vectors arranged in their original order; word segmentation yields n words, giving V(w) = {v1, v2, ..., vn}, where n denotes the number of words in the text and w indexes the words in order. Assuming each word vector has dimension h, arranging the word vectors in order produces a word vector matrix of size (w, h).
Further, in step S2, based on the word vector matrix constructed in step S1, the vectors of different texts are fixed into a unified vector representation by a pyramid pooling model as follows.
For a word vector matrix of arbitrary size, assume its size is (w, h) and it must be converted into a fixed-size h1-dimensional vector; the pyramid pooling model proceeds as follows:
S201: divide the word vector matrix (w, h) into K1 × K1 sub-matrices, each of size (w/K1, h/K1);
S202: divide the word vector matrix (w, h) into K2 × K2 sub-matrices, each of size (w/K2, h/K2);
S203: divide the word vector matrix (w, h) into K3 × K3 sub-matrices, each of size (w/K3, h/K3);
S204: fuse the sub-matrices obtained in steps S201, S202 and S203 into an h1-dimensional representation, where K1, K2 and K3 are mutually distinct positive integers and h1 = K1 × K1 + K2 × K2 + K3 × K3.
In step S3, after obtaining the text vector matrix extracted in step S1, the text is trained with an LDA topic model to obtain a text topic vector; the topic vector reveals the meaning implied by the text, so the deep features of the text can be extracted directly. The topic probability vector obtained for any text is Z = {p1, p2, …, pn}, where n equals h1.
In step S4, specifically, the vector produced by the pyramid pooling model in step S2 is concatenated with the topic probability vector of the text obtained in step S3; the cosine similarity between the text to be classified and each short text under every topic is then computed, the similarity scores are ranked, and the topic of the text with the highest similarity is selected as the final topic, completing the classification of the short texts.
The invention has the following beneficial effects:
1. The method solves the problem of inconsistent lengths of text input vectors; the pyramid pooling model effectively extracts both global and local features and prevents feature loss.
2. The invention considers not only word-level semantics but also text-level semantic information, which it captures by constructing a text vector matrix; the pyramid pooling model preserves text-level semantic information to the greatest extent while unifying the vector size. The topic probability vector extracted by the LDA topic model mines the implicit meaning of the text at a deep level, so the deep features of the text can be extracted directly.
Drawings
Fig. 1 is a flowchart of the steps of the short text classification method described in embodiment 1.
Fig. 2 is a schematic diagram of embodiment 1 in which each word vector has 100 dimensions.
Fig. 3 is a schematic process diagram of the pyramid pooling model of embodiment 1.
Detailed Description
The invention is described in detail below with reference to the drawings and the detailed description.
Example 1
As shown in Fig. 1, a short text classification method based on pyramid pooling and an LDA topic model comprises the following steps:
s1: constructing a text vector matrix;
the dimensions of the word vectors for each text are not the same because the number of words in each text is different.
In order to classify texts, the traditional technical means is to obtain text vectors by using an averaging method, but the method easily loses the characteristics of texts, and the method does not consider the frequency of text words, namely the contribution degree of the words.
In order to solve the problem of feature loss caused by an averaging method on a word vector matrix, words of a text are arranged in sequence by adopting vector representation in a similar picture pixel mode. For example, with the development of artificial intelligence, the intelligent manufacturing has qualitative changes in the links of design, production, management and the like, and the design period is shortened by combining the means of artificial intelligence; the production cost is reduced and the production efficiency is improved by adopting a man-machine cooperation mode; the operation cost is greatly reduced by means of real-time monitoring and detection in multiple production links, the qualification rate of products is improved, and the production efficiency of the manufacturing industry is improved.
Word segmentation yields n words v1, v2, …, vn, i.e., V(w) = {v1, v2, ..., vn}, where n denotes the number of words in the text and w indexes the words in order.
The word vectors are then arranged in order; assuming each word vector has 100 dimensions, the final word vector matrix of the short text can be represented as shown in Fig. 2.
Since the word vectors are arranged in order, a word vector matrix of size (w, h) is obtained, where w is the number of word vectors and h is the dimensionality of each word vector; h is 100 in this embodiment, but in practice it should be set according to the actual situation.
S2: based on the word vector matrix constructed in step S1, the following pyramid pooling model is used to fix the vectors of different texts into a unified vector representation:
for an arbitrarily sized word vector matrix, assume its size is (w, h); if the vector needs to be converted into a h 1-dimensional vector with fixed size; namely, inputting the following data at the input layer of the pyramid pooling model: a word vector matrix of arbitrary size, assuming size (w, h); in the present embodiment, in the output layer: 21 neurons, namely, a word vector matrix with any size is input, and 21 submatrices are extracted.
As shown in Fig. 3, the feature extraction of the pyramid pooling model proceeds as follows:
S201: divide the word vector matrix (w, h) into K1 × K1 sub-matrices; in this embodiment K1 = 4, so each sub-matrix has size (w/4, h/4);
S202: divide the word vector matrix (w, h) into K2 × K2 sub-matrices; in this embodiment K2 = 2, so each sub-matrix has size (w/2, h/2);
S203: divide the word vector matrix (w, h) into K3 × K3 sub-matrices; in this embodiment K3 = 1, i.e., the whole word vector matrix (w, h) is taken as a single sub-matrix of size (w, h);
S204: fuse the sub-matrices obtained in steps S201, S202 and S203 to obtain h1 = 4 × 4 + 2 × 2 + 1 × 1 = 21 features. The features need not be fixed at 21 dimensions: designers can set K1, K2 and K3 according to their requirements to obtain features of other dimensions, as in the sketch below.
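As a concrete illustration of steps S201 to S204, below is a minimal Python sketch of the pyramid pooling, assuming each sub-matrix is reduced to a single value by max pooling, as in standard spatial pyramid pooling; the patent does not state the pooling operator, so max pooling is an assumption.

import numpy as np

def pyramid_pool(matrix, levels=(4, 2, 1)):
    # Fix a (w, h) matrix into a 4*4 + 2*2 + 1*1 = 21-dimensional vector.
    w, h = matrix.shape
    features = []
    for k in levels:
        # Split rows and columns into k roughly equal bins
        # (assumes w >= k; shorter texts would need padding).
        row_edges = np.linspace(0, w, k + 1, dtype=int)
        col_edges = np.linspace(0, h, k + 1, dtype=int)
        for i in range(k):
            for j in range(k):
                cell = matrix[row_edges[i]:row_edges[i + 1],
                              col_edges[j]:col_edges[j + 1]]
                features.append(cell.max())  # assumed pooling operator
    return np.array(features)  # shape (21,) for levels (4, 2, 1)

Whatever the input size (w, h), the output length depends only on K1, K2 and K3, which is what allows texts of different lengths to share one fixed-size representation.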
S3: using an LDA topic model to extract a topic probability vector from the text vector of step S1; specifically:
After obtaining the text vector matrix extracted in step S1, the text is trained with an LDA topic model to obtain a text topic vector; the topic vector reveals the meaning implied by the text, so the deep features of the text can be extracted directly. The topic probability vector obtained for any text is Z = {p1, p2, …, pn}, where n equals h1; n is 21 in this embodiment.
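A minimal sketch of step S3 using the gensim library (an implementation choice assumed here; the patent does not name one), training a 21-topic LDA model on a tokenized corpus and reading out the full topic probability vector of one document:

from gensim import corpora
from gensim.models import LdaModel

def topic_vector(tokenized_docs, doc_index, num_topics=21):
    # Train LDA and return Z = {p1, ..., pn} for one text, n = num_topics.
    dictionary = corpora.Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = LdaModel(corpus, num_topics=num_topics, id2word=dictionary)
    # minimum_probability=0.0 keeps all topics, even near-zero ones,
    # so the vector always has exactly num_topics entries.
    dist = lda.get_document_topics(corpus[doc_index], minimum_probability=0.0)
    return [p for _, p in sorted(dist)]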
S4: concatenating the vector produced by the pyramid pooling model in step S2 with the topic probability vector of the text obtained in step S3, computing the similarity between texts with the cosine similarity formula, and performing text classification with a similarity threshold; specifically:
By vector concatenation, the vector obtained through the pyramid pooling model in step S2 is joined with the topic probability vector of the text obtained in step S3; the similarity between the text to be classified and each short text under every topic is then computed by the cosine similarity method, with the following formula:
sim(a, b) = (a · b) / (|a| × |b|)
where a represents the vector produced by the pyramid pooling model and b represents the topic probability vector of the text obtained in step S3.
The similarity scores are ranked, and the topic of the text with the highest similarity is selected as the final topic.
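Putting step S4 together, here is a minimal sketch that concatenates the two representations and classifies by the highest cosine similarity; candidates is a hypothetical list of (topic_label, reference_vector) pairs built from already-classified short texts.

import numpy as np

def cosine_sim(a, b):
    # sim(a, b) = (a . b) / (|a| * |b|)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify(pooled, topic_vec, candidates):
    query = np.concatenate([pooled, topic_vec])  # step S4: vector splicing
    scores = [(label, cosine_sim(query, ref)) for label, ref in candidates]
    scores.sort(key=lambda s: s[1], reverse=True)  # rank similarity scores
    return scores[0][0]  # topic of the text with the highest similarity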
S5: the classification of the short texts is complete.
The short text classification method described in the above embodiment has the following effects:
1. The pyramid pooling model can handle short texts of different lengths, and pyramid pooling takes contextual relations within the text into account, so the contextual relations of short texts, and the associations between words, are reflected to a certain extent.
2. The word vector matrix is constructed according to the order of the words; it considers not only the spatial distribution of the words but also their frequency relations, avoiding the problem of feature loss.
3. This embodiment combines a pyramid pooling model with an LDA topic model to provide a universal, highly extensible short text classification method. The feature sparsity problem caused by text length need not be considered, and the accuracy of text classification can be improved to a great extent.
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (5)

1. A short text classification method based on pyramid pooling and an LDA topic model, characterized in that the method comprises the following steps:
S1: constructing a text vector matrix;
S2: fixing the vectors of different texts into a uniform vector representation through a pyramid pooling model;
S3: using an LDA topic model to extract a topic probability vector from the text vector of step S1;
S4: concatenating the vector produced by the pyramid pooling model in step S2 with the topic probability vector of the text obtained in step S3, computing the similarity between texts with the cosine similarity formula, and performing text classification with a similarity threshold;
S5: completing the classification of the short texts.
2. The pyramid pooling and LDA topic model-based short text classification method of claim 1, wherein: in step S1, the text vector matrix is constructed by representing the words of the text as vectors arranged in their original order; word segmentation yields n words, giving V(w) = {v1, v2, ..., vn}, where n denotes the number of words in the text and w indexes the words in order; and, assuming each word vector has dimension h, arranging the word vectors in order produces a word vector matrix of size (w, h).
3. The pyramid pooling and LDA topic model-based short text classification method of claim 2, wherein: in step S2, based on the word vector matrix constructed in step S1, the vectors of different texts are fixed into a unified vector representation by a pyramid pooling model:
for a word vector matrix of arbitrary size, assume its size is (w, h) and it must be converted into a fixed-size h1-dimensional vector; the pyramid pooling model proceeds as follows:
S201: divide the word vector matrix (w, h) into K1 × K1 sub-matrices, each of size (w/K1, h/K1);
S202: divide the word vector matrix (w, h) into K2 × K2 sub-matrices, each of size (w/K2, h/K2);
S203: divide the word vector matrix (w, h) into K3 × K3 sub-matrices, each of size (w/K3, h/K3);
S204: fuse the sub-matrices obtained in steps S201, S202 and S203 into an h1-dimensional representation, where K1, K2 and K3 are mutually distinct positive integers and h1 = K1 × K1 + K2 × K2 + K3 × K3.
4. The pyramid pooling and LDA topic model-based short text classification method of claim 3, wherein: in step S3, after obtaining the text vector matrix extracted in step S1, the text is trained with an LDA topic model to obtain a text topic vector; the topic vector reveals the meaning implied by the text, so that the deep features of the text are extracted directly; the topic probability vector obtained for any text is Z = {p1, p2, …, pn}, where n equals h1.
5. The pyramid pooling and LDA topic model-based short text classification method of claim 4, wherein: in step S4, the vector produced by the pyramid pooling model in step S2 is concatenated with the topic probability vector of the text obtained in step S3; the cosine similarity between the text to be classified and each short text under every topic is then computed, the similarity scores are ranked, and the topic of the text with the highest similarity is selected as the final topic, completing the classification of the short texts.
CN201911276404.2A (priority 2019-12-12, filed 2019-12-12): Short text classification method based on pyramid pooling and LDA topic model. Status: Active. Granted as CN111159335B.

Priority Applications (1)

Application number: CN201911276404.2A (granted as CN111159335B); Priority date: 2019-12-12; Filing date: 2019-12-12; Title: Short text classification method based on pyramid pooling and LDA topic model

Applications Claiming Priority (1)

Application number: CN201911276404.2A (granted as CN111159335B); Priority date: 2019-12-12; Filing date: 2019-12-12; Title: Short text classification method based on pyramid pooling and LDA topic model

Publications (2)

Publication number and publication date:
CN111159335A, 2020-05-15
CN111159335B, 2023-06-23

Family

ID=70556839

Family Applications (1)

Application number: CN201911276404.2A (Active, granted as CN111159335B); Priority date: 2019-12-12; Filing date: 2019-12-12; Title: Short text classification method based on pyramid pooling and LDA topic model

Country Status (1)

CN: CN111159335B

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779987A (en) * 2021-08-23 2021-12-10 科大国创云网科技有限公司 Event co-reference disambiguation method and system based on self-attention enhanced semantics

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160070731A1 (en) * 2014-09-10 2016-03-10 Adobe Systems Incorporated Analytics based on scalable hierarchical categorization of web content
WO2018125580A1 (en) * 2016-12-30 2018-07-05 Konica Minolta Laboratory U.S.A., Inc. Gland segmentation with deeply-supervised multi-level deconvolution networks
CN108536870A (en) * 2018-04-26 2018-09-14 南京大学 A kind of text sentiment classification method of fusion affective characteristics and semantic feature
CN109376226A (en) * 2018-11-08 2019-02-22 合肥工业大学 Complain disaggregated model, construction method, system, classification method and the system of text
CN109739951A (en) * 2018-12-25 2019-05-10 广东工业大学 A kind of text feature based on LDA topic model
CN110378484A (en) * 2019-04-28 2019-10-25 清华大学 A kind of empty spatial convolution pyramid pond context learning method based on attention mechanism

Also Published As

CN111159335B, granted 2023-06-23

Similar Documents

Publication Publication Date Title
CN110866117B (en) Short text classification method based on semantic enhancement and multi-level label embedding
CN110825845B (en) Hierarchical text classification method based on character and self-attention mechanism and Chinese text classification method
CN107705066B (en) Information input method and electronic equipment during commodity warehousing
CN112966522A (en) Image classification method and device, electronic equipment and storage medium
CN109284406B (en) Intention identification method based on difference cyclic neural network
CN107015963A (en) Natural language semantic parsing system and method based on deep neural network
CN112487190B (en) Method for extracting relationships between entities from text based on self-supervision and clustering technology
CN102289522A (en) Method of intelligently classifying texts
CN104199857A (en) Tax document hierarchical classification method based on multi-tag classification
CN109034248B (en) Deep learning-based classification method for noise-containing label images
CN112417153B (en) Text classification method, apparatus, terminal device and readable storage medium
CN108154156B (en) Image set classification method and device based on neural topic model
CN110019653B (en) Social content representation method and system fusing text and tag network
CN113515632A (en) Text classification method based on graph path knowledge extraction
CN115982403A (en) Multi-mode hash retrieval method and device
CN113901289A (en) Unsupervised learning-based recommendation method and system
CN113590827B (en) Scientific research project text classification device and method based on multiple angles
CN111159335A (en) Short text classification method based on pyramid pooling and LDA topic model
CN104123336A (en) Deep Boltzmann machine model and short text subject classification system and method
CN109543038A (en) A kind of sentiment analysis method applied to text data
He et al. Classification of metro facilities with deep neural networks
Sun et al. A hybrid approach to news recommendation based on knowledge graph and long short-term user preferences
CN105354264B (en) A kind of quick adding method of theme label based on local sensitivity Hash
CN112257448A (en) Multitask named entity identification method, system, medium and terminal
CN117150148A (en) Social network public opinion situation monitoring method based on pre-training model

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant