CN111159335B - Short text classification method based on pyramid pooling and LDA topic model - Google Patents

Short text classification method based on pyramid pooling and LDA topic model

Info

Publication number
CN111159335B
Authority
CN
China
Prior art keywords
text
vector
topic
vectors
model
Prior art date
Legal status
Active
Application number
CN201911276404.2A
Other languages
Chinese (zh)
Other versions
CN111159335A (en)
Inventor
陈雍君
Current Assignee
CETC 7 Research Institute
Original Assignee
CETC 7 Research Institute
Priority date
Filing date
Publication date
Application filed by CETC 7 Research Institute
Priority to CN201911276404.2A
Publication of CN111159335A
Application granted
Publication of CN111159335B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/35: Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a short text classification method based on pyramid pooling and an LDA topic model, which comprises the following steps: constructing a text vector matrix; fixing the vectors of different texts into a uniform vector representation through a pyramid pooling model; extracting text topic probability vectors from the text vectors with an LDA topic model to obtain the topic probability vector of each text; concatenating the pyramid-pooled vector with the obtained topic probability vector of the text, computing the similarity between texts with the cosine similarity formula, and performing text classification by combining with a similarity threshold; and completing the classification of the short texts. The method considers both the spatial distribution and the frequency of words, avoids the problem of feature loss, and effectively improves the accuracy of short text classification.

Description

Short text classification method based on pyramid pooling and LDA topic model
Technical Field
The invention relates to the technical field of neural networks, in particular to a short text classification method based on pyramid pooling and an LDA topic model.
Background
In journal editing, manuscripts must be distributed for review to experts in different fields according to the subject of the author's abstract; that is, short texts need to be classified. Before short texts can be classified, a method is first needed to measure the similarity between them.
It is well known that short texts are very short and extremely numerous, so representing them as vectors raises the problem of high-dimensional sparsity. To address this problem, many researchers have combined topic models with word vectors for short text classification; others have used external lexicons to expand the semantic information carried by the words of a short text. However, these methods have several problems:
(1) Neither the associations between the words of a short text nor the context between words is considered. That is, if two texts contain the same words but in a random arrangement, the methods of earlier researchers produce the same classification result for both. However, Chinese words are ambiguous, and different arrangements and combinations of the same words may express different meanings, so the contextual and associative relationships between the words of a text do need to be considered when classifying short texts.
(2) The differing word counts and word frequencies of different texts are not considered. For example, text A has 15 words, some of which appear several times; text B also has 15 words, mostly the same ones, but none of them appears twice or more. If the traditional combination of a topic model and word vectors is used to classify these short texts, the classification results are the same, because LDA (Latent Dirichlet Allocation) does not consider at all how often a word occurs within a short text and defines the topic only from the words themselves.
An example is given below:
Text A: the 5G network layout technology is realized by adopting the field of artificial intelligence, combined with the development of industries related to mobile communication ……
Text B: the development of the field of artificial intelligence adopts artificial intelligence technology and, combined with the speed of 5G mobile communication networks, carries out the layout of related industries. Through artificial intelligence techniques ……
In the example above, the word vectors of text A and text B are mostly the same; both contain artificial intelligence, network, mobile communication, industry, layout and so on. If a topic model and word vectors are used to classify these short texts, their word vectors are likely to be similar, and combined with the topic probability vectors the two texts are likely to be placed in the same class. Clearly, however, artificial intelligence techniques are mentioned many times in text B, which is therefore an article in the field of artificial intelligence, while text A merely uses artificial intelligence to lay out a 5G network, i.e., it is an article in the field of mobile communication. Thus, in the manuscript-review task the computer is likely to make an erroneous decision and assign both articles to experts in the same field.
(3) After text vectors are obtained, different texts have different numbers of words, so the dimensions of their text vectors differ (since the word counts differ, the vector dimensions naturally differ). Successive pooling and downsampling are then usually applied to make the text feature representations more abstract; the common practice is to feed them into a CNN (convolutional neural network) to extract high-level features. One drawback of doing so is that the resolution of the text features drops quickly, so the meaning expressed by some key words is likely to be drowned out.
Disclosure of Invention
The invention aims to solve the problems that current short text classification techniques consider neither the contextual and associative relationships between the words of a text nor the differing word counts and word frequencies of different texts, which leads to low classification accuracy. The invention therefore provides a short text classification method based on pyramid pooling and an LDA topic model, which considers both the spatial distribution and the frequency of words, avoids the problem of feature loss, and effectively improves the accuracy of short text classification.
In order to achieve the above purpose, the following technical scheme is adopted: a short text classification method based on pyramid pooling and an LDA topic model, comprising the following steps:
S1: constructing a text vector matrix;
S2: fixing the vectors of different texts into a uniform vector representation through a pyramid pooling model;
S3: extracting text topic probability vectors from the text vectors of step S1 using an LDA topic model, to obtain the topic probability vector of the text;
S4: concatenating the vector obtained through the pyramid pooling model in step S2 with the topic probability vector of the text obtained in step S3, computing the similarity between texts with the cosine similarity formula, and performing text classification by combining with a similarity threshold;
S5: completing the classification of the short texts.
Preferably, in step S1, the text vector matrix is constructed by representing the words of the text as vectors and arranging them in order. Word vector division yields n words, V(w) = {v1, v2, ..., vn}, where n represents the number of words in the text and w is the number of words arranged in word order.
Since the word vectors are arranged in order, and assuming each word vector has h dimensions, a word vector matrix of size (w, h) is obtained.
Further, in step S2, based on the word vector matrix constructed in step S1, the following pyramid pooling model is used to fix the vectors of different texts into a unified vector representation:
For a word vector matrix of arbitrary size, assume its size is (w, h); the matrix needs to be converted into a fixed-size h1-dimensional vector; the specific processing of the pyramid pooling model is as follows:
S201: divide the word vector matrix (w, h) of arbitrary size into K1 × K1 submatrices, i.e., each submatrix has size (w/K1, h/K1);
S202: divide the word vector matrix (w, h) of arbitrary size into K2 × K2 submatrices, where each submatrix has size (w/K2, h/K2);
S203: divide the word vector matrix (w, h) into K3 × K3 submatrices, where each submatrix has size (w/K3, h/K3);
S204: fuse the submatrices obtained in steps S201, S202 and S203 to obtain h1 values, where K1, K2 and K3 are distinct positive integers and h1 = K1×K1 + K2×K2 + K3×K3.
Further, in step S3, after the text vector matrix has been extracted in step S1, the text is trained with an LDA topic model to obtain the text topic vector; the topic vector reveals the implied meaning of the text, and in this way the deep features of the text are extracted directly. The topic probability vector of any text obtained in this step is z = {p1, p2, ..., pn}, where n equals h1.
Further, step S4 specifically comprises: concatenating the vector obtained through the pyramid pooling model in step S2 with the topic probability vector of the text obtained in step S3, computing the similarity between the text to be classified and any short text under any topic with the cosine similarity method, ranking the similarity values, and selecting the topic of the text with the maximum similarity as the final topic, thereby completing the classification of the short text.
The beneficial effects of the invention are as follows:
1. The method can solve the problem of inconsistent lengths of text input vectors; the pyramid pooling model not only effectively extracts global and local features, but also prevents the problem of feature loss.
2. The invention considers semantics at both the word level and the text level. Text-level semantic information is handled by constructing the text vector matrix, and the pyramid pooling model preserves text-level semantic information to the maximum extent while unifying the vector size; the topic probability vector of the text extracted by the LDA topic model can mine the implied meaning of the text in depth and extract its deep features directly.
Drawings
FIG. 1 is a flow chart of the steps of the short text classification method described in example 1.
FIG. 2 is a schematic diagram of embodiment 1 with 100 dimensions for each word vector.
FIG. 3 is a schematic illustration of the process of the pyramid pooling model of example 1.
Detailed Description
The invention is described in detail below with reference to the drawings and the detailed description.
Example 1
As shown in fig. 1, a short text classification method based on pyramid pooling and LDA topic model, the short text classification method comprises the following steps:
s1: constructing a text vector matrix;
Because the number of words differs from text to text, the dimension of the word vector matrix also differs from text to text.
To classify texts, the traditional approach is to obtain a text vector by averaging, but this easily loses the features of the text and ignores the frequency of the words, i.e., the contribution of each word.
To avoid the feature loss that averaging causes on the word vector matrix, the words of a text are represented as vectors and arranged in order, in a manner analogous to picture pixels. For example: with the development of artificial intelligence, intelligent manufacturing has changed in its design, production, management and other links; combining artificial intelligence means shortens the design cycle; adopting human-machine cooperation reduces production cost and improves production efficiency; and real-time monitoring and inspection across multiple production links greatly reduces operating cost, improves the product qualification rate, and raises the production efficiency of the manufacturing industry.
Word vector division yields n words v1, v2, ..., vn, expressed as V(w) = {v1, v2, ..., vn}, where n represents the number of words in the text and w is the number of words arranged in word order.
Assuming each word vector has 100 dimensions and the vectors are ordered in sequence, the word vector matrix of the short text can then be represented as shown in FIG. 2.
Since the word vectors are arranged in order, a word vector matrix of size (w, h) is obtained, where w is the number of words arranged in word order and h is the dimension of each word vector itself; this embodiment sets h = 100, although in practice h should be set according to the actual situation.
S2: based on the word vector matrix constructed in step S1, the pyramid pooling model is adopted to fix the vectors of different texts into a uniform vector representation:
For a word vector matrix of arbitrary size, assume its size is (w, h); the matrix needs to be converted into a fixed-size h1-dimensional vector. That is, the input layer of the pyramid pooling model receives a word vector matrix of arbitrary size (w, h), and in this embodiment the output layer is set to 21 neurons: whatever the size of the input word vector matrix, 21 submatrix features are extracted.
As shown in fig. 3, the feature extraction of the pyramid pooling model is specifically as follows:
S201: divide the word vector matrix (w, h) of arbitrary size into K1 × K1 submatrices; in this embodiment K1 = 4, i.e., each submatrix has size (w/4, h/4);
S202: divide the word vector matrix (w, h) of arbitrary size into K2 × K2 submatrices; in this embodiment K2 = 2, so each submatrix has size (w/2, h/2);
S203: divide the word vector matrix (w, h) into K3 × K3 submatrices; in this embodiment K3 = 1, i.e., the word vector matrix (w, h) itself is taken as one submatrix of size (w, h);
S204: fuse the submatrices obtained in steps S201, S202 and S203 to obtain h1 features, where h1 = 4×4 + 2×2 + 1×1 = 21. The 21-dimensional feature is not fixed; the designer can set K1, K2 and K3 as required to obtain features of different dimensions.
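The pooling in steps S201 to S204 can be sketched as follows. The patent states only that the submatrices are "fused"; this sketch assumes max pooling within each grid cell, as in standard spatial pyramid pooling, so each K × K level contributes K × K values and the levels (4, 2, 1) yield the 21-dimensional output of this embodiment. It also assumes w and h are at least as large as the largest K.

```python
import numpy as np

def pyramid_pool(matrix, levels=(4, 2, 1)):
    """Pool a (w, h) word vector matrix into a fixed-length vector.

    With levels (4, 2, 1) the output has 4*4 + 2*2 + 1*1 = 21 dimensions,
    matching the 21 output neurons of this embodiment.
    """
    pooled = []
    for k in levels:
        # np.array_split tolerates w or h not being exact multiples of k
        for row_block in np.array_split(matrix, k, axis=0):
            for cell in np.array_split(row_block, k, axis=1):
                pooled.append(cell.max())  # assumed fusion: max over each cell
    return np.asarray(pooled)              # length h1 = sum(k * k for k in levels)
```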
S3: extract text topic probability vectors from the text vectors of step S1 using an LDA topic model to obtain the topic probability vector of the text. Specifically:
After the text vector matrix has been extracted in step S1, the text is trained with the LDA topic model to obtain the text topic vector; the topic vector reveals the implied meaning of the text, and in this way the deep features of the text are extracted directly. The topic probability vector of any text obtained in this step is z = {p1, p2, ..., pn}, where n equals h1; this embodiment takes n = 21.
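A minimal sketch of step S3, using gensim's LdaModel as one possible LDA implementation (the patent does not prescribe a library); num_topics is set to 21 so that the topic probability vector z has the same length h1 as the pooled vector, as this embodiment requires.

```python
from gensim import corpora, models

def topic_probability_vectors(texts, n_topics=21):
    """texts: a list of token lists; returns one topic probability vector per text."""
    dictionary = corpora.Dictionary(texts)
    bows = [dictionary.doc2bow(t) for t in texts]
    lda = models.LdaModel(corpus=bows, id2word=dictionary, num_topics=n_topics)
    # minimum_probability=0.0 keeps every topic, so each vector has length n_topics
    return [[p for _, p in lda.get_document_topics(b, minimum_probability=0.0)]
            for b in bows]
```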
S4: concatenate the vector obtained through the pyramid pooling model in step S2 with the topic probability vector of the text obtained in step S3, compute the similarity between texts with the cosine similarity formula, and perform text classification by combining with a similarity threshold. Specifically:
The vector obtained through the pyramid pooling model in step S2 is concatenated with the topic probability vector of the text obtained in step S3, and the similarity between the text to be classified and any short text under any topic is then computed with the cosine similarity method, the specific formula being:
$$\operatorname{sim}(a, b) = \frac{a \cdot b}{\|a\|\,\|b\|}$$
where a denotes the vector output by the pyramid pooling model and b denotes the topic probability vector of the text obtained in step S3.
By ranking the similarity values, the topic of the text with the maximum similarity is selected as the final topic for classification.
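A minimal sketch of step S4 under the same assumptions: the pooled vector and the topic probability vector are concatenated, and a text is assigned the topic of the reference text most similar to it under cosine similarity. All names are illustrative.

```python
import numpy as np

def fuse(pooled_vec, topic_vec):
    """Concatenate the pyramid-pooled vector with the topic probability vector."""
    return np.concatenate([pooled_vec, topic_vec])  # 21 + 21 = 42 dimensions here

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify(query_vec, reference_vecs, reference_topics):
    """Assign the topic of the reference text with the highest similarity."""
    sims = [cosine_similarity(query_vec, r) for r in reference_vecs]
    return reference_topics[int(np.argmax(sims))]
```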
S5: the classification of the short texts is completed.
The short text classification method described by the above embodiment has the following effects:
1. This embodiment adopts the pyramid pooling model and can therefore handle short texts of different lengths; because pyramid pooling already takes the relationships within the text into account, it reflects to a certain extent the contextual relationships of a short text and the associations between its words.
2. A word vector matrix is constructed according to the order of the words; this matrix considers both the spatial distribution and the frequency of the words, avoiding the problem of feature loss.
3. This embodiment fuses the pyramid pooling model with the LDA topic model to provide a universal and highly extensible short text classification method. The method removes the need to consider text length, and thus the feature sparsity it causes, and can improve the accuracy of text classification to a great extent.
It is to be understood that the above examples of the present invention are provided by way of illustration only and not as a limitation of its embodiments. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention is intended to be protected by the following claims.

Claims (3)

1. A short text classification method based on pyramid pooling and an LDA topic model, characterized in that the short text classification method comprises the following steps:
S1: constructing a text vector matrix;
S2: fixing the vectors of different texts into a uniform vector representation through a pyramid pooling model;
S3: extracting text topic probability vectors from the text vectors of step S1 using an LDA topic model, to obtain the topic probability vector of the text;
S4: concatenating the vector obtained through the pyramid pooling model in step S2 with the topic probability vector of the text obtained in step S3, computing the similarity between texts with the cosine similarity formula, and performing text classification by combining with a similarity threshold;
S5: completing the classification of the short texts;
wherein step S1 constructs the text vector matrix by representing the words of the text as vectors arranged in order; word vector division yields n words,
V(w) = {v1, v2, ..., vn},
where n represents the number of words in the text and w is the number of words arranged in word order;
and, assuming each word vector has h dimensions, since the word vectors are arranged in order, a word vector matrix of size (w, h) is obtained;
and step S2, based on the word vector matrix constructed in step S1, fixes the vectors of different texts into a uniform vector representation by adopting a pyramid pooling model:
for a word vector matrix of arbitrary size, assume its size is (w, h); the matrix needs to be converted into a fixed-size h1-dimensional vector; the specific processing of the pyramid pooling model is as follows:
S201: divide the word vector matrix (w, h) of arbitrary size into K1 × K1 submatrices, i.e., each submatrix has size (w/K1, h/K1);
S202: divide the word vector matrix (w, h) of arbitrary size into K2 × K2 submatrices, where each submatrix has size (w/K2, h/K2);
S203: divide the word vector matrix (w, h) into K3 × K3 submatrices, where each submatrix has size (w/K3, h/K3);
S204: fuse the submatrices obtained in steps S201, S202 and S203 to obtain h1 values, where K1, K2 and K3 are distinct positive integers and h1 = K1×K1 + K2×K2 + K3×K3.
2. The short text classification method based on pyramid pooling and an LDA topic model according to claim 1, characterized in that: in step S3, after the text vector matrix has been extracted in step S1, the text is trained with an LDA topic model to obtain the text topic vector; the topic vector reveals the implied meaning of the text, and in this way the deep features of the text are extracted directly; the topic probability vector of any text obtained in this step is z = {p1, p2, ..., pn}, where n equals h1.
3. The short text classification method based on pyramid pooling and an LDA topic model according to claim 2, characterized in that: step S4 specifically comprises concatenating the vector obtained through the pyramid pooling model in step S2 with the topic probability vector of the text obtained in step S3, computing the similarity between the text to be classified and any short text under any topic with the cosine similarity method, ranking the similarity values, and selecting the topic of the text with the maximum similarity as the final topic for division, thereby completing the classification of the short texts.
CN201911276404.2A 2019-12-12 2019-12-12 Short text classification method based on pyramid pooling and LDA topic model Active CN111159335B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911276404.2A CN111159335B (en) 2019-12-12 2019-12-12 Short text classification method based on pyramid pooling and LDA topic model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911276404.2A CN111159335B (en) 2019-12-12 2019-12-12 Short text classification method based on pyramid pooling and LDA topic model

Publications (2)

Publication Number Publication Date
CN111159335A CN111159335A (en) 2020-05-15
CN111159335B (en) 2023-06-23

Family

ID=70556839

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911276404.2A Active CN111159335B (en) 2019-12-12 2019-12-12 Short text classification method based on pyramid pooling and LDA topic model

Country Status (1)

Country Link
CN (1) CN111159335B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779987A (en) * 2021-08-23 2021-12-10 科大国创云网科技有限公司 Event co-reference disambiguation method and system based on self-attention enhanced semantics

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10417301B2 (en) * 2014-09-10 2019-09-17 Adobe Inc. Analytics based on scalable hierarchical categorization of web content
US20190205758A1 (en) * 2016-12-30 2019-07-04 Konica Minolta Laboratory U.S.A., Inc. Gland segmentation with deeply-supervised multi-level deconvolution networks
CN108536870B (en) * 2018-04-26 2022-06-07 南京大学 Text emotion classification method fusing emotional features and semantic features
CN109376226A (en) * 2018-11-08 2019-02-22 合肥工业大学 Complain disaggregated model, construction method, system, classification method and the system of text
CN109739951A (en) * 2018-12-25 2019-05-10 广东工业大学 A kind of text feature based on LDA topic model
CN110378484A (en) * 2019-04-28 2019-10-25 清华大学 A kind of empty spatial convolution pyramid pond context learning method based on attention mechanism

Also Published As

Publication number Publication date
CN111159335A (en) 2020-05-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant