CN112395413B - Text analysis method based on multiple deep topic models - Google Patents

Text analysis method based on multiple deep topic models

Info

Publication number
CN112395413B
Authority
CN
China
Prior art keywords
training
layer
text
sample set
multiple deep
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910750551.2A
Other languages
Chinese (zh)
Other versions
CN112395413A (en)
Inventor
陈渤
陈文超
赵倩茹
刘应祺
刘宏伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN201910750551.2A priority Critical patent/CN112395413B/en
Publication of CN112395413A publication Critical patent/CN112395413A/en
Application granted granted Critical
Publication of CN112395413B publication Critical patent/CN112395413B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a text analysis method based on a multiple deep topic model, which comprises: constructing a training sample set and a test sample set of text data; constructing a multiple deep topic model according to the training sample set and initializing initial model parameters of the multiple deep topic model; training the multiple deep topic model according to the training sample set to obtain training model parameters, and updating the initial model parameters according to the training model parameters to obtain a trained multiple deep topic model; testing the trained multiple deep topic model according to the test sample set to obtain a plurality of test hidden layer features; performing visual analysis on the training model parameters according to the hidden layer features to obtain a plurality of text topics; and classifying the text data according to the text topics, the training sample set, the test hidden layer features and the tested multiple deep topic model. The invention reflects the characteristics of the text data comprehensively, so that the text topics have better separability and the text analysis capability is high.

Description

Text analysis method based on multiple deep topic models
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a text analysis method based on a multiple deep topic model.
Background
With the rapid development of the mobile internet and information technology, the big data era has arrived. The massive and diverse data on networks require efficient processing and analysis methods. Text data in particular often carries huge amounts of information, and the demand of governments, businesses and individuals for intelligent text analysis keeps growing, driving further development of natural language processing technology. As a text mining method, the topic model can effectively extract text features and discover latent semantic topics in text data, and is widely applied to text analysis tasks in machine learning and data mining, such as text clustering, hot-spot mining, sentiment analysis, information retrieval and recommendation systems. Existing topic models are mainly based on the classical Latent Dirichlet Allocation (LDA) model, and various topic models have been proposed by extending it for particular application fields and data characteristics. Meanwhile, the Gibbs sampling method is widely used for parameter learning and variable inference in topic models.
The existing methods have the following shortcomings. The LDA topic model cannot extract deep semantic feature topics and is ill-suited to hierarchical text analysis; existing deep topic models can extract deep features, but the extracted high-level topics have poor diversity, the expressive power of high-level semantic features is limited, hierarchical feature extraction suffers, and downstream tasks such as text classification perform poorly. Moreover, training a deep topic model with the traditional Gibbs sampling method is computationally expensive and converges slowly, while existing faster Gibbs sampling variants are unsuitable for big-data scenarios that require online training and are hard to parallelize, which limits their practicality.
A text depth feature extraction method based on a variational auto-encoding model is disclosed in a patent document filed by Xidian University (application No. 201810758180.8, publication No. CN109145288A). The method constructs a variational auto-encoding inference model that extracts deep topic keywords, uses input documents as training and test data, and extracts two layers of topic keywords as the text depth feature extraction result. Its drawback is that, although deep text features can be extracted, the extracted topic keywords become more similar and less diverse as the number of layers increases and lack good separability, which degrades subsequent text analysis.
An LDA topic model optimized sampling method is disclosed in a patent document filed by Nanjing University (application No. 201810493178.2, publication No. CN108763207A). By decomposing the Gibbs sampling formula, constructing an Alias Table and using a cumulative distribution, the method realizes constructing once and sampling many times, which improves the convergence rate of LDA topic model training. However, the method needs to load the text data all at once for sampling to learn the topic model parameters; when the data volume is large, parallel training is difficult because of the limits of computer hardware, so the method is unsuitable for big-data scenarios and of limited practicality.
Disclosure of Invention
In order to solve the above problems in the prior art, the invention provides a text analysis method based on a multiple deep topic model. The technical problem to be solved by the invention is addressed by the following technical scheme:
a text analysis method based on a multiple deep topic model comprises the following steps:
constructing a training sample set and a test sample set of text data;
constructing a multiple deep topic model according to the training sample set, and initializing initial model parameters of the multiple deep topic model;
training the multiple deep topic model according to the training sample set to obtain training model parameters, and updating the initial model parameters according to the training model parameters to obtain a trained multiple deep topic model;
testing the trained multiple deep topic model according to the test sample set to obtain a plurality of test hidden layer features;
performing visual analysis on the text data according to the test hidden layer characteristics to obtain a plurality of text topics;
and classifying the text data according to the text topics, the training sample set, the test hidden layer features and the tested multiple deep topic model.
In one embodiment of the present invention, constructing a multiple deep topic model according to the training sample set and initializing the initial model parameters of the multiple deep topic model, wherein the initial model parameters are hidden layer features, comprises:
obtaining the hidden layer features of the multiple deep topic model, wherein the hidden layer features comprise diverse hidden variables and shared hidden variables;
initializing the diverse hidden variables and the shared hidden variables.
In one embodiment of the present invention, training the multiple deep topic model according to the training sample set to obtain training model parameters, and updating the initial model parameters according to the training model parameters to obtain a trained multiple deep topic model, wherein the training model parameters are training hidden layer features, comprises:
dividing the training sample set into a plurality of training data sets;
analyzing the training data sets several times in the multiple deep topic model to obtain a plurality of training model parameters;
and updating the initial model parameters according to the training model parameters to obtain a trained multiple deep theme model.
In one embodiment of the present invention, classifying the text data according to the text topics, the training sample set, the test hidden layer features and the tested multiple deep topic model comprises:
training a support vector machine classifier to be trained according to the training sample set, the text topics and the test hidden layer features;
classifying the test sample set with the trained support vector machine classifier to obtain predicted text class labels;
and comparing the class labels of the test sample set with the predicted text class labels to obtain the text classification accuracy of the test sample set and complete the classification.
The invention has the beneficial effects that:
By extracting deep features of the text data and performing text analysis at every layer, the invention extracts more sufficient hidden feature information, reflects the characteristics of the text data more comprehensively, and improves the performance of tasks such as text classification. By using a shared hidden variable and diverse hidden variables to represent, respectively, the common topic features and the diverse topic features of the text, topics with better diversity can be extracted, which overcomes the high similarity and poor diversity of the topic keywords extracted by existing deep topic models, gives the extracted text topics good separability, and improves subsequent text analysis. The invention adopts a stochastic gradient Markov chain Monte Carlo (SG-MCMC) sampling method, so that the model is suitable for big-data scenarios and can be trained in parallel faster, improving its practicality.
The present invention will be described in further detail with reference to the accompanying drawings and examples.
Drawings
FIG. 1 is a block diagram of a text analysis method based on a multiple deep topic model according to an embodiment of the present invention;
FIG. 2 is a simulation diagram of the first-layer visualized topics of text data in the text analysis method based on a multiple deep topic model according to an embodiment of the present invention;
FIG. 3 is a simulation diagram of the third-layer visualized topics of text data in the text analysis method based on a multiple deep topic model according to an embodiment of the present invention;
fig. 4 is a simulation diagram verifying the separability of hierarchical features in the text analysis method based on a multiple deep topic model according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples, but embodiments of the present invention are not limited thereto.
Referring to fig. 1, fig. 1 is a block diagram of a text analysis method based on a multiple deep topic model according to an embodiment of the present invention, including:
constructing a training sample set and a test sample set of text data;
constructing a multiple deep topic model according to the training sample set, and initializing initial model parameters of the multiple deep topic model;
training the multiple deep topic model according to the training sample set to obtain training model parameters, and updating the initial model parameters according to the training model parameters to obtain a trained multiple deep topic model;
testing the trained multiple deep topic model according to the test sample set to obtain a plurality of test hidden layer features;
performing visual analysis on the text data according to the test hidden layer characteristics to obtain a plurality of text topics;
and classifying the text data according to the text topics, the training sample set, the test hidden layer features and the tested multiple deep topic model.
Further, when constructing the training sample set and the test sample set of text data, N documents are selected as input documents and preprocessed to obtain the bag-of-words vectors corresponding to the N input documents; the bag-of-words vectors of the N documents are then divided into two parts, with 70% of the bag-of-words vectors forming the training sample set and the remaining ones forming the test sample set.
The preprocessing operation includes:
Step 1: count the total number of words appearing in the N documents, denoted M:

M = M1 + M2 + M3 + M4,

where M1 is the number of nouns, M2 the number of verbs, M3 the number of adjectives and M4 the number of words of other parts of speech among all the words.
Step 2: for the M words obtained in step 1, merge the singular and plural forms of the M1 nouns to obtain M̃1 nouns, merge the different tenses of the M2 verbs to obtain M̃2 verbs, keep the M3 adjectives, and delete the M4 words of other parts of speech, thereby obtaining M̃ = M̃1 + M̃2 + M3 words as the dictionary.

Merging the singular and plural forms of the M1 nouns proceeds as follows: keep all singular nouns among the M1 nouns and convert all remaining plural nouns into the corresponding singular nouns, giving M1' singular nouns; then keep one copy of every singular noun that appears repeatedly among the M1' singular nouns and delete the rest, and keep all singular nouns that appear only once, thereby obtaining the M̃1 singular nouns.

Merging the different tenses of the M2 verbs proceeds as follows: keep all verbs already in the simple present tense among the M2 verbs and convert all remaining verbs into the corresponding simple present tense, giving M2' simple-present verbs; then keep one copy of every simple-present verb that appears repeatedly among the M2' verbs and delete the rest, and keep all simple-present verbs that appear only once, thereby obtaining the M̃2 simple-present verbs.
Step 3, in the statistical dictionaryThe number of times that each word appears in N documents respectively, and a bag-of-words vector X corresponding to N input documents is obtained:
wherein X is n For the bag of words vector corresponding to the nth document,respectively 1 st, 2 nd, … th and +.>Number of occurrences of the individual word in the nth document.
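To make the preprocessing concrete, the following Python sketch builds the dictionary and bag-of-words matrix along the lines of steps 1 to 3. It assumes the NLTK library (pos_tag, word_tokenize and WordNetLemmatizer, plus their data packages) and uses WordNet lemmatization as a stand-in for the singular/plural and tense merging; the function name build_bow and all implementation details are illustrative only and not part of the patented method.

import numpy as np
from nltk import pos_tag, word_tokenize
from nltk.stem import WordNetLemmatizer

def build_bow(documents):
    # documents: list of N raw text strings. Returns (dictionary, X) with X of
    # shape (dictionary size, N), one bag-of-words column per document.
    lemmatizer = WordNetLemmatizer()
    processed = []
    for doc in documents:
        kept = []
        for word, tag in pos_tag(word_tokenize(doc.lower())):
            if tag.startswith('NN'):                      # nouns: merge singular/plural forms
                kept.append(lemmatizer.lemmatize(word, pos='n'))
            elif tag.startswith('VB'):                    # verbs: merge tenses into the base form
                kept.append(lemmatizer.lemmatize(word, pos='v'))
            elif tag.startswith('JJ'):                    # adjectives: keep unchanged
                kept.append(word)
            # words of other parts of speech are discarded
        processed.append(kept)
    dictionary = sorted({w for doc in processed for w in doc})
    index = {w: i for i, w in enumerate(dictionary)}
    X = np.zeros((len(dictionary), len(processed)), dtype=np.int64)
    for n, doc in enumerate(processed):
        for w in doc:
            X[index[w], n] += 1                           # occurrence count of word w in document n
    return dictionary, X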
Further, when constructing the multiple deep topic model from the training sample set, the total number of hidden layers of the multiple deep topic model is set to L, and the hidden layer feature dimension of the l-th layer is set to K_l (l = 1, ..., L). Each layer l has a diverse hidden variable parameter matrix (denoted Θ^(l) below) and a shared hidden variable parameter matrix θ_π.
The diverse hidden variable parameter matrix of the l-th layer is determined according to

Θ^(l) ~ Gam(S^(l), A^(l)),

where Θ^(l) is the l-th layer diverse hidden variable parameter matrix of dimension K_l × J, ~ denotes distributional equivalence, Gam(·) is the gamma distribution, S^(l) is the shape parameter of the l-th layer diverse hidden variable gamma distribution, and A^(l) is the amplitude parameter of the l-th layer diverse hidden variable gamma distribution.
The shared hidden variable parameter matrix of each layer is determined according to

θ_π^(l) ~ Gam(a_π^(l), b_π^(l)),

where θ_π^(l) is the shared hidden variable parameter matrix of the l-th layer, a_π^(l) is the shape parameter of the l-th layer shared hidden variable gamma distribution, and b_π^(l) is the amplitude parameter of the l-th layer shared hidden variable gamma distribution.
The shape parameter S^(l) of the l-th layer diverse hidden variable gamma distribution is constructed from the (l+1)-th layer quantities, where μ^(l+1) is an adjustment constant factor used to balance the (l+1)-th layer semantic topics and the common-term topics, Φ^(l+1) is the diverse global variable matrix used to characterize the (l+1)-th layer semantic topics, π^(l+1) is the shared global variable matrix used to characterize the (l+1)-th layer common-term topics, and r is the shape parameter of the diverse hidden variable gamma distribution of the top layer, i.e. layer L.
The input data x of the multiple deep topic model is set to follow a Poisson distribution, where Pois(·) denotes the Poisson distribution.
According to the invention, through extracting the deep features of the text data and carrying out text analysis from each layer, the extracted hidden feature information is more sufficient, the characteristics of the text data can be reflected more comprehensively, and the performance of tasks such as text classification and the like is improved; according to the invention, the shared hidden variable and the various hidden variables are respectively used for representing the public theme characteristics and the various theme characteristics corresponding to the text, so that the theme with better diversity can be extracted, the problems of higher similarity and poorer diversity of the theme keywords extracted by the existing deep theme model are solved, the extracted text theme has good separability, and the subsequent text analysis capability is improved; the invention adopts a random gradient Markov Monte Carlo sampling method, so that the model can be suitable for big data scenes, the model can be trained in parallel faster, and the practicability of the model is improved.
In one embodiment of the present invention, constructing a multiple deep topic model according to the training sample set and initializing the initial model parameters of the multiple deep topic model, wherein the initial model parameters are hidden layer features, comprises the following steps:
obtaining the hidden layer features of the multiple deep topic model, wherein the hidden layer features comprise diverse hidden variables and shared hidden variables;
initializing the diverse hidden variables and the shared hidden variables.
Further, when initializing the initial model parameters of the multiple deep topic model, the number of hidden layers of the multiple deep topic model is set to 3, the dimension K_1 of the first-layer hidden features is set to 256, the dimension K_2 of the second-layer hidden features to 128, and the dimension K_3 of the third-layer hidden features to 64.
1. Initializing the diverse hidden variable parameter matrices of each layer.
First, initialize the shape parameters S^(l) (l = 1, 2, 3) of the diverse hidden variable gamma distribution of each hidden layer:
(1) Set the adjustment constant factor used to balance the semantic topics and the common-term topics: μ^(l) = 0.5, l = 1, 2, 3;
(2) Initialize the diverse global variable matrix Φ^(l) and the shared global variable matrix π^(l) of each hidden layer, l = 1, 2, 3, where the 1st, ..., K_l-th column vectors of the l-th layer diverse global variable matrix each have dimension K_{l-1} × 1 and are drawn from the Dirichlet distribution Dir(·);
(3) Initialize the shape parameter r of the diverse hidden variable gamma distribution of the top layer (i.e. layer 3), where the parameters γ_0 and c_0 are initialized from gamma distributions:

γ_0 ~ Gam(0.01, 100), c_0 ~ Gam(1, 1);
Next, initialize the amplitude parameters A^(l) (l = 1, 2, 3) of the diverse hidden variable gamma distribution of each layer:
(1) Initialize the amplitude parameter A^(1) of the first-layer diverse hidden variable gamma distribution as

A^(1) = p/(1 − p),

where the parameter p is initialized from a beta distribution: p ~ Beta(0.01, 0.01);
(2) Initialize the amplitude parameters of the diverse hidden variable gamma distributions of the second and third layers:

1/A^(2), 1/A^(3) ~ Gam(1, 1).
2. Initializing the shared hidden variable parameter matrix θ_π of each layer.
For the first, second and third layers, initialize the shape parameter a_π and the amplitude parameter b_π of the shared hidden variable gamma distribution.
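A minimal NumPy sketch of this initialization for the 3-layer setting (K_1, K_2, K_3) = (256, 128, 64) is given below. The vocabulary size V, the form of the top-layer shape parameter r built from γ_0 and c_0, the initialization of the shared global matrices π^(l), and the starting values of a_π and b_π are assumptions, since the corresponding formulas are not reproduced in this text.

import numpy as np

rng = np.random.default_rng(0)
V = 2000                               # vocabulary size (example value)
K = [256, 128, 64]                     # hidden-layer dimensions K_1, K_2, K_3
dims = [V] + K

mu = {l: 0.5 for l in (1, 2, 3)}       # adjustment constant factors mu(l)

# Diverse global matrices Phi(l): each of the K_l columns is a Dirichlet draw of
# dimension K_{l-1}; the shared global matrices pi(l) are initialized the same way (assumed).
Phi = {l: rng.dirichlet(np.ones(dims[l - 1]), size=dims[l]).T for l in (1, 2, 3)}
Pi = {l: rng.dirichlet(np.ones(dims[l - 1]), size=dims[l]).T for l in (1, 2, 3)}

# Top-layer shape parameter r built from gamma-distributed hyper-parameters (assumed form).
gamma0 = rng.gamma(0.01, 100.0)
c0 = rng.gamma(1.0, 1.0)
r = rng.gamma(gamma0 / K[2], 1.0 / c0, size=K[2])

# Amplitude parameters A(l): layer 1 via p/(1-p) with p ~ Beta(0.01, 0.01),
# layers 2 and 3 via 1/A ~ Gam(1, 1).
p = rng.beta(0.01, 0.01)
A = {1: p / (1.0 - p), 2: 1.0 / rng.gamma(1.0, 1.0), 3: 1.0 / rng.gamma(1.0, 1.0)}

# Shared hidden variable gamma parameters (starting values assumed).
a_pi, b_pi = 1.0, 1.0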
in one embodiment of the present invention, training the multiple deep topic model according to the training sample set to obtain training model parameters, and updating the initial model parameters according to the training model parameters to obtain a trained multiple deep topic model, where the training model parameters are training hidden features, the steps include:
dividing the training sample set into a plurality of training data sets;
analyzing the training data sets several times in the multiple deep topic model to obtain a plurality of training model parameters;
and updating the initial model parameters according to the training model parameters to obtain a trained multiple deep theme model.
Further, mini-batch data sets for training the multiple deep topic model are obtained from the training text set: the training data set is randomly shuffled and divided into mini-batches of 200 samples each; if fewer than 200 samples remain, the remaining samples are not formed into a mini-batch.
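A minimal Python sketch of this mini-batch construction, together with the scanning schedule described in the next paragraph (8000 scans with 40 training cycles per mini-batch), is given below; the function names are illustrative only.

import numpy as np

def make_minibatches(X_train, batch_size=200, seed=0):
    # X_train: bag-of-words matrix with one column per training document.
    rng = np.random.default_rng(seed)
    order = rng.permutation(X_train.shape[1])
    n_full = X_train.shape[1] // batch_size            # a final incomplete batch is discarded
    return [X_train[:, order[b * batch_size:(b + 1) * batch_size]]
            for b in range(n_full)]

def scan_minibatches(minibatches, total_scans=8000, inner_iters=40):
    # Yield (scan index, mini-batch) pairs; each yielded pair drives one update cycle.
    for scan in range(total_scans):
        batch = minibatches[scan % len(minibatches)]
        for _ in range(inner_iters):
            yield scan, batch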
The total number of scans over the mini-batches is set to 8000, the mini-batches are scanned one by one, and the number of training cycles on each mini-batch is set to 40. The specific steps of each training cycle on a mini-batch are as follows:
1. obtaining local parameters in the model;
(1) Obtain the two-dimensional inter-layer augmented diversity matrix and the two-dimensional inter-layer augmented shared matrix of the l-th layer by sampling from a multinomial distribution,
where w and g are matrix indices, w = 1, 2, ..., K_{l−1}, K_0 = V with V the dimension of each input sample, and g = 1, 2, ..., J; Mult(·) is the multinomial distribution; x^(l) is a two-dimensional inter-layer augmentation matrix, equal to the input data x of the multiple deep topic model when l = 1 and to the two-dimensional inter-layer augmentation matrix between the (l−1)-th and l-th layers when l > 1; x^(l)_wg is the element of x^(l) at row w, column g; Φ^(l)_w: is the w-th row of the matrix Φ^(l); the remaining factors are the g-th column of the corresponding hidden variable matrix and the g-th element of the corresponding vector;
(2) Obtain the vector at row w, column g of the l-th layer three-dimensional augmentation matrix T^(l),
where w, g, h are matrix indices, w and g are as in (1) and h = 1, 2, ..., K_l; x^(l)(1) is the two-dimensional inter-layer augmented diversity matrix between the (l−1)-th and l-th layers; x^(l)(1)_wg is its element at row w, column g; Φ^(l)_wh is the element of the matrix Φ^(l) at row w, column h; and the remaining factor is the element at row h, column g of the corresponding hidden variable matrix;
(3) Obtain the vector at row w, column g of the two-dimensional inter-layer augmentation matrix x^(l+1) between the l-th and (l+1)-th layers (a sampling sketch for the CRT distribution is given after this list of steps),
where CRT(·) is the Chinese Restaurant Table (CRT) distribution of the Chinese restaurant process, x^(l+1)_wg is the element of x^(l+1) at row w, column g, Φ^(l)_{w,k_{l+1}} is the element of the matrix Φ^(l) at row w, column k_{l+1}, the next factor is the element at row k_{l+1}, column g of the corresponding hidden variable matrix, and the last symbol is the g-th element of the corresponding vector;
(4) Update c_0 as follows, where y = 1, ..., K_2 and ~ denotes distributional equivalence;
(5) Update γ_0 as follows, where w = 1, ..., K_2 and R = 1, ..., J;
(6) Update the corresponding local parameter as follows, where g = 1, ..., N;
(7) Update the corresponding local parameter as follows, where g = 1, ..., N;
(8) Obtain the corresponding quantity as follows;
(9) Obtain the shared hidden variable of the l-th layer by the following sampling;
(10) Obtain the diverse hidden variables of the l-th layer by the following sampling.
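Step (3) above draws counts from the Chinese Restaurant Table (CRT) distribution. The following sketch samples from the CRT distribution under its standard definition, in which CRT(n, r) is the number of occupied tables after n customers are seated in a Chinese restaurant process with concentration parameter r; it is given only to illustrate the distribution, not the patent's full update.

import numpy as np

def sample_crt(n, r, rng=None):
    # Draw l ~ CRT(n, r) as a sum of independent Bernoulli variables:
    # the i-th customer opens a new table with probability r / (r + i - 1).
    if rng is None:
        rng = np.random.default_rng()
    n = int(n)
    if n <= 0:
        return 0
    new_table_probs = r / (r + np.arange(n))
    return int((rng.random(n) < new_table_probs).sum())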
2. Training layer by layer from bottom to top, and updating global parameters in the model;
Each hidden layer of the multiple deep topic model is trained in the order of hidden layers 1, 2 and 3, i.e. from the bottom layer to the top layer; when training the l-th layer (l = 1, 2, 3), the diverse global variable matrix Φ^(l) and the shared global variable matrix π^(l) of this layer, i.e. the global parameters of the layer, are first updated from the 200 samples of the mini-batch. The specific parameter-update steps are as follows:
(1) Obtain the l-th layer one-dimensional diverse expectation matrix M^(l) as follows,
where t indexes the t-th mini-batch, the element indices satisfy J = 1, ..., K_l, w = 1, ..., K_{l−1} and j = 1, ..., N, E denotes expectation, and ρ is the ratio of the number of samples in the complete data set to the number of samples in the mini-batch;
(2) Obtain the l-th layer one-dimensional shared expectation matrix as follows;
(3) Perform a gradient update on the l-th layer diverse global parameters Φ^(l) as follows,
where d = 1, ..., K_l, the referenced column is the d-th column of the matrix Φ^(l) saved after training on the t-th mini-batch, N(·) is the normal distribution, diag(·) denotes a diagonal matrix, j = 1, ..., N, w = 1, ..., K_{l−1}, and T denotes matrix transposition;
(4) Perform a gradient update on the l-th layer shared global parameters as follows;
(5) Compute the intermediate vectors as follows;
(6) Obtain the top-layer one-dimensional diverse expectation matrix M^(4) as follows, where j = 1, ..., N;
(7) Perform a gradient update on the top-layer global parameter r to obtain r_{t+1}, where j = 1, ..., N and x^(4)_j is the j-th column vector of the matrix x^(4).
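The gradient updates in steps (3), (4) and (7) are stochastic-gradient MCMC steps in which mini-batch statistics are rescaled by ρ, the ratio of the full data-set size to the mini-batch size. Since the exact update formulas are not reproduced in this text, the sketch below shows only a generic stochastic gradient Langevin dynamics (SGLD) step with this mini-batch scaling, as a stand-in for the patent's specific update rule.

import numpy as np

def sgld_step(theta, grad_log_prior, grad_log_lik_minibatch, rho, step_size, rng=None):
    # One SGLD update for a global parameter theta (e.g. one column of Phi(l)):
    # the mini-batch likelihood gradient is scaled by rho to approximate the full-data gradient.
    if rng is None:
        rng = np.random.default_rng()
    noisy_grad = grad_log_prior(theta) + rho * grad_log_lik_minibatch(theta)
    noise = rng.normal(0.0, np.sqrt(step_size), size=np.shape(theta))
    return theta + 0.5 * step_size * noisy_grad + noise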
Further, the trained multiple deep topic model is tested on the test sample set to obtain a plurality of test hidden layer features.
The local parameters of the network are updated layer by layer from the top layer to the bottom layer, and the specific updating steps are as follows:
(1) Obtain the two-dimensional inter-layer augmented diversity matrix and the two-dimensional inter-layer augmented shared matrix of the l-th layer by sampling from a multinomial distribution,
where w and g are matrix indices, w = 1, 2, ..., K_{l−1}, K_0 = V with V the dimension of each input sample, and g = 1, 2, ..., J; Mult(·) is the multinomial distribution; x^(l) is a two-dimensional inter-layer augmentation matrix, equal to the input data x of the multiple deep topic model when l = 1 and to the two-dimensional inter-layer augmentation matrix between the (l−1)-th and l-th layers when l > 1; x^(l)_wg is the element of x^(l) at row w, column g; Φ^(l)_w: is the w-th row of the matrix Φ^(l); the remaining factors are the g-th column of the corresponding hidden variable matrix and the g-th element of the corresponding vector.
(2) Obtain the vector at row w, column g of the l-th layer three-dimensional augmentation matrix T^(l),
where w, g, h are matrix indices, w and g are as in (1) and h = 1, 2, ..., K_l; x^(l)(1) is the two-dimensional inter-layer augmented diversity matrix between the (l−1)-th and l-th layers; x^(l)(1)_wg is its element at row w, column g; Φ^(l)_wh is the element of the matrix Φ^(l) at row w, column h; and the remaining factor is the element at row h, column g of the corresponding hidden variable matrix;
(3) Obtain the vector at row w, column g of the two-dimensional inter-layer augmentation matrix x^(l+1) between the l-th and (l+1)-th layers,
where CRT(·) is the Chinese Restaurant Table (CRT) distribution of the Chinese restaurant process, x^(l+1)_wg is the element of x^(l+1) at row w, column g, Φ^(l)_{w,k_{l+1}} is the element of the matrix Φ^(l) at row w, column k_{l+1}, the next factor is the element at row k_{l+1}, column g of the corresponding hidden variable matrix, and the last symbol is the g-th element of the corresponding vector;
(4) Update c_0 as follows, where y = 1, ..., K_2 and ~ denotes distributional equivalence;
(5) Update γ_0 as follows, where w = 1, ..., K_2 and R = 1, ..., J;
(6) Update the corresponding local parameter as follows, where g = 1, ..., N;
(7) Update the corresponding local parameter as follows, where g = 1, ..., N;
(8) Obtain the corresponding quantity as follows;
(9) Obtain the shared hidden variable of the l-th layer by the following sampling;
(10) Obtain the diverse hidden variables of the l-th layer by the following sampling.
It is judged whether the number of test cycles has reached the preset K = 200: if so, the test stage is complete, the parameters obtained from the cyclic updates are stored for training the classifier in the text classification step, and the procedure proceeds to step 7; otherwise, the parameters updated in this test cycle are saved as the initial values of the next cycle and the procedure returns to step 5.
Further, carrying out visual analysis on the training model parameters according to the hidden layer characteristics;
the model topic visualization flow is given below:
The first step: determine the first-layer topics. Map the diverse global variable matrix obtained by the first-layer learning to the word space by means of bag-of-words (BOW) coding; each column vector of the diverse global variables corresponds to one topic; for each topic, sort the corresponding words in the word space by the magnitude of their vector values and display the words of each topic in that order, which forms the first-layer topics.
And a second step of: the second-layer and third-layer topics of the model are obtained by projecting them with the corresponding formula; the resulting second-layer and third-layer topics have the same vertical dimension as the first-layer diverse global variable topics Φ^(1), and they are visualized in the same manner as in the first step.
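A minimal Python sketch of the topic visualization is given below: each column of a global variable matrix that has been mapped to the word space is treated as one topic, and its top words are the dictionary entries with the largest values in that column. How the second- and third-layer topics are projected to the word space follows the projection just mentioned and is left to the caller.

import numpy as np

def top_words(topic_matrix_word_space, dictionary, n_words=10):
    # topic_matrix_word_space: V x K matrix whose columns are topics expressed in the word space.
    topics = []
    for k in range(topic_matrix_word_space.shape[1]):
        order = np.argsort(topic_matrix_word_space[:, k])[::-1][:n_words]
        topics.append([dictionary[i] for i in order])   # top n_words words of topic k
    return topics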
In one embodiment of the present invention, classifying the text data according to the plurality of text topics, the training sample set, the test hidden layer features and the tested multiple deep topic model comprises:
training a support vector machine classifier to be trained according to the training sample set, the text topics and the test hidden layer features;
classifying the test sample set with the trained support vector machine classifier to obtain predicted text class labels;
and comparing the class labels of the test sample set with the predicted text class labels to obtain the text classification accuracy of the test sample set and complete the classification.
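A minimal sketch of this classification step, assuming scikit-learn's SVC is used as the support vector machine: the classifier is trained on hidden-layer features of the training documents and evaluated on the test-set features against the true class labels. The choice of a linear kernel is an assumption made for illustration.

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def classify_with_svm(train_features, train_labels, test_features, test_labels):
    # Features: one row per document (hidden-layer features extracted by the model).
    clf = SVC(kernel='linear')
    clf.fit(train_features, train_labels)
    predicted = clf.predict(test_features)               # predicted text class labels
    return accuracy_score(test_labels, predicted)        # text classification accuracy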
The invention is further described in connection with simulation experiments.
1. Simulation conditions:
The database used in the simulation experiments of the invention contains news data of 20 different categories, comprising 18845 documents in total with a corresponding vocabulary size of 61188. The training sample set contains 11315 news documents and the test sample set contains 7530 news documents.
2. Simulation content and result analysis:
Simulation experiment 1 verifies that the method can extract hierarchical topics from the text data and that the high-level topics are diverse. Simulation experiment 2 verifies that the hidden layer features of the data obtained by the method have better separability.
Simulation experiment 1 extracts hierarchical features of the news data set described in the simulation conditions using the method of the invention and the prior-art latent Dirichlet allocation model. The diverse topic features of the first-layer and third-layer topics obtained by the two methods are visualized in Figs. 2 and 3, respectively; Figs. 2 and 3 each show the i-th group of topics of the corresponding layer.
As can be seen from Figs. 2 and 3, the method of the invention obtains more diverse topics with richer semantic information.
Simulation experiment 2 extracts deep features of the news data set described in the simulation conditions using the method of the invention, classifies the resulting hidden layer features with an SVM, obtains the classification accuracy of the model under different dimensions, and compares it with the prior-art latent Dirichlet allocation model (LDA) and deep latent Dirichlet allocation model (DLDA). The latent Dirichlet allocation model is a single-layer model whose hidden layer dimension is identical to the first hidden layer dimension of the deep latent Dirichlet allocation model and of the deep diverse latent Dirichlet allocation model. As can be seen from Fig. 4, the method of the invention obtains hidden layer features with better separability.
The foregoing is a further detailed description of the invention in connection with the preferred embodiments, and it is not intended that the invention be limited to the specific embodiments described. It will be apparent to those skilled in the art that several simple deductions or substitutions may be made without departing from the spirit of the invention, and these should be considered to be within the scope of the invention.

Claims (4)

1. A text analysis method based on a multiple deep topic model, comprising:
constructing a training sample set and a test sample set of text data;
constructing a multiple deep topic model according to the training sample set, and initializing initial model parameters of the multiple deep topic model;
training the multiple deep topic model according to the training sample set to obtain training model parameters, and updating the initial model parameters according to the training model parameters to obtain a trained multiple deep topic model;
testing the trained multiple deep topic models according to the test sample set to obtain a plurality of test hidden layer characteristics;
performing visual analysis on the text data according to the test hidden layer characteristics to obtain a plurality of text topics;
classifying the text data according to the text topics, the training sample set, the test hidden layer features and the tested multiple deep topic model;
the training of the multiple deep topic model according to the training sample set to obtain training model parameters, and the updating of the initial model parameters according to the training model parameters to obtain a trained multiple deep topic model, comprise the following steps:
randomly shuffling the training data set and dividing it into mini-batch data sets of 200 samples each; cyclically training the multiple deep topic model with the mini-batch data sets to obtain the training model parameters;
the specific steps in each training cycle on each mini-batch data set are as follows:
1) acquiring local parameters in the multiple deep topic model;
(1) Obtaining the two-dimensional inter-layer augmented diversity matrix and the two-dimensional inter-layer augmented shared matrix of the l-th layer by sampling from a multinomial distribution,
wherein w and g are matrix indices, w = 1, 2, ..., K_{l−1}, K_0 = V with V the dimension of each input sample, and g = 1, 2, ..., J; Mult(·) is the multinomial distribution; x^(l) is a two-dimensional inter-layer augmentation matrix, equal to the input data x of the multiple deep topic model when l = 1 and to the two-dimensional inter-layer augmentation matrix between the (l−1)-th and l-th layers when l > 1; x^(l)_wg is the element of x^(l) at row w, column g; Φ^(l)_w: is the w-th row of the matrix Φ^(l); the remaining factors are the g-th column of the corresponding hidden variable matrix and the g-th element of the corresponding vector;
(2) Obtaining the vector at row w, column g of the l-th layer three-dimensional augmentation matrix T^(l),
wherein w, g, h are matrix indices, h = 1, 2, ..., K_l; x^(l)(1) is the two-dimensional inter-layer augmented diversity matrix between the (l−1)-th and l-th layers; x^(l)(1)_wg is its element at row w, column g; Φ^(l)_wh is the element of the matrix Φ^(l) at row w, column h; and the remaining factor is the element at row h, column g of the corresponding hidden variable matrix;
(3) Obtaining the vector at row w, column g of the two-dimensional inter-layer augmentation matrix x^(l+1) between the l-th and (l+1)-th layers,
wherein CRT(·) is the Chinese Restaurant Table (CRT) distribution of the Chinese restaurant process, x^(l+1)_wg is the element of x^(l+1) at row w, column g, Φ^(l)_{w,k_{l+1}} is the element of the matrix Φ^(l) at row w, column k_{l+1}, the next factor is the element at row k_{l+1}, column g of the corresponding hidden variable matrix, and the last symbol is the g-th element of the corresponding vector;
(4) Updating c_0 as follows, wherein y = 1, ..., K_2 and ~ denotes distributional equivalence;
(5) Updating γ_0 as follows, wherein w = 1, ..., K_2 and R = 1, ..., J;
(6) Updating the corresponding local parameter as follows, wherein g = 1, ..., N;
(7) Updating the corresponding local parameter as follows, wherein g = 1, ..., N;
(8) Obtaining the corresponding quantity as follows;
(9) Obtaining the shared hidden variable of the l-th layer by the following sampling;
(10) Obtaining the diverse hidden variables of the l-th layer by the following sampling;
2) Training layer by layer from bottom to top, and updating global parameters in the model;
the text theme visualization flow is as follows:
the first step: determining a first layer theme;
mapping the diverse global variable matrix obtained by the first-layer learning to the word space by means of bag-of-words (BOW) coding, wherein each column vector of the diverse global variables corresponds to one topic; sorting the words corresponding to each topic in the word space by the magnitude of their vector values and displaying the words of each topic in that order, thereby forming the first-layer topics;
and a second step of: the second and third layer topics were obtained by means of the following formula:
wherein the results are the second-layer and third-layer topics respectively, and their vertical dimension is the same as that of the first-layer diverse global variable topics Φ^(1).
2. The text analysis method based on a multiple deep topic model of claim 1, wherein constructing a multiple deep topic model from the training sample set and initializing initial model parameters of the multiple deep topic model, wherein the initial model parameters are hidden layer features, comprises:
obtaining the hidden layer features of the multiple deep topic model, wherein the hidden layer features comprise diverse hidden variables and shared hidden variables;
initializing a plurality of the multiple hidden variables and the shared hidden variable.
3. The text analysis method based on a multiple deep topic model according to claim 1, wherein training multiple deep topic models according to the training sample set to obtain training model parameters, and updating the initial model parameters according to the training model parameters to obtain trained multiple deep topic models, wherein the training model parameters are training hidden layer features, comprising:
dividing the training sample set into a plurality of training data sets;
analyzing the training data sets several times in the multiple deep topic model to obtain a plurality of training model parameters;
and updating the initial model parameters according to the training model parameters to obtain a trained multiple deep theme model.
4. The method of claim 1, wherein classifying the text data according to the plurality of text topics, the training sample set, the test hidden layer features, and the tested multi-deep topic model comprises:
training a support vector machine to be trained according to the training sample set, the text topics and the test hidden layer characteristics;
classifying the test sample set according to the trained support vector machine to obtain a predicted text class label;
and comparing the class labels of the test sample set with the predicted text class labels to obtain the text classification accuracy of the text data and complete the classification.
CN201910750551.2A 2019-08-14 2019-08-14 Text analysis method based on multiple deep topic models Active CN112395413B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910750551.2A CN112395413B (en) 2019-08-14 2019-08-14 Text analysis method based on multiple deep topic models

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910750551.2A CN112395413B (en) 2019-08-14 2019-08-14 Text analysis method based on multiple deep topic models

Publications (2)

Publication Number Publication Date
CN112395413A CN112395413A (en) 2021-02-23
CN112395413B true CN112395413B (en) 2023-12-15

Family

ID=74601468

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910750551.2A Active CN112395413B (en) 2019-08-14 2019-08-14 Text analysis method based on multiple deep topic models

Country Status (1)

Country Link
CN (1) CN112395413B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2487636A1 (en) * 2011-02-09 2012-08-15 Vertical Axis, Inc. System and method for topic based sentiment search and sharing across a network
CN106446117A (en) * 2016-09-18 2017-02-22 西安电子科技大学 Text analysis method based on poisson-gamma belief network
CN106599128A (en) * 2016-12-02 2017-04-26 西安电子科技大学 Deep theme model-based large-scale text classification method
CN106814942A (en) * 2015-11-27 2017-06-09 北京奇虎科技有限公司 A kind of methods, devices and systems for realizing self-defined theme
CN107609055A (en) * 2017-08-25 2018-01-19 西安电子科技大学 Text image multi-modal retrieval method based on deep layer topic model

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2487636A1 (en) * 2011-02-09 2012-08-15 Vertical Axis, Inc. System and method for topic based sentiment search and sharing across a network
CN106814942A (en) * 2015-11-27 2017-06-09 北京奇虎科技有限公司 A kind of methods, devices and systems for realizing self-defined theme
CN106446117A (en) * 2016-09-18 2017-02-22 西安电子科技大学 Text analysis method based on poisson-gamma belief network
CN106599128A (en) * 2016-12-02 2017-04-26 西安电子科技大学 Deep theme model-based large-scale text classification method
CN107609055A (en) * 2017-08-25 2018-01-19 西安电子科技大学 Text image multi-modal retrieval method based on deep layer topic model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Yulai Cong et al., "Deep Latent Dirichlet Allocation with Topic-Layer-Adaptive Stochastic Gradient Riemannian MCMC", arXiv:1706.01724v1, 6 June 2017, pp. 1-19 *
Sun Xiao, Gao Fei, Ren Fuji, "Mining the influence of social news on user sentiment based on deep models", Journal of Chinese Information Processing, No. 3, full text *

Also Published As

Publication number Publication date
CN112395413A (en) 2021-02-23


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant