CN109408641A - Text classification method and system based on a supervised topic model - Google Patents
- Publication number: CN109408641A (application CN201811398232.1A)
- Authority
- CN
- China
- Legal status: Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
Abstract
Present disclose provides a kind of based on the file classification method and system that have supervision topic model.Wherein, a kind of based on the file classification method for having supervision topic model, comprising: building SLDA-TC textual classification model;It during training SLDA-TC textual classification model, is sampled according to implicit theme of the SLDA-TC-Gibbs algorithm to each word, and only carries out implicit theme sampling from other training texts identical with text categories label where the word;After determining the implicit theme of each word, by counting the frequency, text-theme probability distribution, theme-Word probability distribution and theme-class probability distribution is calculated;Establish the accurate mapping between theme and classification;The SLDA-TC textual classification model that text input to be measured is generated to training is inferred to the theme of text to be measured, and then predicts the classification of text.
Description
Technical field
The present disclosure relates to the field of data classification, and in particular to a text classification method and system based on a supervised topic model.
Background technique
The statements in this section merely provide background information related to the present disclosure and do not necessarily constitute prior art.
Text representation is a key step in text mining, and the most widely used representation at present is the bag-of-words model (BOW). BOW treats a text as a set of words, assumes that each word occurs independently of the others, and ignores word order, syntax and similar information. Under BOW a text is represented as an n-dimensional vector in which each dimension corresponds to one word, usually weighted by the word's frequency; this is the widely used vector space model (VSM). Because of the complexity of natural language, such text representations suffer from the "curse of dimensionality", "sparsity" and "semantic loss". Since BOW discards word order, syntax and the like, the semantic information of words is hard to extract and quantify, and semantic representation of text remains very difficult.
The word2vec model proposed by Mikolov et al. is a method for training word vectors: using the context of a word, it maps the word to a low-dimensional real-valued vector, so that more similar words lie closer together in the vector space. Word2vec outputs one vector per word, and the vectors of all the words of a text form the text's vector representation. Word vectors trained with word2vec have been fed into deep neural networks and successfully applied to Chinese word segmentation, POS tagging, sentiment classification, syntactic dependency parsing and other tasks. Word2vec solves the "sparsity" problem, and although it can quantify word-to-word similarity, it does not solve the "semantic loss" and "curse of dimensionality" problems of text.
Topic models can be used to address the "curse of dimensionality" and "sparsity", and can extract the semantic information of words to some extent. They originate from Latent Semantic Indexing (LSI) and the probabilistic Latent Semantic Indexing (pLSI) proposed by Hofmann. Building on pLSI, Blei et al. proposed the LDA (Latent Dirichlet Allocation) topic model. In LDA a topic is a probability distribution over words; semantically similar words are linked through latent topics, so semantic information can be extracted from text and a text can be transformed from the high-dimensional word space to the low-dimensional topic space. Topic models are used, directly or in extended form, throughout natural language processing, for tasks such as clustering and classification, word sense disambiguation and sentiment analysis, as well as for image-processing tasks such as object detection and localization and image segmentation.
Transforming texts with an LDA topic model from the high-dimensional word space to the low-dimensional topic space and then classifying them directly with KNN, naive Bayes, SVM or similar algorithms performs poorly. The reason is that LDA is unsupervised learning: it does not consider the classes of the texts and makes no use of the important class-label information of the training texts. Among existing improvements, Li et al. proposed the Labeled-LDA model; the inventors found that this model trains one LDA model per document class, so the number of parameters to estimate grows several-fold, increasing the complexity of the model.
Summary of the invention
According to one aspect of one or more embodiments of the present disclosure, a text classification method based on a supervised topic model is provided, which can identify the semantic relations between topics and classes and establish an exact mapping between topics and classes.
One or more embodiments of the present disclosure provide a text classification method based on a supervised topic model, comprising:
Constructing an SLDA-TC text classification model, in which every document of the training document collection carries a class label; the parameters to be estimated in the SLDA-TC text classification model include not only the text-topic probability distribution and the topic-word probability distribution but also the topic-class probability distribution.
Training the SLDA-TC text classification model by estimating the SLDA-TC model parameters with the SLDA-TC-Gibbs algorithm. The parameter estimation proceeds as follows: the latent topic of each word is sampled, and it is sampled only from the other training texts that share the class label of the text containing the word; once the latent topic of every word is determined, the topic-word, document-topic and topic-class frequencies are counted to compute the text-topic probability distribution, the topic-word probability distribution and the topic-class probability distribution, thereby establishing an exact mapping between topics and classes.
Topic inference and classification of the text to be classified: the text is fed into the trained SLDA-TC text classification model; first a latent topic is sampled for each word of the document, then the topic probability distribution of the text is inferred, and finally its class label is output from the topic distribution of the document and the topic-class distribution of the SLDA-TC model.
In one or more embodiments, the text-topic probability distribution, the topic-word probability distribution and the topic-class probability distribution obey Dirichlet distributions.
In one or more embodiments, the SLDA-TC model used for classification is generated by iterative training; after the iterations finish, the similarity between topics is assessed with the JS divergence, and the semantic relevance between a topic and a class is assessed with the topic-class distribution parameter of SLDA-TC.
In one or more embodiments, the evaluation indicators of the classification results of the SLDA-TC text classification model include the macro-averaged precision (Macro-Precision), the macro-averaged recall (Macro-Recall) and the macro-averaged F1 value (Macro-F1).
One or more embodiments of the present disclosure further provide a text classification system comprising a text input device, a controller and a display device. The controller comprises a memory and a processor; the memory stores a computer program which, when executed by the processor, implements the following steps:
Constructing an SLDA-TC text classification model, in which every document of the training document collection carries a class label; the parameters to be estimated in the SLDA-TC text classification model include not only the text-topic probability distribution and the topic-word probability distribution but also the topic-class probability distribution.
Training the SLDA-TC text classification model by estimating the SLDA-TC model parameters with the SLDA-TC-Gibbs algorithm. The parameter estimation proceeds as follows: the latent topic of each word is sampled, and it is sampled only from the other training texts that share the class label of the text containing the word; once the latent topic of every word is determined, the topic-word, document-topic and topic-class frequencies are counted to compute the text-topic probability distribution, the topic-word probability distribution and the topic-class probability distribution, thereby establishing an exact mapping between topics and classes.
Topic inference and classification of the text to be classified: the text is fed into the trained SLDA-TC text classification model; first a latent topic is sampled for each word of the document, then the topic probability distribution of the text is inferred, and finally its class label is output from the topic distribution of the document and the topic-class distribution of the SLDA-TC model.
The beneficial effects of the present disclosure are:
By constructing and training the SLDA-TC text classification model, the text classification method and system use the text-topic probability distribution, the topic-word probability distribution and the topic-class probability distribution to extract the latent semantic mappings between words and topics, documents and topics, and topics and classes; the number of topics K need only be slightly larger than the number of classes C, which not only improves classification accuracy but also improves time efficiency.
Detailed description of the invention
The accompanying drawings, which form a part of this disclosure, are provided for further understanding of the disclosure; the illustrative embodiments of the disclosure and their descriptions explain the disclosure and do not unduly limit it.
Fig. 1 is a flow chart of an SLDA-TC text classification method of the present disclosure.
Fig. 2 shows the LDA topic model.
Fig. 3 shows the SLDA-TC text classification model.
Figs. 4(a), 4(b) and 4(c) compare the Macro-Precision, Macro-Recall and Macro-F1 of the classification results of SLDA-TC, LDA-TC and SVM on the 20news-rec data set with C=4, K=8.
Figs. 5(a), 5(b) and 5(c) compare the Macro-Precision, Macro-Recall and Macro-F1 of the classification results of SLDA-TC, LDA-TC and SVM on the sogou data set with C=5, K=8.
Figs. 6(a), 6(b) and 6(c) compare the Macro-Precision, Macro-Recall and Macro-F1 of the classification results of SLDA-TC, LDA-TC and SVM on the 20news-sci data set with C=4, K=90.
Figs. 7(a), 7(b) and 7(c) compare the Macro-Precision, Macro-Recall and Macro-F1 of the classification results of SLDA-TC, LDA-TC and SVM on the 20news-talk data set with C=3, K=90.
Specific embodiment
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the present disclosure. Unless otherwise indicated, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It should also be noted that the terminology used herein is only for describing specific embodiments and is not intended to limit the illustrative embodiments of the disclosure. As used herein, the singular forms are intended to include the plural forms as well unless the context clearly indicates otherwise; furthermore, the terms "comprising" and/or "including" indicate the presence of the stated features, steps, operations, devices, components and/or combinations thereof.
Explanation of terms:
Dirichlet distribution: a probability distribution over a set of continuous multivariate quantities; it is the multivariate generalization of the Beta distribution and is frequently used in Bayesian statistics as a prior.
Gibbs sampling: a Markov chain Monte Carlo (MCMC) algorithm used to draw an approximate sample sequence when it is difficult to sample directly from a multivariate probability distribution.
LDA-TC: the method of first extracting a given number of topics with an LDA topic model and then classifying according to the LDA model.
SVM: Support Vector Machine, a common discriminative method; in machine learning it is a supervised learning model, commonly used for pattern recognition, classification and regression analysis.
Macro-Precision: macro-averaged precision.
Macro-Recall: macro-averaged recall.
Macro-F1: macro-averaged F1 value.
Fig. 1 is a flow chart of a text classification method based on a supervised topic model according to the present disclosure.
As shown in Fig. 1, the text classification method of this embodiment comprises:
S110: construct the SLDA-TC text classification model. Unlike the unsupervised LDA model, every document of the training document collection of the SLDA-TC text classification model carries a class label; the parameters to be estimated in the SLDA-TC text classification model include the text-topic probability distribution, the topic-word probability distribution and the topic-class probability distribution.
As shown in Fig. 2, the text collection of the LDA topic model is W = {w_1, ..., w_M}, where M is the number of texts and K the number of topics. The model has two parameters, θ_m and φ_k, where θ_m is the topic probability distribution of the m-th text and φ_k the word probability distribution of topic k; w_m is the bag-of-words vector of the m-th text, N_m the length of the m-th text, w_{m,n} the n-th word of the m-th text, and z_{m,n} the topic assigned to w_{m,n}. θ_m and φ_k obey Dirichlet distributions and serve as the parameters of the multinomials that generate topics and words respectively; α and β are the priors of the corresponding Dirichlet distributions. The LDA topic model does not take the class of a document into account.
As shown in Fig. 3, unlike LDA, every text w_m of the training text collection of the SLDA-TC model has an observable class label y_m ∈ [1, C], where C is the number of classes; the class is assumed to obey a multinomial distribution correlated with the text's topic probabilities. w_{m,n} and y_m are observable, while z_{m,n} is a latent topic. Besides the parameters φ_k and θ_m, the present disclosure introduces a new parameter δ_k denoting the class probability distribution of the k-th topic. θ_m, φ_k and δ_k obey Dirichlet distributions and serve as the parameters of the multinomials that generate topics, words and classes respectively; α, β and γ are the priors of the corresponding Dirichlet distributions.
S120: SLDA-TC topic model parameter estimation — train the SLDA-TC text classification model and establish the mapping between topics and classes.
LDA estimates the parameters φ and θ with the Gibbs sampling algorithm; the SLDA-TC model must estimate φ, θ and δ. On the basis of Gibbs sampling, the present disclosure proposes the SLDA-TC-Gibbs algorithm: instead of computing θ_m, φ_k and δ_k directly, it first samples the latent topic of every word; once the latent topic of every word is determined, θ_m, φ_k and δ_k can be computed by counting frequencies.
At each step the Gibbs sampling algorithm samples the latent topic assignment z_i = k of a word w_i while keeping the other components z_{¬i} of z fixed, computed as formula (1):

p(z_i = k | z_{¬i}, w) ∝ (n_{k,¬i}^{(t)} + β_t) / Σ_{v=1}^{V} (n_{k,¬i}^{(v)} + β_v) · (n_{m,¬i}^{(k)} + α_k) / Σ_{z=1}^{K} (n_{m,¬i}^{(z)} + α_z)   (1)

At each step the SLDA-TC-Gibbs algorithm samples the latent topic assignment z_i = k of a word w_i = t while keeping the other word components z_{¬i} of z fixed and also keeping y_{¬m} fixed, computed as formula (2):

p(z_i = k | z_{¬i}, w, y_{¬m}) ∝ (n_{k,¬i}^{(t)} + β_t) / Σ_{v=1}^{V} (n_{k,¬i}^{(v)} + β_v) · (n_{m,¬i}^{(k)} + α_k) / Σ_{z=1}^{K} (n_{m,¬i}^{(z)} + α_z) · (n_{k,¬m}^{(j)} + γ_j) / Σ_{c=1}^{C} (n_{k,¬m}^{(c)} + γ_c)   (2)

Here w = (w_1, ..., w_M) are the word vectors of all documents, z = (z_1, ..., z_V) is the topic vector (V is the dictionary length), and y = (y_1, ..., y_M) is the class vector. The i-th word is w_i = t, z_i denotes the topic variable of the i-th word, and z_{¬i} denotes z with its i-th component removed. The document containing the i-th word w_i = t is the m-th document w_m, its class label is y_m = j, j ∈ [1, C], and y_{¬m} denotes y with its m-th component removed. n_k^{(v)} is the number of times topic k is assigned to word v, and β_v the Dirichlet prior of word v. n_m^{(z)} is the number of times document m is assigned topic z, and α_z the Dirichlet prior of topic z. n_{k,¬i}^{(t)} is the number of times topic k is assigned to word t with the i-th position (i.e. the word w_i = t) excluded. n_{k,¬m}^{(j)} is the number of documents of class j assigned topic k with the m-th document (whose class label is y_m = j) excluded, and γ_j the Dirichlet prior of class j.
Compared with (1), formula (2) introduces y_{¬m} on the left side and one extra factor on the right side, which constrains the sampling of the latent topics of the words in the m-th document: latent topics are sampled only from the other training documents with the same class as document m. Because documents of the same class have similar topic distributions, formula (2) is more reasonable when class labels are available. The present disclosure generates only one SLDA-TC model and needs to estimate only one set of parameters φ, θ and δ. The derivation of formula (2) is proved as follows:
Proof: given the training document collection, let w = (w_1, ..., w_M), y = (y_1, ..., y_M), z = (z_1, ..., z_V). The joint distribution of the SLDA-TC probabilistic model is formula (3):

p(w, z, y | α, β, γ) = p(w | z, β) · p(z | α) · p(y | z, γ)   (3)

Integrating out the Dirichlet-distributed parameters φ, θ and δ gives

p(w | z, β) = ∏_{k=1}^{K} Δ(n_k + β) / Δ(β)   (4)
p(z | α) = ∏_{m=1}^{M} Δ(n_m + α) / Δ(α)   (5)
p(y | z, γ) = ∏_{k=1}^{K} Δ(n_k^{(y)} + γ) / Δ(γ)   (6)

where n_k = (n_k^{(1)}, ..., n_k^{(V)}), n_m = (n_m^{(1)}, ..., n_m^{(K)}), n_k^{(y)} = (n_k^{(1)}, ..., n_k^{(C)}), Δ(x) = ∏_i Γ(x_i) / Γ(Σ_i x_i), and Γ(·) is the Gamma function.
According to the SLDA-TC-Gibbs algorithm, the topic of the word w_i = t of the m-th document is sampled with z_{¬i} and y_{¬m} held constant, so

p(z_i = k | z_{¬i}, w, y_{¬m}) = p(w, z, y) / p(w, z_{¬i}, y_{¬m})   (8)

Substituting (3)-(6) into (8) and cancelling all factors that do not involve position i or document m, the ratios of the Δ terms reduce, via Γ(x+1) = x·Γ(x), to the three count ratios of formula (2). Formula (2) is thus proved.
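The constrained sampling step of formula (2) can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the names `n_kt`, `n_mk` and `n_kj` for the count tables are hypothetical, and symmetric scalar priors are assumed. The counts passed in must already exclude the current word (and, for `n_kj`, the current document), as required by the ¬i and ¬m terms.

```python
import random

def slda_tc_conditional(t, m, j, n_kt, n_mk, n_kj, alpha, beta, gamma):
    """Unnormalized p(z_i = k | z_-i, w, y_-m) of formula (2) for every topic k.
    n_kt: topic-word counts, n_mk: document-topic counts, n_kj: topic-class counts."""
    K, V, C = len(n_kt), len(n_kt[0]), len(n_kj[0])
    probs = []
    for k in range(K):
        word_term = (n_kt[k][t] + beta) / (sum(n_kt[k]) + V * beta)
        doc_term = (n_mk[m][k] + alpha) / (sum(n_mk[m]) + K * alpha)
        cls_term = (n_kj[k][j] + gamma) / (sum(n_kj[k]) + C * gamma)
        probs.append(word_term * doc_term * cls_term)
    return probs

def sample_topic(probs, rng):
    """Draw a topic index with probability proportional to the weights."""
    r = rng.random() * sum(probs)
    acc = 0.0
    for k, p in enumerate(probs):
        acc += p
        if r <= acc:
            return k
    return len(probs) - 1
```

The third factor is what distinguishes SLDA-TC from plain LDA sampling: a topic that has rarely been assigned to documents of class j receives a small `cls_term` and is therefore unlikely to be drawn for a word in a class-j document.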
After the topic label z of every word w has been obtained, the parameters φ_k, θ_m and δ_k of the SLDA-TC model are computed as follows:

φ_{k,t} = (n_k^{(t)} + β_t) / Σ_{v=1}^{V} (n_k^{(v)} + β_v)   (10)
θ_{m,k} = (n_m^{(k)} + α_k) / Σ_{z=1}^{K} (n_m^{(z)} + α_z)   (11)
δ_{k,j} = (n_k^{(j)} + γ_j) / Σ_{c=1}^{C} (n_k^{(c)} + γ_c)   (12)

where φ_{k,t} is the probability that topic k generates word t, θ_{m,k} the probability that document m is assigned topic k, and δ_{k,j} the probability that topic k belongs to class j; n_k^{(t)} is the number of times topic k is assigned to word t, n_m^{(k)} the number of times document m is assigned topic k, and n_k^{(j)} the number of documents of class j assigned topic k, with k = 1..K, t = 1..V, m = 1..M.
The parameters φ, θ and δ of the SLDA-TC model are thus estimated.
Description of the SLDA-TC-Gibbs algorithm:
Algorithm:
Input: document vectors w, class labels y, hyperparameters α, β, γ, number of topics K, number of iterations T.
Output: topic assignment z, parameters φ, θ and δ.
Initialization: the count variables are initialized to 0.
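The training loop described above can be sketched end to end as follows. This is an illustrative reconstruction under stated assumptions, not the patent's code: the names are hypothetical, symmetric scalar priors are used, and for simplicity the topic-class counts `n_kj` are maintained per word occurrence rather than per document, which deviates slightly from the document-level count used in the disclosure.

```python
import random

def train_slda_tc(docs, labels, K, V, C, alpha=0.01, beta=0.01, gamma=0.01,
                  iters=10, seed=0):
    """Sketch of SLDA-TC-Gibbs training. docs: lists of word ids per document,
    labels: class id per document. Returns (phi, theta, delta)."""
    rng = random.Random(seed)
    M = len(docs)
    n_kt = [[0] * V for _ in range(K)]   # topic-word counts
    n_mk = [[0] * K for _ in range(M)]   # document-topic counts
    n_kj = [[0] * C for _ in range(K)]   # topic-class counts (per word here)
    z = []
    for m, doc in enumerate(docs):       # random initial topic assignments
        zm = []
        for t in doc:
            k = rng.randrange(K)
            zm.append(k)
            n_kt[k][t] += 1; n_mk[m][k] += 1; n_kj[k][labels[m]] += 1
        z.append(zm)
    for _ in range(iters):
        for m, doc in enumerate(docs):
            j = labels[m]
            for i, t in enumerate(doc):
                k = z[m][i]              # remove the current assignment
                n_kt[k][t] -= 1; n_mk[m][k] -= 1; n_kj[k][j] -= 1
                probs = []               # conditional of formula (2)
                for kk in range(K):
                    probs.append(
                        (n_kt[kk][t] + beta) / (sum(n_kt[kk]) + V * beta)
                        * (n_mk[m][kk] + alpha) / (sum(n_mk[m]) + K * alpha)
                        * (n_kj[kk][j] + gamma) / (sum(n_kj[kk]) + C * gamma))
                r = rng.random() * sum(probs)
                acc = 0.0
                for kk, p in enumerate(probs):
                    acc += p
                    if r <= acc:
                        k = kk
                        break
                z[m][i] = k              # record the new assignment
                n_kt[k][t] += 1; n_mk[m][k] += 1; n_kj[k][j] += 1
    # closed-form parameter estimates for phi, theta and delta from the counts
    phi = [[(n_kt[k][t] + beta) / (sum(n_kt[k]) + V * beta)
            for t in range(V)] for k in range(K)]
    theta = [[(n_mk[m][k] + alpha) / (sum(n_mk[m]) + K * alpha)
              for k in range(K)] for m in range(M)]
    delta = [[(n_kj[k][j] + gamma) / (sum(n_kj[k]) + C * gamma)
              for j in range(C)] for k in range(K)]
    return phi, theta, delta
```

Each row of `phi`, `theta` and `delta` is a proper probability distribution (it sums to 1 by construction), matching the roles of φ, θ and δ described above.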
After training has generated the SLDA-TC model, the latent topic of each word of a new document d is inferred with formula (13):

p(z̃_i = k | z̃_{¬i}, w̃) ∝ (n_k^{(t)} + ñ_{k,¬i}^{(t)} + β_t) / Σ_{v=1}^{V} (n_k^{(v)} + ñ_{k,¬i}^{(v)} + β_v) · (ñ_{d,¬i}^{(k)} + α_k) / Σ_{z=1}^{K} (ñ_{d,¬i}^{(z)} + α_z)   (13)

where w̃ is the word vector of the new document d, z̃ its topic vector, ñ_{k,¬i}^{(t)} the number of times topic k is assigned to word t in the new document with its i-th item (i.e. the i-th word) excluded, and ñ_{d,¬i}^{(k)} the number of times the new document d is assigned topic k with its i-th item excluded; the other symbols are as in formula (2).
Once the latent topic label of every word of the new document d has been computed with formula (13), the probability that d belongs to each topic is computed with formula (14):

θ̃_{d,k} = (ñ_d^{(k)} + α_k) / Σ_{z=1}^{K} (ñ_d^{(z)} + α_z)   (14)

This yields the topic probability distribution θ̃_d of the new document d.
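The inference step of formulas (13) and (14) can be sketched as follows, again with hypothetical names and symmetric scalar priors; `n_kt` is the topic-word count table of the trained model, which stays fixed while only the new document's own counts (the ñ counts) are resampled.

```python
import random

def infer_new_doc(new_doc, n_kt, alpha=0.01, beta=0.01, iters=20, seed=0):
    """Sample latent topics for a new document per formula (13) and return
    its topic probability distribution per formula (14)."""
    rng = random.Random(seed)
    K, V = len(n_kt), len(n_kt[0])
    nn_kt = [[0] * V for _ in range(K)]  # the new document's topic-word counts
    nn_dk = [0] * K                      # the new document's topic counts
    z = []
    for t in new_doc:                    # random initialization
        k = rng.randrange(K)
        z.append(k); nn_kt[k][t] += 1; nn_dk[k] += 1
    for _ in range(iters):
        for i, t in enumerate(new_doc):
            k = z[i]                     # remove the current assignment
            nn_kt[k][t] -= 1; nn_dk[k] -= 1
            probs = []                   # conditional of formula (13)
            for kk in range(K):
                word_term = ((n_kt[kk][t] + nn_kt[kk][t] + beta)
                             / (sum(n_kt[kk]) + sum(nn_kt[kk]) + V * beta))
                doc_term = (nn_dk[kk] + alpha) / (sum(nn_dk) + K * alpha)
                probs.append(word_term * doc_term)
            r = rng.random() * sum(probs)
            acc = 0.0
            for kk, p in enumerate(probs):
                acc += p
                if r <= acc:
                    k = kk
                    break
            z[i] = k
            nn_kt[k][t] += 1; nn_dk[k] += 1
    denom = sum(nn_dk) + K * alpha       # formula (14)
    return [(nn_dk[k] + alpha) / denom for k in range(K)]
```

With a trained model whose topic 0 strongly favors word 0, a new document made of word 0 is pulled toward topic 0.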
S130: feed the text to be classified into the trained SLDA-TC text classification model, infer the topics of the text, and then predict its class.
Given the trained SLDA-TC model, let w̃ be the word vector of the new document d, z̃ its topic vector, and ỹ_d the class prediction for the new document d. The class of the new document is computed as

ỹ_d = argmax_{j ∈ [1, C]} p(y = j | w̃, z̃)   (15)

It is assumed that the test sample set and the training sample set follow the same distribution, i.e. they are consistent in the latent topic-class distribution; therefore p(y | z̃) can be replaced by p(y | z), which is given by the parameter δ of the SLDA-TC model, and θ̃_d is the topic probability distribution of the new document d. From formulas (12), (14) and (15):

ỹ_d = argmax_{j ∈ [1, C]} Σ_{k=1}^{K} θ̃_{d,k} · δ_{k,j}   (16)
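The classification rule above — the argmax over classes of Σ_k θ̃_{d,k} · δ_{k,j} — amounts to a matrix-vector product followed by an argmax; a minimal sketch (names are illustrative):

```python
def classify(theta_d, delta):
    """Predict the class of a document from its topic distribution theta_d
    and the topic-class distribution delta: argmax_j sum_k theta_dk * delta_kj."""
    K, C = len(delta), len(delta[0])
    scores = [sum(theta_d[k] * delta[k][j] for k in range(K)) for j in range(C)]
    return scores.index(max(scores))
```

A document whose topic mass sits on a topic that maps almost entirely to one class is assigned that class.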
The similarity between topics is assessed with the JS divergence (Jensen-Shannon divergence), also called the JS distance, a variant of the KL divergence (Kullback-Leibler divergence), computed as formula (17). Unlike the KL divergence, the JS divergence is symmetric and satisfies the triangle inequality.

JS(p_i || p_j) = 0.5 · KL(p_i || (p_i + p_j)/2) + 0.5 · KL(p_j || (p_i + p_j)/2)   (17)

where p_i and p_j are the word probability distributions of topic i and topic j. The JS divergence takes values in [0, 1]; 0 means that p_i and p_j are identically distributed, and 1 means that they are completely different.
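Formula (17) can be implemented directly; using base-2 logarithms keeps the value range in [0, 1] as stated above (an implementation choice, since the formula itself does not fix the logarithm base):

```python
import math

def kl(p, q):
    """KL divergence with base-2 logarithm (terms with p_i = 0 contribute 0)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Formula (17): JS divergence between two word probability distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Two identical topic-word distributions give 0, and two distributions with disjoint support give 1, matching the interpretation used in the tables below.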
The semantic relevance between a topic and a class is measured by the parameter δ of the SLDA-TC model, computed as shown in formula (12); δ_{k,j} is the probability that topic k belongs to class j.
In one or more embodiments, the SLDA-TC model used for classification is generated by iterative training; after the iterations finish, the similarity between topics is assessed with the JS divergence, and the semantic relevance between a topic and a class is assessed with the topic-class distribution parameter of SLDA-TC. The classification results of the SLDA-TC model are evaluated with the macro-averaged precision (Macro-Precision), the macro-averaged recall (Macro-Recall) and the macro-averaged F1 value (Macro-F1).
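The three macro-averaged indicators can be computed as follows; this is a standard implementation of macro averaging (per-class scores computed first, then averaged with equal class weights), not code from the patent:

```python
def macro_scores(y_true, y_pred, classes):
    """Macro-averaged precision, recall and F1 over the given classes."""
    ps, rs, fs = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        ps.append(prec); rs.append(rec); fs.append(f1)
    n = len(classes)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n
```

Because every class contributes equally regardless of its size, macro averaging penalizes a classifier that does well only on the large classes.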
Experimental verification and analysis:
Three data subsets of the English 20newsgroup data set — rec, sci and talk — and a subset of the Sogou Chinese corpus containing the five classes IT, military, education, tourism and finance were chosen; in each data subset the ratio of training to test samples is 8:2. The data subsets are described in Table 1. The Chinese texts were segmented with jieba, English stemming used nltk.stem, and after stop-word removal, feature selection with TF-IDF retained 60% of the feature words in the experiments.
Table 1. Description of the data sets
Data set | Training texts | Classes | Features
20news-rec | 3979 | 4 | 47067
20news-sci | 2373 | 4 | 57943
20news-talk | 1676 | 3 | 40945
sogou | 2445 | 5 | 70819
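The TF-IDF feature selection step described above (retaining 60% of the feature words) can be sketched with the standard library alone. Scoring each word by its maximum TF-IDF over the documents and using a smoothed IDF are illustrative choices, not taken from the patent:

```python
import math
from collections import Counter

def tfidf_select(docs, keep_ratio=0.6):
    """Score each word by its maximum TF-IDF over the documents and keep
    the top keep_ratio fraction of the vocabulary."""
    n_docs = len(docs)
    df = Counter()                       # document frequency of each word
    for doc in docs:
        df.update(set(doc))
    scores = {}
    for doc in docs:
        tf = Counter(doc)
        for w, c in tf.items():
            s = (c / len(doc)) * math.log((1 + n_docs) / (1 + df[w]))
            scores[w] = max(scores.get(w, 0.0), s)
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    kept = ranked[:max(1, int(len(ranked) * keep_ratio))]
    return set(kept)
```

A word that appears in every document gets an IDF of 0 and is dropped first, which is the intended effect of TF-IDF selection after stop-word removal.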
To verify the validity of the proposed method, the three algorithms SLDA-TC, LDA-TC and SVM were compared experimentally. SLDA-TC is the algorithm proposed in this disclosure; LDA-TC classifies directly on a traditional LDA model; SVM is an SVM classifier that uses the K topics of an LDA model as features.
The goal of topic inference in the SLDA-TC model is to establish the mapping between topics and classes; the number of topics K is related to the number of classes C of the labeled training set, and the experiments show that K only needs to take a value slightly larger than C for SLDA-TC to reach good classification accuracy. In the experiments we analyzed the JS distances between the topics generated by the SLDA-TC topic model, the probability distributions of the top 10 feature words of each topic, and the relevance between classes and topics. Tables 2-4 describe the experimental results of the SLDA-TC topic model generated on the sogou data subset with C=5, K=8, and Tables 5-7 describe the experimental results on the 20news-talk data subset; α, β and γ of the SLDA-TC topic model were all set to 0.01.
Table 2. JS divergence between topics (SLDA-TC, sogou, C=5, K=8)
Table 3. Relevance between topics and classes (SLDA-TC, sogou, C=5, K=8)
Table 4. Probability distributions of the top 10 words of each topic (SLDA-TC, sogou, C=5, K=8)
As shown in Table 2, the JS divergence between topics 2, 5 and 7 is 0, indicating that they are identically distributed topics; Table 4 confirms that the probability distributions of the top 10 feature words of these 3 topics are also identical. The JS divergence between the other 5 topics lies between 0.45 and 0.58, indicating 5 distinct topics. Tables 3 and 4 show that topic 0 maps to class "IT", topic 1 to "tourism", topic 3 to "finance", topic 4 to "military" and topic 6 to "education", each with a relevance above 99%, while topics 2, 5 and 7 are unrelated to any class and are called "useless" topics.
Table 5. JS divergence between topics (SLDA-TC, 20news-talk, C=3, K=6)
Table 6. Relevance between classes and topics (SLDA-TC, 20news-talk, C=3, K=6)
Table 7. Probability distributions of the top 10 words of each topic (SLDA-TC, 20news-talk, C=3, K=6)
Tables 5-7 show the experimental results on 20news-talk: topics 1, 3 and 4 are identically distributed "useless" topics, while the significant topic 0 corresponds to class talk.politics.guns, topic 2 to talk.politics.misc, and topic 5 to talk.politics.mideast.
Many experiments found that when K - C ≥ 2, K - C "useless" topics with pairwise JS divergence 0 are generated; hence K only needs to be chosen slightly larger than C, so the K of the SLDA-TC topic model is easy to determine. The SLDA-TC topic model can filter out the K - C "useless" topics and establish the exact mapping between topics and classes. Moreover, since K is only slightly larger than C and far below the K of LDA, the training time of the model can be significantly reduced.
Figs. 4(a)-7(c) compare the classification results of SLDA-TC, LDA-TC and SVM on the sogou Chinese subset and the three 20newsgroup subsets, with TF-IDF feature selection retaining 60% of the feature words and the SLDA and LDA models generated iteratively for different values of the number of topics K. In Figs. 4(a)-7(c) the abscissa is the number of iterations.
(1) For SLDA-TC, the number of topics K need only be slightly larger than the number of classes C, and the classification results are better than those of LDA-TC and SVM.
As shown in Figs. 4(a)-4(c), on the 20news-rec data set with C=4, K=8, the Macro-Precision, Macro-Recall and Macro-F1 of SLDA-TC are 95.10%, 94.99% and 94.98%, while those of LDA-TC are 63.76%, 60.91% and 60.33% and those of SVM 68.82%, 68.33% and 68.08%. As K increases, LDA-TC and SVM improve: at K=80, LDA-TC reaches at most 71.85%, 71.38% and 71.41% and SVM at most 83.90%, 83.70% and 83.62%, still below SLDA-TC. In addition, the training time of the topic model at K=80 is far higher than that of the SLDA-TC model at K=6.
As shown in Figs. 5(a)-5(c), on the sogou data set with C=5, K=8, the Macro-Precision, Macro-Recall and Macro-F1 of SLDA-TC are 92.80%, 92.73% and 92.70%, while those of LDA-TC are 72.67%, 68.89% and 67.48% and those of SVM 80.69%, 80.40% and 80.28%. As K increases, the SVM indicators rise steadily; when K exceeds 60, the three indicators reach 89.26%, 89.95% and 89.24%, still below the 92.80%, 92.73% and 92.70% of SLDA-TC at K=8, and the SVM with K=60 (using LDA topics as features) pays a much higher time cost than SLDA-TC at K=8.
(2) A larger K is not better for SLDA-TC: when K is very large, the three classification indicators of SLDA-TC decline instead, which again shows that K need only be slightly larger than C.
As shown in Figs. 6(a)-6(c), on the 20news-sci data set with C=4 and K=90, the classification results of SLDA-TC are in fact very poor. The reason is that with C=4, only 4 of the 90 generated topics are related to classes; the remaining 86 are "useless" topics that do not help classification and instead interfere with it, degrading the results. The same holds on the 20news-talk data set with C=3, K=90, as shown in Figs. 7(a)-7(c). A large number of experiments show that with K taking a value slightly larger than C, the SLDA-TC algorithm obtains very high classification accuracy.
Table 8 compares the time performance and classification results of SLDA-TC with those of LDA-TC and SVM on the different data sets.
Table 8. Comparison of SLDA-TC with LDA-TC and SVM in time performance and classification results
The generation time of a topic model is proportional to the number of topics K: the larger K, the higher the time cost. For the SLDA-TC model, K need only be slightly larger than the number of classes C to obtain very good classification results, whereas LDA-TC and the SVM algorithm that uses LDA topics as features need K in the tens or even hundreds to obtain decent classification results, as can also be seen from the experimental results shown in Figs. 4(a)-7(c).
As shown in Table 8, on 20news-rec the ratio of the model-generation times of LDA and SLDA-TC is 4.86, i.e. SLDA-TC is 4.86 times faster than LDA, because their K values are 200 and 8 respectively. On the 20news-sci, 20news-talk and sogou data sets, SLDA-TC is 4.78, 5.16 and 4.90 times faster than LDA respectively. Meanwhile, in terms of the Macro-Precision, Macro-Recall and Macro-F1 indicators, the SLDA-TC algorithm clearly outperforms the LDA-TC and SVM algorithms: on the 4 data sets, SLDA-TC is 3.10%-9.30% higher than SVM and 7.10%-34.08% higher than LDA-TC.
In conclusion, the number of topics K of the SLDA-TC model need only take a value slightly larger than the number of categories C; the model can identify topics closely related to the categories, and in both classification precision and time performance it is significantly better than LDA-TC and the SVM algorithm with LDA topics as features.
Addressing the problems of the LDA model in text classification, the present disclosure proposes an SLDA-TC text classification model based on a supervised topic model, together with an SLDA-TC-Gibbs parameter estimation algorithm: each time the hidden topic variable z_i=k of a word w_i=t is sampled, the other components z_¬i of z are kept unchanged (as are the remaining observed variables), and the hidden topic is sampled only from the other training documents whose class label is the same as that of the document containing the word, because documents of the same category have similar topic distributions; a theoretical proof of this is given. The SLDA-TC model introduces a topic-class probability distribution parameter δ, and through the φ, θ and δ probability distributions extracts and maps the latent semantic information between words and topics, documents and topics, and topics and categories. In addition, the number of topics K of SLDA-TC need only take a value slightly larger than the number of categories. Experiments show that the SLDA-TC model can significantly improve classification precision and time efficiency.
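As a rough, non-authoritative illustration of the restricted sampling step described above (the count-array layout, the hyperparameters alpha, beta, gamma, and the exact form of the conditional are assumptions for illustration; the disclosure gives the formal derivation elsewhere), a single SLDA-TC-Gibbs draw might look like:

```python
import numpy as np

def sample_topic(t, label, n_tw_same, n_dt_d, n_tc, alpha, beta, gamma, rng=None):
    """Draw a hidden topic for word t in a document of class `label`.

    n_tw_same : topic-word counts taken ONLY over training documents whose
                class label equals `label`, shape (K, V) -- this restriction
                is the core of the SLDA-TC-Gibbs step described above
    n_dt_d    : topic counts of the current document, shape (K,)
    n_tc      : topic-class counts, shape (K, C)
    Counts are assumed to already exclude the word's current assignment.
    """
    rng = rng or np.random.default_rng()
    K, V = n_tw_same.shape
    C = n_tc.shape[1]
    # p(z=k | ...) proportional to (doc-topic) * (topic-word over same-label
    # docs) * (topic-class), each Dirichlet-smoothed
    p_dt = n_dt_d + alpha
    p_tw = (n_tw_same[:, t] + beta) / (n_tw_same.sum(axis=1) + V * beta)
    p_tc = (n_tc[:, label] + gamma) / (n_tc.sum(axis=1) + C * gamma)
    p = p_dt * p_tw * p_tc
    return int(rng.choice(K, p=p / p.sum()))
```

The key difference from a plain LDA Gibbs sampler is that `n_tw_same` counts topic-word co-occurrences only over training documents sharing the word's class label, so topics are pulled toward class-specific vocabulary.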
One or more embodiments of the present disclosure also provide a text classification system, including a text input device, a controller and a display device; the controller includes a memory and a processor, the memory stores a computer program, and when the program is executed by the processor the following steps, as shown in Fig. 1, can be realized:
(1) Construct the SLDA-TC text classification model. Unlike the unsupervised LDA model, each document of the training document set of the SLDA-TC text classification model has a class label. The parameters to be estimated in the SLDA-TC text classification model include not only the text-topic probability distribution and the topic-word probability distribution, but also the topic-class probability distribution; the text-topic probability distribution, the topic-word probability distribution and the topic-class probability distribution all obey the Dirichlet distribution.
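Each of the three parameter families is a draw from a Dirichlet prior. A minimal numerical illustration (the dimensions and hyperparameter values here are arbitrary toy assumptions, not values from the disclosure):

```python
import numpy as np

rng = np.random.default_rng(42)
K, V, C = 5, 100, 3                  # topics, vocabulary size, categories (toy sizes)
alpha, beta, gamma = 0.1, 0.01, 0.1  # illustrative Dirichlet hyperparameters

theta = rng.dirichlet([alpha] * K)           # text-topic distribution of one document
phi = rng.dirichlet([beta] * V, size=K)      # topic-word distributions, shape (K, V)
delta = rng.dirichlet([gamma] * C, size=K)   # topic-class distributions, shape (K, C)

# every draw is a valid probability distribution (rows sum to 1)
assert np.isclose(theta.sum(), 1.0)
```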
(2) Train the SLDA-TC text classification model and perform SLDA-TC model parameter estimation.
Specifically, during training of the SLDA-TC text classification model, the number of topics K is first set, taking a value slightly larger than the number of categories C. Then the hidden topic of each word is sampled according to the SLDA-TC-Gibbs algorithm, and the hidden topic is sampled only from the other training texts whose class label is the same as that of the text containing the word. After the hidden topic of each word is determined, the text-topic probability distribution, the topic-word probability distribution and the topic-class probability distribution are calculated by counting the topic-word, document-topic and topic-class frequencies, and an accurate mapping between topics and categories is established.
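Once every word's hidden topic is fixed, the three distributions follow from the frequency counts by Dirichlet smoothing and row normalization. A sketch under assumed array shapes (the symbols theta, phi and delta for the text-topic, topic-word and topic-class distributions follow common topic-model notation, not wording from the disclosure):

```python
import numpy as np

def estimate_parameters(n_dt, n_tw, n_tc, alpha, beta, gamma):
    """Turn raw frequency counts into the three SLDA-TC distributions.

    n_dt : document-topic counts, shape (D, K)
    n_tw : topic-word counts,     shape (K, V)
    n_tc : topic-class counts,    shape (K, C)
    Returns (theta, phi, delta): the text-topic, topic-word and topic-class
    probability distributions, Dirichlet-smoothed and row-normalized.
    """
    theta = (n_dt + alpha) / (n_dt + alpha).sum(axis=1, keepdims=True)
    phi   = (n_tw + beta)  / (n_tw + beta).sum(axis=1, keepdims=True)
    delta = (n_tc + gamma) / (n_tc + gamma).sum(axis=1, keepdims=True)
    return theta, phi, delta
```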
(3) Infer the topic of the text to be classified and classify it. The text to be classified is input into the trained SLDA-TC text classification model; first, hidden topic sampling is performed on each word of the document to be classified; then the topic probability distribution of the text is inferred; finally, the class label of the text is output according to the topic distribution of the document to be classified and the topic-class distribution of the SLDA-TC model.
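The final classification step can be read as a matrix-vector product: the inferred document-topic distribution is combined with the model's topic-class distribution and the highest-scoring class is output. A sketch (assuming `theta_d` has already been inferred by sampling over the test document):

```python
import numpy as np

def predict_label(theta_d, delta):
    """Predict the class of a test document.

    theta_d : inferred topic distribution of the document, shape (K,)
    delta   : topic-class distribution of the trained model, shape (K, C)
    The predicted label maximizes p(c|d) = sum_k theta_d[k] * delta[k, c].
    """
    class_scores = theta_d @ delta  # shape (C,)
    return int(np.argmax(class_scores))
```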
(4) Assess the SLDA-TC model and the classification results. For the SLDA-TC model generated by multiple training iterations, the similarity between topics is assessed by the JS divergence, and the semantic relevance between topics and categories is assessed by the topic-class distribution parameter of SLDA-TC; the text classification results of the SLDA-TC model are assessed by the macro-averaged classification precision (Macro-Precision), macro-averaged recall (Macro-Recall) and macro-averaged F1 value (Macro-F1) indicators.
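Both evaluation tools named here are standard; a compact sketch of the JS divergence between two topic-word distributions and of the macro-averaged indicators (this version derives Macro-F1 from the macro-averaged precision and recall, which is one common convention):

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two probability distributions."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))  # KL divergence in nats
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def macro_scores(y_true, y_pred, num_classes):
    """Macro-averaged precision, recall and F1 over all classes."""
    precisions, recalls = [], []
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    mp, mr = np.mean(precisions), np.mean(recalls)
    mf1 = 2 * mp * mr / (mp + mr) if mp + mr else 0.0
    return mp, mr, mf1
```

A JS divergence near zero flags two topics as near-duplicates; well-separated topics approach the maximum of ln 2.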
By constructing and training the SLDA-TC text classification model, the text classification method and system of the present disclosure use the text-topic probability distribution, the topic-word probability distribution and the topic-class probability distribution to extract and map the latent semantic information between words and topics, documents and topics, and topics and categories, improving text classification precision and time efficiency.
Those skilled in the art should understand that embodiments of the present disclosure may be provided as a method, a system, or a computer program product. Therefore, the present disclosure may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present disclosure may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage and optical storage) containing computer-usable program code.
The present disclosure is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to the embodiments. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be realized by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce a device for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of guiding a computer or another programmable data processing device to work in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, the instruction device realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, such that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Those of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments can be completed by instructing the relevant hardware through a computer program; the program can be stored in a computer-readable storage medium, and when executed may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.
Although the specific embodiments of the present disclosure have been described above in conjunction with the accompanying drawings, they are not a limitation of the protection scope of the present disclosure. Those skilled in the art should understand that, on the basis of the technical solution of the present disclosure, various modifications or variations that can be made without creative labor still fall within the protection scope of the present disclosure.
Claims (8)
1. A text classification method based on a supervised topic model, characterized by comprising:
constructing an SLDA-TC text classification model, wherein each document of the training document set of the SLDA-TC text classification model has a class label, and the parameters to be estimated in the SLDA-TC text classification model include not only a text-topic probability distribution and a topic-word probability distribution but also a topic-class probability distribution;
training the SLDA-TC text classification model and performing SLDA-TC model parameter estimation according to an SLDA-TC-Gibbs algorithm, wherein the process of performing SLDA-TC model parameter estimation according to the SLDA-TC-Gibbs algorithm is: sampling the hidden topic of each word, and sampling the hidden topic only from the other training texts whose class label is the same as that of the text containing the word; after the hidden topic of each word is determined, calculating the text-topic probability distribution, the topic-word probability distribution and the topic-class probability distribution by counting the topic-word, document-topic and topic-class frequencies, thereby establishing an accurate mapping between topics and categories;
inferring the topic of a text to be classified and classifying the text: inputting the text to be classified into the trained SLDA-TC text classification model, first performing hidden topic sampling on each word of the document to be classified, then inferring the topic probability distribution of the text to be classified, and outputting the class label of the text to be classified according to the topic distribution of the document to be classified and the topic-class distribution of the SLDA-TC model.
2. The text classification method based on a supervised topic model according to claim 1, characterized in that the parameters to be estimated in the SLDA-TC text classification model include not only the text-topic probability distribution and the topic-word probability distribution but also the topic-class probability distribution, and the parameters obey the Dirichlet distribution.
3. The text classification method based on a supervised topic model according to claim 1, characterized in that the SLDA-TC model used for text classification is generated through multiple training iterations; when the iterations end, the similarity between topics is assessed by the JS divergence, and the semantic relevance between topics and categories is assessed by the topic-class distribution parameter of SLDA-TC.
4. The text classification method based on a supervised topic model according to claim 3, characterized in that the text classification result evaluation indicators include the macro-averaged classification precision, the macro-averaged recall and the macro-averaged F1 value.
5. A text classification system based on a supervised topic model, comprising a text input device, a controller and a display device, the controller comprising a memory and a processor, characterized in that the memory stores a computer program which, when executed by the processor, realizes the following steps:
constructing an SLDA-TC text classification model, wherein each document of the training document set of the SLDA-TC text classification model has a class label, and the parameters to be estimated in the SLDA-TC text classification model include not only a text-topic probability distribution and a topic-word probability distribution but also a topic-class probability distribution;
training the SLDA-TC text classification model and performing SLDA-TC model parameter estimation according to an SLDA-TC-Gibbs algorithm, wherein the process of performing SLDA-TC model parameter estimation according to the SLDA-TC-Gibbs algorithm is: sampling the hidden topic of each word, and sampling the hidden topic only from the other training texts whose class label is the same as that of the text containing the word; after the hidden topic of each word is determined, calculating the text-topic probability distribution, the topic-word probability distribution and the topic-class probability distribution by counting the topic-word, document-topic and topic-class frequencies, thereby establishing an accurate mapping between topics and categories;
inferring the topic of a text to be classified and classifying the text: inputting the text to be classified into the trained SLDA-TC text classification model, first performing hidden topic sampling on each word of the document to be classified, then inferring the topic probability distribution of the text to be classified, and outputting the class label of the text to be classified according to the topic distribution of the document to be classified and the topic-class distribution of the SLDA-TC model.
6. The text classification system based on a supervised topic model according to claim 5, characterized in that the text-topic probability distribution, the topic-word probability distribution and the topic-class probability distribution obey the Dirichlet distribution.
7. The text classification system based on a supervised topic model according to claim 5, characterized in that the SLDA-TC model used for text classification is generated through multiple training iterations; when the iterations end, the similarity between topics is assessed by the JS divergence, and the semantic relevance between topics and categories is assessed by the topic-class distribution parameter of SLDA-TC.
8. The text classification system based on a supervised topic model according to claim 7, characterized in that the text classification result evaluation indicators include the macro-averaged classification precision, the macro-averaged recall and the macro-averaged F1 value.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811398232.1A CN109408641B (en) | 2018-11-22 | 2018-11-22 | Text classification method and system based on supervised topic model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811398232.1A CN109408641B (en) | 2018-11-22 | 2018-11-22 | Text classification method and system based on supervised topic model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109408641A true CN109408641A (en) | 2019-03-01 |
CN109408641B CN109408641B (en) | 2020-06-02 |
Family
ID=65474659
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811398232.1A Active CN109408641B (en) | 2018-11-22 | 2018-11-22 | Text classification method and system based on supervised topic model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109408641B (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110135592A (en) * | 2019-05-16 | 2019-08-16 | 腾讯科技(深圳)有限公司 | Classifying quality determines method, apparatus, intelligent terminal and storage medium |
CN110321434A (en) * | 2019-06-27 | 2019-10-11 | 厦门美域中央信息科技有限公司 | A kind of file classification method based on word sense disambiguation convolutional neural networks |
CN110569270A (en) * | 2019-08-15 | 2019-12-13 | 中国人民解放军国防科技大学 | bayesian-based LDA topic label calibration method, system and medium |
CN110795564A (en) * | 2019-11-01 | 2020-02-14 | 南京稷图数据科技有限公司 | Text classification method lacking negative cases |
CN110825850A (en) * | 2019-11-07 | 2020-02-21 | 哈尔滨工业大学(深圳) | Natural language theme classification method and device |
CN111368532A (en) * | 2020-03-18 | 2020-07-03 | 昆明理工大学 | Topic word embedding disambiguation method and system based on LDA |
CN111723198A (en) * | 2019-03-18 | 2020-09-29 | 北京京东尚科信息技术有限公司 | Text emotion recognition method and device and storage medium |
CN112733542A (en) * | 2021-01-14 | 2021-04-30 | 北京工业大学 | Theme detection method and device, electronic equipment and storage medium |
CN113032573A (en) * | 2021-04-30 | 2021-06-25 | 《中国学术期刊(光盘版)》电子杂志社有限公司 | Large-scale text classification method and system combining theme semantics and TF-IDF algorithm |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102662960A (en) * | 2012-03-08 | 2012-09-12 | 浙江大学 | On-line supervised theme-modeling and evolution-analyzing method |
CN103810500A (en) * | 2014-02-25 | 2014-05-21 | 北京工业大学 | Place image recognition method based on supervised learning probability topic model |
CN103903164A (en) * | 2014-03-25 | 2014-07-02 | 华南理工大学 | Semi-supervised automatic aspect extraction method and system based on domain information |
CN105069143A (en) * | 2015-08-19 | 2015-11-18 | 百度在线网络技术(北京)有限公司 | Method and device for extracting keywords from document |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102662960A (en) * | 2012-03-08 | 2012-09-12 | 浙江大学 | On-line supervised theme-modeling and evolution-analyzing method |
CN103810500A (en) * | 2014-02-25 | 2014-05-21 | 北京工业大学 | Place image recognition method based on supervised learning probability topic model |
CN103903164A (en) * | 2014-03-25 | 2014-07-02 | 华南理工大学 | Semi-supervised automatic aspect extraction method and system based on domain information |
CN105069143A (en) * | 2015-08-19 | 2015-11-18 | 百度在线网络技术(北京)有限公司 | Method and device for extracting keywords from document |
Non-Patent Citations (4)
Title |
---|
BLEI D M: "Supervised topic models", 《ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS》 * |
SHIBIN ZHOU: "Text Categorization Based on Topic Model", International Journal of Computational Intelligence Systems * |
LI WENBO: "A New Text Classification Algorithm Based on the Labeled-LDA Model", Chinese Journal of Computers * |
WANG DANDAN: "Text Classification Based on Macro-Feature Fusion", Journal of Chinese Information Processing * |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111723198B (en) * | 2019-03-18 | 2023-09-01 | 北京汇钧科技有限公司 | Text emotion recognition method, device and storage medium |
CN111723198A (en) * | 2019-03-18 | 2020-09-29 | 北京京东尚科信息技术有限公司 | Text emotion recognition method and device and storage medium |
CN110135592B (en) * | 2019-05-16 | 2023-09-19 | 腾讯科技(深圳)有限公司 | Classification effect determining method and device, intelligent terminal and storage medium |
CN110135592A (en) * | 2019-05-16 | 2019-08-16 | 腾讯科技(深圳)有限公司 | Classifying quality determines method, apparatus, intelligent terminal and storage medium |
CN110321434A (en) * | 2019-06-27 | 2019-10-11 | 厦门美域中央信息科技有限公司 | A kind of file classification method based on word sense disambiguation convolutional neural networks |
CN110569270A (en) * | 2019-08-15 | 2019-12-13 | 中国人民解放军国防科技大学 | bayesian-based LDA topic label calibration method, system and medium |
CN110569270B (en) * | 2019-08-15 | 2022-07-05 | 中国人民解放军国防科技大学 | Bayesian-based LDA topic label calibration method, system and medium |
CN110795564B (en) * | 2019-11-01 | 2022-02-22 | 南京稷图数据科技有限公司 | Text classification method lacking negative cases |
CN110795564A (en) * | 2019-11-01 | 2020-02-14 | 南京稷图数据科技有限公司 | Text classification method lacking negative cases |
CN110825850B (en) * | 2019-11-07 | 2022-07-08 | 哈尔滨工业大学(深圳) | Natural language theme classification method and device |
CN110825850A (en) * | 2019-11-07 | 2020-02-21 | 哈尔滨工业大学(深圳) | Natural language theme classification method and device |
CN111368532B (en) * | 2020-03-18 | 2022-12-09 | 昆明理工大学 | Topic word embedding disambiguation method and system based on LDA |
CN111368532A (en) * | 2020-03-18 | 2020-07-03 | 昆明理工大学 | Topic word embedding disambiguation method and system based on LDA |
CN112733542B (en) * | 2021-01-14 | 2022-02-08 | 北京工业大学 | Theme detection method and device, electronic equipment and storage medium |
CN112733542A (en) * | 2021-01-14 | 2021-04-30 | 北京工业大学 | Theme detection method and device, electronic equipment and storage medium |
CN113032573A (en) * | 2021-04-30 | 2021-06-25 | 《中国学术期刊(光盘版)》电子杂志社有限公司 | Large-scale text classification method and system combining theme semantics and TF-IDF algorithm |
CN113032573B (en) * | 2021-04-30 | 2024-01-23 | 同方知网数字出版技术股份有限公司 | Large-scale text classification method and system combining topic semantics and TF-IDF algorithm |
Also Published As
Publication number | Publication date |
---|---|
CN109408641B (en) | 2020-06-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109408641A (en) | Text classification method and system based on supervised topic model | |
Muzellec et al. | Generalizing point embeddings using the wasserstein space of elliptical distributions | |
Qaisar | Sentiment analysis of IMDb movie reviews using long short-term memory | |
CN108363804B (en) | Local model weighted fusion Top-N movie recommendation method based on user clustering | |
Clinchant et al. | Aggregating continuous word embeddings for information retrieval | |
CN108959305A (en) | A kind of event extraction method and system based on internet big data | |
Kim et al. | Beyond sentiment: The manifold of human emotions | |
Mikawa et al. | A proposal of extended cosine measure for distance metric learning in text classification | |
CN111209402A (en) | Text classification method and system integrating transfer learning and topic model | |
Budhiraja et al. | A supervised learning approach for heading detection | |
CN109783633A (en) | Data analysis service procedural model recommended method | |
Pamuji | Performance of the K-Nearest Neighbors Method on Analysis of Social Media Sentiment | |
Park et al. | Phrase embedding and clustering for sub-feature extraction from online data | |
Steuber et al. | Topic modeling of short texts using anchor words | |
Hossny et al. | Enhancing keyword correlation for event detection in social networks using SVD and k-means: Twitter case study | |
CN110069558A (en) | Data analysing method and terminal device based on deep learning | |
CN111625578B (en) | Feature extraction method suitable for time series data in cultural science and technology fusion field | |
Dachapally et al. | In-depth question classification using convolutional neural networks | |
CN112463974A (en) | Method and device for establishing knowledge graph | |
WO2022183019A9 (en) | Methods for mitigation of algorithmic bias discrimination, proxy discrimination and disparate impact | |
Yang et al. | Autonomous semantic community detection via adaptively weighted low-rank approximation | |
Li et al. | Dataset complexity assessment based on cumulative maximum scaled area under Laplacian spectrum | |
CN111737469A (en) | Data mining method and device, terminal equipment and readable storage medium | |
Putra et al. | Analyzing sentiments on official online lending platform in Indonesia with a Combination of Naive Bayes and Lexicon Based Method | |
CN110598192A (en) | Text feature reduction method based on neighborhood rough set |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |