CN109408641A - Text classification method and system based on a supervised topic model - Google Patents

Text classification method and system based on a supervised topic model

Info

Publication number
CN109408641A
CN109408641A (application CN201811398232.1A)
Authority
CN
China
Prior art keywords
theme
slda
classification
text
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811398232.1A
Other languages
Chinese (zh)
Other versions
CN109408641B (en)
Inventor
唐焕玲
窦全胜
于立萍
宋英杰
鲁眀羽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Technology and Business University
Original Assignee
Shandong Technology and Business University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Technology and Business University
Priority to CN201811398232.1A
Publication of CN109408641A
Application granted
Publication of CN109408641B
Legal status: Active
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Abstract

Present disclose provides a kind of based on the file classification method and system that have supervision topic model.Wherein, a kind of based on the file classification method for having supervision topic model, comprising: building SLDA-TC textual classification model;It during training SLDA-TC textual classification model, is sampled according to implicit theme of the SLDA-TC-Gibbs algorithm to each word, and only carries out implicit theme sampling from other training texts identical with text categories label where the word;After determining the implicit theme of each word, by counting the frequency, text-theme probability distribution, theme-Word probability distribution and theme-class probability distribution is calculated;Establish the accurate mapping between theme and classification;The SLDA-TC textual classification model that text input to be measured is generated to training is inferred to the theme of text to be measured, and then predicts the classification of text.

Description

Text classification method and system based on a supervised topic model
Technical field
This disclosure relates to the field of data classification, and more particularly to a text classification method and system based on a supervised topic model.
Background art
The statements in this section merely provide background information related to the present disclosure and do not necessarily constitute prior art.
Text representation is a key step in text mining. The most widely used representation at present is the bag-of-words method (Bag-of-words, BOW). BOW treats a text as a set of words, assumes that each word occurs independently of the other words, and ignores word order, syntax and similar information. Based on BOW, a text is represented by an n-dimensional vector in which each dimension corresponds to a word and carries a weight, usually related to the word's frequency; this is the widely used vector space model (VSM). Owing to the complexity of natural language, text representation suffers from problems such as the "curse of dimensionality", "sparsity" and "semantic loss". Because BOW discards word order and syntax, the semantic information of words is hard to extract and quantify, and semantic text representation remains very difficult at present.
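For concreteness, a minimal VSM sketch using scikit-learn (an illustration of the general technique, not part of the patent; the toy corpus is invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]   # toy corpus
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)           # one n-dimensional row per text
print(vectorizer.get_feature_names_out())    # each dimension corresponds to a word
print(X.toarray())                           # frequency-related (TF-IDF) weights
```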
The word2vec model proposed by Mikolov et al. is a method for training word vectors: using the context of a word, it maps the word to a low-dimensional real-valued vector, and more similar words lie closer together in the vector space. Training word2vec outputs a vector for each word, and the vectors of all words of a text form the text vector. Word vectors trained with word2vec have been fed into deep neural networks and successfully applied to Chinese word segmentation, POS tagging, sentiment classification, syntactic dependency parsing and other tasks. Word2vec alleviates the "sparsity" problem and can quantify the similarity between words, but it does not solve the "semantic loss" and "curse of dimensionality" problems of text representation.
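As an illustration (not from the patent), word vectors of the kind described can be trained with gensim; the corpus and parameters below are invented:

```python
from gensim.models import Word2Vec

sentences = [["text", "mining", "is", "fun"],
             ["topic", "models", "mine", "text"]]              # toy corpus
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)
print(model.wv["text"])                       # low-dimensional real vector of a word
print(model.wv.similarity("text", "topic"))   # similar words lie closer in the space
```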
Topic models can be used to address the "curse of dimensionality" and "sparsity", and they can extract the semantic information of words to some extent. Topic models originate from Latent Semantic Indexing (LSI) and the probabilistic Latent Semantic Indexing (pLSI) proposed by Hofmann. On the basis of pLSI, Blei et al. proposed the LDA (Latent Dirichlet Allocation) topic model. In LDA a topic is viewed as a probability distribution over words; semantically similar words are associated through latent topics, so semantic information can be extracted from text and a text can be transformed from the high-dimensional word space to a low-dimensional topic space. Topic models are used, directly or in extended form, in the field of natural language processing for tasks such as clustering and classification, word sense disambiguation and sentiment analysis, as well as in the field of image processing for tasks such as object detection and localization and image segmentation.
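For context, fitting a plain (unsupervised) LDA model with gensim looks roughly like this; the corpus is invented:

```python
from gensim import corpora, models

texts = [["nasa", "orbit", "launch"],
         ["stock", "market", "trade"],
         ["orbit", "satellite", "launch"]]     # toy tokenized corpus
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for k in range(2):
    print(lda.show_topic(k))   # a topic is a probability distribution over words
```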
When the LDA topic model transforms a text from the high-dimensional word space to the low-dimensional topic space and the result is classified directly with KNN, naive Bayes, SVM or similar algorithms, the effect is unsatisfactory. The reason is that LDA is unsupervised learning: it does not consider the class of a text and makes no use of the class labels, an important piece of information carried by the training texts.
As an existing improvement, Li et al. proposed the Labeled-LDA model; the inventors found that this model trains one LDA model for each class of documents, so the number of parameters to estimate grows several times over, increasing the complexity of the model.
Summary of the invention
According to one aspect of one or more embodiments of the present disclosure, a text classification method based on a supervised topic model is provided; it can identify the semantic relations between topics and classes and establish an exact mapping between topics and classes.
One or more embodiments of the present disclosure provide a text classification method based on a supervised topic model, comprising:
constructing an SLDA-TC text classification model, wherein every document of the training document collection of the SLDA-TC text classification model carries a class label, and the parameters to be estimated in the SLDA-TC text classification model include not only the text-topic probability distribution and the topic-word probability distribution but also the topic-class probability distribution;
training the SLDA-TC text classification model, performing SLDA-TC model parameter estimation according to the SLDA-TC-Gibbs algorithm, wherein the parameter estimation proceeds as follows: the latent topic of each word is sampled, with latent-topic sampling restricted to the other training texts whose class label equals that of the text containing the word; after the latent topic of every word is determined, the text-topic, topic-word and topic-class probability distributions are computed by counting the topic-word, document-topic and topic-class frequencies, thereby establishing an exact mapping between topics and classes;
inferring the topics of the text to be classified and classifying it: the text is input into the trained SLDA-TC text classification model; first the latent topic of each word of the document is sampled; then the topic probability distribution of the text is inferred; finally, from the topic distribution of the document and the topic-class distribution of the SLDA-TC model, the class label of the text is output.
In one or more embodiments, the text-topic probability distribution, the topic-word probability distribution and the topic-class probability distribution follow Dirichlet distributions.
In one or more embodiments, the SLDA-TC model used for classification is generated by iterative training; when iteration ends, the similarity between topics is assessed with the JS divergence, and the semantic relatedness between topics and classes with the topic-class distribution parameter of SLDA-TC.
In one or more embodiments, the evaluation metrics of the classification results of the SLDA-TC text classification model include macro-averaged precision (Macro-Precision), macro-averaged recall (Macro-Recall) and macro-averaged F1 (Macro-F1).
One or more embodiments of the present disclosure further provide a text classification system comprising a text input device, a controller and a display device; the controller includes a memory and a processor, the memory stores a computer program, and the program, when executed by the processor, implements the following steps:
constructing an SLDA-TC text classification model, wherein every document of the training document collection of the SLDA-TC text classification model carries a class label, and the parameters to be estimated in the SLDA-TC text classification model include not only the text-topic probability distribution and the topic-word probability distribution but also the topic-class probability distribution;
training the SLDA-TC text classification model, performing SLDA-TC model parameter estimation according to the SLDA-TC-Gibbs algorithm, wherein the parameter estimation proceeds as follows: the latent topic of each word is sampled, with latent-topic sampling restricted to the other training texts whose class label equals that of the text containing the word; after the latent topic of every word is determined, the text-topic, topic-word and topic-class probability distributions are computed by counting the topic-word, document-topic and topic-class frequencies, thereby establishing an exact mapping between topics and classes;
inferring the topics of the text to be classified and classifying it: the text is input into the trained SLDA-TC text classification model; first the latent topic of each word of the document is sampled; then the topic probability distribution of the text is inferred; finally, from the topic distribution of the document and the topic-class distribution of the SLDA-TC model, the class label of the text is output.
The beneficial effects of the present disclosure are:
With the constructed and trained SLDA-TC text classification model, the text classification method and system of the present disclosure use the text-topic, topic-word and topic-class probability distributions to extract the latent semantic mappings between words and topics, documents and topics, and topics and classes; the number of topics K only needs to be slightly larger than the number of classes C, which not only improves text classification accuracy but also improves time efficiency.
Detailed description of the invention
The accompanying drawings, which form a part of this disclosure, are provided for further understanding of the disclosure; the illustrative embodiments of the disclosure and their descriptions explain the disclosure and do not improperly limit it.
Fig. 1 is a flowchart of the SLDA-TC text classification method of the present disclosure.
Fig. 2 shows the LDA topic model.
Fig. 3 shows the SLDA-TC text classification model.
Fig. 4(a) compares the Macro-Precision of the classification results of SLDA-TC, LDA-TC and SVM on the 20news-rec data set with C=4 and K=8.
Fig. 4(b) compares the Macro-Recall of the classification results of SLDA-TC, LDA-TC and SVM on the 20news-rec data set with C=4 and K=8.
Fig. 4(c) compares the Macro-F1 of the classification results of SLDA-TC, LDA-TC and SVM on the 20news-rec data set with C=4 and K=8.
Fig. 5(a) compares the Macro-Precision of the classification results of SLDA-TC, LDA-TC and SVM on the sogou data set with C=5 and K=8.
Fig. 5(b) compares the Macro-Recall of the classification results of SLDA-TC, LDA-TC and SVM on the sogou data set with C=5 and K=8.
Fig. 5(c) compares the Macro-F1 of the classification results of SLDA-TC, LDA-TC and SVM on the sogou data set with C=5 and K=8.
Fig. 6(a) compares the Macro-Precision of the classification results of SLDA-TC, LDA-TC and SVM on the 20news-sci data set with C=4 and K=90.
Fig. 6(b) compares the Macro-Recall of the classification results of SLDA-TC, LDA-TC and SVM on the 20news-sci data set with C=4 and K=90.
Fig. 6(c) compares the Macro-F1 of the classification results of SLDA-TC, LDA-TC and SVM on the 20news-sci data set with C=4 and K=90.
Fig. 7(a) compares the Macro-Precision of the classification results of SLDA-TC, LDA-TC and SVM on the 20news-talk data set with C=3 and K=90.
Fig. 7(b) compares the Macro-Recall of the classification results of SLDA-TC, LDA-TC and SVM on the 20news-talk data set with C=3 and K=90.
Fig. 7(c) compares the Macro-F1 of the classification results of SLDA-TC, LDA-TC and SVM on the 20news-talk data set with C=3 and K=90.
Specific embodiments
It should be noted that the following detailed description is illustrative and intended to provide further explanation of the disclosure. Unless otherwise indicated, all technical and scientific terms used herein have the same meanings as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It should be noted that the terminology used herein is merely for describing specific embodiments and is not intended to limit the illustrative embodiments according to the disclosure. As used herein, the singular forms are intended to include the plural forms as well unless the context clearly indicates otherwise; it should further be understood that the terms "comprising" and/or "including", when used in this specification, indicate the presence of features, steps, operations, devices, components and/or combinations thereof.
Explanation of terms:
Dirichlet distribution: a probability distribution over a set of continuous multivariate quantities; it is the multivariate generalization of the Beta distribution and is frequently used as a prior in Bayesian statistics.
Gibbs sampling: a Markov chain Monte Carlo (MCMC) algorithm used to draw an approximate sample sequence when direct sampling from a multivariate probability distribution is difficult.
LDA-TC: a method that first extracts a given number of topics with the LDA topic model and then classifies according to the LDA model.
SVM: Support Vector Machine, a common discriminative method; in machine learning it is a supervised learning model, commonly used for pattern recognition, classification and regression analysis.
Macro-Precision: macro-averaged classification precision.
Macro-Recall: macro-averaged recall.
Macro-F1: macro-averaged F1 value (a usage sketch of the three metrics follows this list).
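The three metrics can be computed, for example, with scikit-learn (a sketch on invented labels):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 1, 1, 2, 2]   # invented gold labels
y_pred = [0, 1, 1, 1, 2, 0]   # invented predictions
print(precision_score(y_true, y_pred, average="macro"))   # Macro-Precision
print(recall_score(y_true, y_pred, average="macro"))      # Macro-Recall
print(f1_score(y_true, y_pred, average="macro"))          # Macro-F1
```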
Fig. 1 is a flowchart of a text classification method based on a supervised topic model according to the present disclosure.
As shown in Fig. 1, the text classification method of this embodiment comprises:
S110: construct the SLDA-TC text classification model. Unlike the unsupervised LDA model, every document of the training document collection of the SLDA-TC text classification model carries a class label; the parameters to be estimated in the SLDA-TC text classification model comprise the text-topic probability distribution, the topic-word probability distribution and the topic-class probability distribution.
As shown in Fig. 2, the text collection of the LDA topic model is $\{\mathbf{w}_m\}_{m=1}^{M}$, where M denotes the number of texts and K the number of topics. The model has two sets of parameters, $\theta_m$ and $\varphi_k$: $\theta_m$ denotes the topic probability distribution of the m-th text and $\varphi_k$ the word probability distribution of topic k; $\mathbf{w}_m$ is the bag-of-words vector of the m-th text, $N_m$ the length of the m-th text, $w_{m,n}$ the n-th word of the m-th text, and $z_{m,n}$ the topic assigned to $w_{m,n}$. $\theta_m$ and $\varphi_k$ follow Dirichlet distributions and, as multinomial parameters, generate topics and words respectively; α and β are the hyperparameters of the corresponding Dirichlet distributions. The LDA topic model does not consider the class of any document.
As shown in Fig. 3, unlike LDA, every text $\mathbf{w}_m$ of the training text collection of the SLDA-TC model carries an observable class label $y_m \in [1, C]$, where C denotes the number of classes; the class is assumed to follow a multinomial distribution correlated with the topic probabilities of the text. $w_{m,n}$ and $y_m$ are observable, while $z_{m,n}$ is a latent topic. Besides the parameters $\varphi_k$ and $\theta_m$, the present disclosure introduces a new parameter $\delta_k$ denoting the class probability distribution of the k-th topic. $\varphi_k$, $\theta_m$ and $\delta_k$ follow Dirichlet distributions and, as multinomial parameters, generate words, topics and classes respectively; α, β and γ are the hyperparameters of the corresponding Dirichlet distributions.
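For intuition, the generative story just described can be simulated as follows (a toy sketch with symmetric priors; all dimensions and names are illustrative assumptions, not from the patent):

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, C, M, N = 4, 30, 3, 5, 20         # topics, vocabulary, classes, docs, words/doc
alpha, beta, gamma = 0.1, 0.01, 0.1     # Dirichlet hyperparameters

phi   = rng.dirichlet(np.full(V, beta), size=K)    # topic-word distributions, one per topic
delta = rng.dirichlet(np.full(C, gamma), size=K)   # topic-class distributions, one per topic

docs, labels = [], []
for m in range(M):
    theta_m = rng.dirichlet(np.full(K, alpha))     # topic distribution of text m
    z = rng.choice(K, size=N, p=theta_m)           # latent topic of each word
    w = np.array([rng.choice(V, p=phi[k]) for k in z])   # observed words
    y = rng.choice(C, p=delta[z].mean(axis=0))     # observable class tied to the topics
    docs.append(w); labels.append(y)
```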
S120: SLDA-TC topic model parameter estimation: train the SLDA-TC text classification model and establish the mapping between topics and classes.
LDA estimates the parameters φ and θ with the Gibbs sampling algorithm; the SLDA-TC model needs to estimate φ, θ and δ. On the basis of Gibbs sampling, the present disclosure proposes the SLDA-TC-Gibbs algorithm, which does not compute $\theta_m$ and $\delta_k$ directly: it first samples the latent topic of each word, and once the latent topic of every word is determined, $\theta_m$ and $\delta_k$ can be computed by counting frequencies.
Each time the Gibbs sampling algorithm samples one latent topic component $z_i = k$ of a word w, the other topic components $\mathbf{z}_{\neg i}$ of z are held fixed; the update is computed by formula (1):

$$p\left(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w}\right) \propto \frac{n_{k,\neg i}^{(t)} + \beta_t}{\sum_{v=1}^{V}\left(n_{k,\neg i}^{(v)} + \beta_v\right)}\left(n_{m,\neg i}^{(k)} + \alpha_k\right) \tag{1}$$
Each time the SLDA-TC-Gibbs algorithm samples one latent topic component $z_i = k$ of a word $w_i = t$, the other word components $\mathbf{z}_{\neg i}$ of z are held fixed and $\mathbf{y}_{\neg m}$ is held fixed as well; the update is computed by formula (2):

$$p\left(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w}, y_m = j, \mathbf{y}_{\neg m}\right) \propto \frac{n_{k,\neg i}^{(t)} + \beta_t}{\sum_{v=1}^{V}\left(n_{k,\neg i}^{(v)} + \beta_v\right)}\left(n_{m,\neg i}^{(k)} + \alpha_k\right) \cdot \frac{n_{k,\neg m}^{(j)} + \gamma_j}{\sum_{c=1}^{C}\left(n_{k,\neg m}^{(c)} + \gamma_c\right)} \tag{2}$$

where $\mathbf{w} = (w_1, \dots, w_M)$ is the word vector of all documents, $\mathbf{z} = (z_1, \dots, z_V)$ the topic vector (V is the dictionary length), and $\mathbf{y} = (y_1, \dots, y_M)$ the class vector. Suppose the i-th word is $w_i = t$; $z_i$ denotes the topic variable of the i-th word and $\mathbf{z}_{\neg i}$ the vector z with the i-th item removed. Suppose further that the document containing $w_i = t$ is the m-th document $\mathbf{w}_m$ with class label $y_m = j$, $j \in [1, C]$; $\mathbf{y}_{\neg m}$ then denotes y with the m-th item removed. $n_k^{(v)}$ denotes the number of times topic k is assigned to word v, and $\beta_v$ the Dirichlet prior of word v; $n_m^{(z)}$ denotes the number of words of document m assigned to topic z, and $\alpha_z$ the Dirichlet prior of topic z; $n_{k,\neg i}^{(t)}$ denotes the number of times topic k is assigned to word t with the i-th item (the word $w_i = t$) removed; $n_{k,\neg m}^{(j)}$ denotes the number of documents of class j assigned topic k with the m-th item (the document $\mathbf{w}_m$ of class $y_m = j$) removed, and $\gamma_j$ the Dirichlet hyperparameter of class j.

Compared with formula (1), formula (2) conditions on $\mathbf{y}_{\neg m}$ on the left-hand side and multiplies in one additional factor, $\frac{n_{k,\neg m}^{(j)} + \gamma_j}{\sum_{c}\left(n_{k,\neg m}^{(c)} + \gamma_c\right)}$, on the right-hand side. This restricts the latent-topic sampling of the words in the m-th document: in effect, latent topics are sampled only from the other training documents of the same class as the m-th document. Because documents of the same class have similar topic distributions, formula (2) is more reasonable than formula (1) when class labels are available. The present disclosure generates only one SLDA-TC model and needs to estimate only one set of parameters φ, θ and δ. The derivation of formula (2) is proved as follows:
Proof: given the training document collection $D = \{(\mathbf{w}_m, y_m)\}_{m=1}^{M}$, let $\mathbf{w} = (w_1, \dots, w_M)$, $\mathbf{y} = (y_1, \dots, y_M)$ and $\mathbf{z} = (z_1, \dots, z_V)$. The joint distribution of the SLDA-TC probabilistic model is given by formula (3).
$$p(\mathbf{w}, \mathbf{z}, \mathbf{y} \mid \alpha, \beta, \gamma) = p(\mathbf{w} \mid \mathbf{z}, \beta)\, p(\mathbf{z} \mid \alpha)\, p(\mathbf{y} \mid \mathbf{z}, \gamma) \tag{3}$$
From the Dirichlet distributions:

$$p(\mathbf{w} \mid \mathbf{z}, \beta) = \prod_{k=1}^{K} \frac{\Delta(\mathbf{n}_k + \beta)}{\Delta(\beta)} \tag{4}$$

$$p(\mathbf{z} \mid \alpha) = \prod_{m=1}^{M} \frac{\Delta(\mathbf{n}_m + \alpha)}{\Delta(\alpha)} \tag{5}$$

$$p(\mathbf{y} \mid \mathbf{z}, \gamma) = \prod_{k=1}^{K} \frac{\Delta\left(\mathbf{n}_k^{(c)} + \gamma\right)}{\Delta(\gamma)} \tag{6}$$

where $\mathbf{n}_k = \left(n_k^{(1)}, \dots, n_k^{(V)}\right)$, $\mathbf{n}_m = \left(n_m^{(1)}, \dots, n_m^{(K)}\right)$, $\mathbf{n}_k^{(c)} = \left(n_k^{(1)}, \dots, n_k^{(C)}\right)$, $\Delta(\mathbf{x}) = \frac{\prod_i \Gamma(x_i)}{\Gamma\left(\sum_i x_i\right)}$, and Γ(·) is the Gamma function.
According to the SLDA-TC-Gibbs algorithm,

$$p\left(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w}, \mathbf{y}\right) = \frac{p(\mathbf{z}, \mathbf{w}, \mathbf{y})}{p\left(\mathbf{z}_{\neg i}, \mathbf{w}, \mathbf{y}\right)}. \tag{7}$$

For the word $w_i = t$ of the m-th document, $\mathbf{w}_{\neg i}$ is constant, therefore

$$p\left(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w}, \mathbf{y}\right) \propto \frac{p(\mathbf{z}, \mathbf{w}, \mathbf{y})}{p\left(\mathbf{z}_{\neg i}, \mathbf{w}_{\neg i}, \mathbf{y}\right)}. \tag{8}$$

Substituting (3)-(6) into (8) and cancelling the factors that do not involve $z_i$ yields formula (2). Formula (2) is thus proved.
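In code, the full conditional of formula (2) reduces to a product of three count ratios; a sketch under symmetric priors (the count-matrix names are assumptions of this sketch, not from the patent):

```python
import numpy as np

def sample_topic(t, m, j, nkw, nmk, nkc, alpha, beta, gamma, rng):
    """Draw z_i for word t of document m (class j) per formula (2).

    nkw[k, v]: times topic k is assigned to word v (current word removed)
    nmk[m, k]: words of document m assigned to topic k (current word removed)
    nkc[k, c]: documents of class c assigned topic k (document m removed)
    """
    word_term  = (nkw[:, t] + beta) / (nkw.sum(axis=1) + beta * nkw.shape[1])
    doc_term   = nmk[m] + alpha
    class_term = (nkc[:, j] + gamma) / (nkc.sum(axis=1) + gamma * nkc.shape[1])
    p = word_term * doc_term * class_term      # unnormalized over the K topics
    return rng.choice(len(p), p=p / p.sum())
```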
After the topic label z of every word w has been obtained, the parameters $\varphi_k$, $\theta_m$ and $\delta_k$ of the SLDA-TC model are computed as follows:

$$\varphi_{k,t} = \frac{n_k^{(t)} + \beta_t}{\sum_{v=1}^{V}\left(n_k^{(v)} + \beta_v\right)} \tag{10}$$

$$\theta_{m,k} = \frac{n_m^{(k)} + \alpha_k}{\sum_{k'=1}^{K}\left(n_m^{(k')} + \alpha_{k'}\right)} \tag{11}$$

$$\delta_{k,j} = \frac{n_k^{(j)} + \gamma_j}{\sum_{c=1}^{C}\left(n_k^{(c)} + \gamma_c\right)} \tag{12}$$

where $\varphi_{k,t}$ denotes the probability that topic k generates word t, $\theta_{m,k}$ the probability that document m assigns topic k, and $\delta_{k,j}$ the probability that topic k belongs to class j; $n_k^{(t)}$ denotes the number of times topic k is assigned to word t, $n_m^{(k)}$ the number of words of document m assigned to topic k, and $n_k^{(j)}$ the number of documents of class j assigned topic k; k = 1..K, t = 1..V, m = 1..M.

The estimation of the SLDA-TC model parameters φ, θ and δ is thus complete.
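Formulas (10)-(12) are simple normalized counts; a sketch under the same assumed count matrices and symmetric priors:

```python
import numpy as np

def estimate_parameters(nkw, nmk, nkc, alpha, beta, gamma):
    # formula (10): phi[k, t], probability that topic k generates word t
    phi = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)
    # formula (11): theta[m, k], probability that document m uses topic k
    theta = (nmk + alpha) / (nmk + alpha).sum(axis=1, keepdims=True)
    # formula (12): delta[k, j], probability that topic k belongs to class j
    delta = (nkc + gamma) / (nkc + gamma).sum(axis=1, keepdims=True)
    return phi, theta, delta
```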
Description of the SLDA-TC-Gibbs algorithm (a compact code sketch follows the description):
Algorithm:
Input: the document collection $D = \{(\mathbf{w}_m, y_m)\}_{m=1}^{M}$, hyperparameters α, β, γ, number of topics K, number of iterations T.
Output: the topic assignment z and the parameters φ, θ and δ.
Initialization: the count variables $n_k^{(t)}$, $n_m^{(k)}$ and $n_k^{(j)}$ are initialized to 0.
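Putting the pieces together, the training loop might look as follows (a sketch reusing sample_topic and estimate_parameters from above; integer-tokenized documents are assumed, and for brevity the topic-class counts are kept fixed at document granularity during sweeps rather than strictly excluding document m at every step, as formula (2) requires):

```python
import numpy as np

def train_slda_tc(docs, labels, K, V, C, alpha, beta, gamma, T, seed=0):
    rng = np.random.default_rng(seed)
    nkw = np.zeros((K, V)); nmk = np.zeros((len(docs), K)); nkc = np.zeros((K, C))
    z = [rng.integers(K, size=len(d)) for d in docs]    # random initial topics
    for m, d in enumerate(docs):                        # fill the count variables
        for n, t in enumerate(d):
            nkw[z[m][n], t] += 1; nmk[m, z[m][n]] += 1
        for k in set(z[m]):
            nkc[k, labels[m]] += 1                      # document counted once per topic
    for _ in range(T):                                  # T Gibbs sweeps
        for m, d in enumerate(docs):
            for n, t in enumerate(d):
                k_old = z[m][n]
                nkw[k_old, t] -= 1; nmk[m, k_old] -= 1  # remove the i-th item
                k_new = sample_topic(t, m, labels[m], nkw, nmk, nkc,
                                     alpha, beta, gamma, rng)
                z[m][n] = k_new
                nkw[k_new, t] += 1; nmk[m, k_new] += 1
    return estimate_parameters(nkw, nmk, nkc, alpha, beta, gamma)
```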
After the SLDA-TC model has been trained, the latent topic of each word of a new document d is inferred by formula (13):

$$p\left(z_i = k \mid \tilde{\mathbf{z}}_{\neg i}, \tilde{\mathbf{w}}\right) \propto \frac{n_k^{(t)} + \tilde{n}_{k,\neg i}^{(t)} + \beta_t}{\sum_{v=1}^{V}\left(n_k^{(v)} + \tilde{n}_{k,\neg i}^{(v)} + \beta_v\right)}\left(\tilde{n}_{d,\neg i}^{(k)} + \alpha_k\right) \tag{13}$$

where $\tilde{\mathbf{w}}$ denotes the word vector of the new document d and $\tilde{\mathbf{z}}$ its topic vector; $\tilde{n}_{k,\neg i}^{(t)}$ denotes the number of times the new document, with its i-th item (i.e. the i-th word) removed, assigns topic k to word t; $\tilde{n}_{d,\neg i}^{(k)}$ denotes the number of words of the new document d, with the i-th item removed, assigned to topic k; the meanings of the other symbols are as in formula (2).
The latent topic label of each word of the new document d is computed by formula (13), and the probability that d belongs to each topic is then computed by formula (14):

$$\vartheta_{d,k} = \frac{\tilde{n}_d^{(k)} + \alpha_k}{\sum_{k'=1}^{K}\left(\tilde{n}_d^{(k')} + \alpha_{k'}\right)} \tag{14}$$

The topic probability distribution $\vartheta_d$ of the new document d is thus obtained.
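A sketch of the inference of formulas (13)-(14) for a new document (reusing the assumed count matrix nkw of the trained model):

```python
import numpy as np

def infer_theta(doc, nkw, alpha, beta, T, rng):
    """Infer the topic distribution of a new document, formulas (13)-(14)."""
    K, V = nkw.shape
    z = rng.integers(K, size=len(doc))                 # random initial topics
    new_kw = np.zeros((K, V)); new_k = np.zeros(K)     # counts of the new document
    for n, t in enumerate(doc):
        new_kw[z[n], t] += 1; new_k[z[n]] += 1
    for _ in range(T):
        for n, t in enumerate(doc):
            new_kw[z[n], t] -= 1; new_k[z[n]] -= 1     # remove the i-th item
            word_term = (nkw[:, t] + new_kw[:, t] + beta) / \
                        ((nkw + new_kw).sum(axis=1) + beta * V)
            p = word_term * (new_k + alpha)            # formula (13), unnormalized
            z[n] = rng.choice(K, p=p / p.sum())
            new_kw[z[n], t] += 1; new_k[z[n]] += 1
    return (new_k + alpha) / (new_k + alpha).sum()     # formula (14)
```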
S130: input the text to be classified into the trained SLDA-TC text classification model, infer the topics of the text, and thereby predict the class of the text.
Given the trained SLDA-TC model, let $\tilde{\mathbf{w}}_d$ denote the word vector of the new document d, $\tilde{\mathbf{z}}_d$ its topic vector, and $\tilde{y}_d$ the class prediction for the new document d. The class of the new document is computed as follows:

$$\tilde{y}_d = \arg\max_{j \in [1, C]} p\left(\tilde{y}_d = j \mid \tilde{\mathbf{z}}_d\right) \tag{15}$$
It is assumed that the test sample set and the training sample set follow the same distribution, i.e. they agree on the latent topic-class distribution; therefore $p(\tilde{y} \mid \tilde{\mathbf{z}})$ can be replaced by $p(y \mid \mathbf{z})$, which is given by the parameter δ of the SLDA-TC model. With $\vartheta_d$ the topic probability distribution of the new document d, formulas (12), (14) and (15) yield:

$$\tilde{y}_d = \arg\max_{j \in [1, C]} \sum_{k=1}^{K} \vartheta_{d,k}\, \delta_{k,j} \tag{16}$$
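Formulas (15)-(16) then reduce to a matrix-vector product followed by an argmax (a sketch reusing infer_theta and the estimated delta from the earlier sketches):

```python
import numpy as np

def classify(doc, nkw, delta, alpha, beta, T, rng):
    theta_d = infer_theta(doc, nkw, alpha, beta, T, rng)   # formulas (13)-(14)
    # formula (16): argmax over classes j of sum_k theta[d, k] * delta[k, j]
    return int(np.argmax(theta_d @ delta))
```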
The similarity between topics is assessed with the JS divergence (Jensen-Shannon divergence), also called the JS distance, a variant of the KL divergence (Kullback-Leibler divergence); it is computed by formula (17). Unlike the KL divergence, the JS divergence is symmetric, and its square root satisfies the triangle inequality.
$$\mathrm{JS}(p_i \| p_j) = 0.5\,\mathrm{KL}\!\left(p_i \,\Big\|\, \tfrac{p_i + p_j}{2}\right) + 0.5\,\mathrm{KL}\!\left(p_j \,\Big\|\, \tfrac{p_i + p_j}{2}\right) \tag{17}$$
where $p_i$ and $p_j$ denote the word probability distributions of topic i and topic j; the value range of the JS divergence is [0, 1], where 0 indicates that $p_i$ and $p_j$ are identically distributed and 1 that they are maximally different.
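Formula (17) in code (with base-2 logarithms so the value stays in [0, 1]):

```python
import numpy as np

def kl(p, q):                       # Kullback-Leibler divergence KL(p || q)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def js(p_i, p_j):                   # formula (17): Jensen-Shannon divergence
    m = 0.5 * (p_i + p_j)
    return 0.5 * kl(p_i, m) + 0.5 * kl(p_j, m)
```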
The semantic relatedness between a topic and a class is measured by the parameter δ of the SLDA-TC model, computed as in formula (12); $\delta_{k,j}$ denotes the probability that topic k belongs to class j.
In one or more embodiments, the SLDA-TC model used for classification is generated by iterative training; when iteration ends, the similarity between topics is assessed with the JS divergence, and the semantic relatedness between topics and classes with the topic-class distribution parameter of SLDA-TC. The evaluation metrics of the text classification results of the SLDA-TC model are macro-averaged precision (Macro-Precision), macro-averaged recall (Macro-Recall) and macro-averaged F1 (Macro-F1).
Experimental verification and analysis:
Three data subsets, rec, sci and talk, of the English 20newsgroup data set were chosen, together with a subset of the Sogou Chinese corpus covering 5 classes: IT, military, education, tourism and finance. In each data subset the ratio of training samples to test samples is 8:2; the data subsets are described in Table 1. Chinese texts are segmented with jieba, English stemming uses nltk.stem, and after stop-word removal, feature selection is performed with TF-IDF, retaining 60% of the feature words in the experiments (a preprocessing sketch follows Table 1).
Table 1. Description of the data sets

Data set      Training texts   Classes   Feature words
20news-rec    3979             4         47067
20news-sci    2373             4         57943
20news-talk   1676             3         40945
sogou         2445             5         70819
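A sketch of the preprocessing pipeline described above (jieba segmentation, nltk.stem stemming, stop-word removal, TF-IDF feature selection keeping 60% of the feature words); the helper names and selection heuristic are assumptions of this sketch, not from the patent:

```python
import numpy as np
import jieba
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def tokenize_zh(text):                       # Chinese word segmentation with jieba
    return list(jieba.cut(text))

def tokenize_en(text):                       # English stemming with nltk.stem
    return [stemmer.stem(tok) for tok in text.split()]

def select_features(docs, tokenizer, keep_ratio=0.6):
    vec = TfidfVectorizer(tokenizer=tokenizer, stop_words="english")
    X = vec.fit_transform(docs)
    scores = np.asarray(X.max(axis=0).todense()).ravel()   # best TF-IDF per word
    n_keep = int(keep_ratio * len(scores))
    keep = scores.argsort()[::-1][:n_keep]                 # keep the 60% highest-scored
    return set(vec.get_feature_names_out()[keep])
```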
To verify the effectiveness of the proposed method, experiments compare three algorithms: SLDA-TC, LDA-TC and SVM. SLDA-TC is the algorithm proposed in this disclosure; LDA-TC classifies directly with a traditional LDA model; SVM is the SVM classification algorithm using the K topics of an LDA model as features.
The goal of topic inference in the SLDA-TC model is to establish the mapping between topics and classes. The number of topics K is related to the number of classes C of the labeled training set, and the experiments show that K only needs to take a value slightly larger than C for SLDA-TC to reach good classification accuracy. In the experiments we analyze the JS distances between the topics generated by the SLDA-TC topic model, the probability distributions of the top 10 feature words of each topic, and the similarity between classes and topics. Tables 2-4 describe the experimental results of the SLDA-TC topic model generated on the sogou data subset with C=5 and K=8, and Tables 5-7 describe the experimental results on the 20news-talk data subset; the values of α, β and γ of the SLDA-TC topic model are all 0.01.
Table 2. JS divergence between topics (SLDA-TC, sogou, C=5, K=8)
Table 3. Relatedness between topics and classes (SLDA-TC, sogou, C=5, K=8)
Table 4. Probability distributions of the top 10 words of each topic (SLDA-TC, sogou, C=5, K=8)
As shown in Table 2, the JS divergence between topics 2, 5 and 7 is 0, indicating that they are identically distributed topics; as can be seen from Table 4, the probability distributions of the top 10 feature words of these 3 topics are also identical. The JS divergences between the other 5 topics lie between 0.45 and 0.58, indicating 5 distinct, well-separated topics. As can be seen from Tables 3 and 4, topic 0 maps to class "IT", topic 1 to class "tourism", topic 3 to class "finance", topic 4 to class "military" and topic 6 to class "education", with similarities above 99%, while topics 2, 5 and 7 are unrelated to any class and are called "useless" topics.
Table 5. JS divergence between topics (SLDA-TC, 20news-talk, C=3, K=6)
Table 6. Relatedness between classes and topics (SLDA-TC, 20news-talk, C=3, K=6)
Table 7. Probability distributions of the top 10 words of each topic (SLDA-TC, 20news-talk, C=3, K=6)
Tables 5-7 show the experimental results on 20news-talk: topics 1, 3 and 4 are identically distributed "useless" topics; among the meaningful topics, topic 0 corresponds to class talk.politics.guns, topic 2 to talk.politics.misc, and topic 5 to talk.politics.mideast.
Repeated experiments show that when K - C >= 2, K - C "useless" topics with pairwise JS divergence 0 are generated; therefore K only needs to be chosen slightly larger than C, so the K of the SLDA-TC topic model is easy to determine. The SLDA-TC topic model can filter out the K - C "useless" topics and establish the exact mapping between topics and classes. Meanwhile, since K is only slightly larger than C, far below the K values of LDA, the training time of the model can be reduced significantly.
Figs. 4(a)-7(c) compare the classification results of SLDA-TC, LDA-TC and SVM on the Sogou Chinese data set and the 3 sub-data-sets of 20newsgroup, with 60% of the feature words retained by TF-IDF feature selection and with the SLDA and LDA models generated iteratively for different values of the number of topics K.
In Figs. 4(a)-7(c), the abscissa denotes the number of iterations.
Figs. 4(a)-7(c) compare the classification results of SLDA-TC, LDA-TC and SVM on the 4 data sets when the number of topics K takes different values.
(1) For SLDA-TC, the number of topics K only needs to be slightly larger than the number of classes C, and its classification results are better than those of LDA-TC and SVM.
As shown in Figs. 4(a)-4(c), on the 20news-rec data set with C=4 and K=8, the Macro-Precision, Macro-Recall and Macro-F1 of SLDA-TC are 95.10%, 94.99% and 94.98%, versus 63.76%, 60.91% and 60.33% for LDA-TC and 68.82%, 68.33% and 68.08% for SVM. As K increases, LDA-TC and SVM improve; at K=80 LDA-TC reaches at most 71.85%, 71.38% and 71.41%, and SVM at most 83.90%, 83.70% and 83.62%, still below SLDA-TC. In addition, the training time of the topic model at K=80 is much higher than that of the SLDA-TC model at K=6.
As shown in Figs. 5(a)-5(c), on the sogou data set with C=5 and K=8, the Macro-Precision, Macro-Recall and Macro-F1 of SLDA-TC are 92.80%, 92.73% and 92.70%, versus 72.67%, 68.89% and 67.48% for LDA-TC and 80.69%, 80.40% and 80.28% for SVM. As K increases, the classification metrics of SVM improve steadily; when K exceeds 60, the three metrics rise to 89.26%, 89.95% and 89.24%, still below the 92.80%, 92.73% and 92.70% of SLDA-TC at K=8, and the SVM at K=60 (with LDA topics as features) pays a much higher time cost than SLDA-TC at K=8.
(2) The value of K for SLDA-TC is not simply "the larger the better": when K is very large, the three classification metrics of SLDA-TC instead decline, which again shows that K only needs to be slightly larger than C.
As shown in Figs. 6(a)-6(c), on the 20news-sci data set with C=4 and K=90, the classification results of SLDA-TC are instead very poor. The reason is that with C=4, only 4 of the 90 generated topics are related to the classes; the remaining 86 are "useless" topics that do not help classification and instead cause interference, degrading the results. The same holds on the 20news-talk data set with C=3 and K=90, as shown in Figs. 7(a)-7(c). Extensive experimental results show that, for the SLDA-TC algorithm, taking K slightly larger than C yields very high classification accuracy.
Table 8 compares SLDA-TC with LDA-TC and SVM in time performance and classification results on the different data sets.
Table 8. Comparison of SLDA-TC with LDA-TC and SVM in time performance and classification results
The generation time of a topic model is proportional to the number of topics K: the larger K, the higher the time cost. For the SLDA-TC model, K only needs to be slightly larger than the number of classes C to obtain very good classification results, whereas LDA-TC and the SVM algorithm using LDA topics as features need K to reach tens or even hundreds to obtain good classification results, as can also be seen from the experimental results shown in Figs. 4(a)-7(c).
As shown in Table 8, on 20news-rec the ratio of the model generation times of LDA and SLDA-TC is 4.86, i.e. SLDA is 4.86 times faster than LDA, because their K values are 200 and 8 respectively. On the 20news-sci, 20news-talk and sogou data sets, SLDA is respectively 4.78, 5.16 and 4.90 times faster than LDA. Meanwhile, in terms of Macro-Precision, Macro-Recall and Macro-F1, the SLDA-TC algorithm is clearly better than the LDA-TC and SVM algorithms: on the 4 data sets SLDA-TC exceeds SVM by 3.10% to 9.30% and LDA-TC by 7.10% to 34.08%.
In conclusion, the number of topics K of the SLDA-TC model only needs to take a value slightly larger than the number of classes C; the model can identify topics closely related to the classes, and in both classification accuracy and time performance it is clearly better than LDA-TC and the SVM algorithm that uses LDA topics as features.
Addressing the problems of LDA in text classification, the present disclosure proposes an SLDA-TC text classification model based on a supervised topic model, together with the SLDA-TC-Gibbs parameter estimation algorithm: each time one latent topic component $z_i = k$ of a word $w_i = t$ is sampled, the other word components $\mathbf{z}_{\neg i}$ of z are held fixed and $\mathbf{y}_{\neg m}$ is held fixed as well, i.e. latent-topic sampling is performed only over the other training documents whose class label equals that of the document containing the word, because documents of the same class have similar topic distributions; a theoretical proof is given. The SLDA-TC model introduces the topic-class probability distribution parameter δ and, through the φ, θ and δ probability distributions, extracts the latent semantic mappings between words and topics, documents and topics, and topics and classes. In addition, the number of topics K of SLDA-TC only needs to be slightly larger than the number of classes. Experiments show that the SLDA-TC model can significantly improve classification accuracy and time efficiency.
One or more embodiments of the present disclosure further provide a text classification system comprising a text input device, a controller and a display device; the controller includes a memory and a processor, the memory stores a computer program, and the program, when executed by the processor, implements the following steps, as shown in Fig. 1:
(1) Construct the SLDA-TC text classification model. Unlike the unsupervised LDA model, every document of the training document collection of the SLDA-TC text classification model carries a class label; the parameters to be estimated in the SLDA-TC text classification model include not only the text-topic probability distribution and the topic-word probability distribution but also the topic-class probability distribution, where the text-topic probability distribution, the word probability distribution of each topic and the class probability distribution of each topic follow Dirichlet distributions.
(2) Train the SLDA-TC text classification model and perform SLDA-TC model parameter estimation.
Specifically, during training of the SLDA-TC text classification model, the number of topics K is first set to a value slightly larger than the number of classes C; then the latent topic of each word is sampled according to the SLDA-TC-Gibbs algorithm, with sampling restricted to the other training texts whose class label equals that of the text containing the word; after the latent topic of every word is determined, the text-topic, topic-word and topic-class probability distributions are computed by counting the topic-word, document-topic and topic-class frequencies, and the exact mapping between topics and classes is established.
(3) Infer the topics of the text to be classified and classify it. The text is input into the trained SLDA-TC text classification model; first the latent topic of each word of the document is sampled; then the topic probability distribution of the text is inferred; finally, from the topic distribution of the document and the topic-class distribution of the SLDA-TC model, the class label of the text is output.
(4) Assess the SLDA-TC model and the classification results. For the SLDA-TC model generated by iterative training, the similarity between topics is assessed with the JS divergence, and the semantic relatedness between topics and classes with the topic-class distribution parameter of SLDA-TC; the text classification results of the SLDA-TC model are evaluated with the metrics macro-averaged precision (Macro-Precision), macro-averaged recall (Macro-Recall) and macro-averaged F1 (Macro-F1).
With the constructed and trained SLDA-TC text classification model, the text classification method and system of the present disclosure use the text-topic probability distribution, the word probability distribution of each topic and the class probability distribution of each topic to extract the latent semantic mappings between words and topics, documents and topics, and topics and classes, improving both text classification accuracy and time efficiency.
Those skilled in the art will appreciate that embodiments of the present disclosure may be provided as a method, a system or a computer program product. Accordingly, the disclosure may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Moreover, the disclosure may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage and optical storage) containing computer-usable program code.
The disclosure is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the invention. It should be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce a device for realizing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to work in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that realizes the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for realizing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
Those of ordinary skill in the art will understand that all or part of the processes of the above-described embodiment methods can be completed by instructing the relevant hardware through a computer program; the program can be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.
Although the specific embodiments of the present disclosure have been described above with reference to the accompanying drawings, they do not limit the protection scope of the disclosure. Those skilled in the art should understand that, on the basis of the technical solutions of the disclosure, various modifications or variations that can be made without creative effort still fall within the protection scope of the disclosure.

Claims (8)

1. A text classification method based on a supervised topic model, characterized by comprising:
constructing an SLDA-TC text classification model, wherein every document of the training document collection of the SLDA-TC text classification model carries a class label, and the parameters to be estimated in the SLDA-TC text classification model include not only a text-topic probability distribution and a topic-word probability distribution but also a topic-class probability distribution;
training the SLDA-TC text classification model, performing SLDA-TC model parameter estimation according to an SLDA-TC-Gibbs algorithm, wherein the process of performing SLDA-TC model parameter estimation according to the SLDA-TC-Gibbs algorithm is as follows: the latent topic of each word is sampled, with latent-topic sampling restricted to the other training texts whose class label equals that of the text containing the word; after the latent topic of every word is determined, the text-topic, topic-word and topic-class probability distributions are computed by counting the topic-word, document-topic and topic-class frequencies, thereby establishing an exact mapping between topics and classes; and
inferring the topics of a text to be classified and classifying it: the text is input into the trained SLDA-TC text classification model; first the latent topic of each word of the document is sampled; then the topic probability distribution of the text is inferred; finally, from the topic distribution of the document and the topic-class distribution of the SLDA-TC model, the class label of the text is output.
2. The text classification method based on a supervised topic model according to claim 1, characterized in that the parameters to be estimated in the SLDA-TC text classification model include not only the text-topic probability distribution and the topic-word probability distribution but also the topic-class probability distribution, and the parameters follow Dirichlet distributions.
3. The text classification method based on a supervised topic model according to claim 1, characterized in that the SLDA-TC model used for text classification is generated by iterative training; when iteration ends, the similarity between topics is assessed with the JS divergence, and the semantic relatedness between topics and classes is assessed with the topic-class distribution parameter of SLDA-TC.
4. The text classification method based on a supervised topic model according to claim 3, characterized in that the evaluation metrics of the text classification results include macro-averaged precision, macro-averaged recall and macro-averaged F1.
5. A text classification system based on a supervised topic model, comprising a text input device, a controller and a display device, the controller including a memory and a processor, characterized in that the memory stores a computer program which, when executed by the processor, implements the following steps:
constructing an SLDA-TC text classification model, wherein every document of the training document collection of the SLDA-TC text classification model carries a class label, and the parameters to be estimated in the SLDA-TC text classification model include not only a text-topic probability distribution and a topic-word probability distribution but also a topic-class probability distribution;
training the SLDA-TC text classification model, performing SLDA-TC model parameter estimation according to an SLDA-TC-Gibbs algorithm, wherein the process of performing SLDA-TC model parameter estimation according to the SLDA-TC-Gibbs algorithm is as follows: the latent topic of each word is sampled, with latent-topic sampling restricted to the other training texts whose class label equals that of the text containing the word; after the latent topic of every word is determined, the text-topic, topic-word and topic-class probability distributions are computed by counting the topic-word, document-topic and topic-class frequencies, thereby establishing an exact mapping between topics and classes; and
inferring the topics of a text to be classified and classifying it: the text is input into the trained SLDA-TC text classification model; first the latent topic of each word of the document is sampled; then the topic probability distribution of the text is inferred; finally, from the topic distribution of the document and the topic-class distribution of the SLDA-TC model, the class label of the text is output.
6. The text classification system based on a supervised topic model according to claim 5, characterized in that the text-topic probability distribution, the word probability distribution of each topic and the class probability distribution of each topic follow Dirichlet distributions.
7. The text classification system based on a supervised topic model according to claim 5, characterized in that the SLDA-TC model used for text classification is generated by iterative training; when iteration ends, the similarity between topics is assessed with the JS divergence, and the semantic relatedness between topics and classes is assessed with the topic-class distribution parameter of SLDA-TC.
8. The text classification system based on a supervised topic model according to claim 7, characterized in that the evaluation metrics of the text classification results include macro-averaged precision, macro-averaged recall and macro-averaged F1.
CN201811398232.1A 2018-11-22 2018-11-22 Text classification method and system based on supervised topic model Active CN109408641B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811398232.1A CN109408641B (en) 2018-11-22 2018-11-22 Text classification method and system based on supervised topic model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811398232.1A CN109408641B (en) 2018-11-22 2018-11-22 Text classification method and system based on supervised topic model

Publications (2)

Publication Number Publication Date
CN109408641A 2019-03-01
CN109408641B (en) 2020-06-02

Family

ID=65474659

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811398232.1A Active CN109408641B (en) 2018-11-22 2018-11-22 Text classification method and system based on supervised topic model

Country Status (1)

Country Link
CN (1) CN109408641B (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102662960A (en) * 2012-03-08 2012-09-12 浙江大学 On-line supervised theme-modeling and evolution-analyzing method
CN103810500A (en) * 2014-02-25 2014-05-21 北京工业大学 Place image recognition method based on supervised learning probability topic model
CN103903164A (en) * 2014-03-25 2014-07-02 华南理工大学 Semi-supervised automatic aspect extraction method and system based on domain information
CN105069143A (en) * 2015-08-19 2015-11-18 百度在线网络技术(北京)有限公司 Method and device for extracting keywords from document

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BLEI D M: "Supervised topic models", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS *
SHIBIN ZHOU: "Text Categorization Based on Topic Model", INTERNATIONAL JOURNAL OF COMPUTATIONAL INTELLIGENCE SYSTEMS *
李文波 (LI WENBO): "基于Labeled-LDA模型的文本分类新算法 (A new text classification algorithm based on the Labeled-LDA model)", 《计算机学报》 (CHINESE JOURNAL OF COMPUTERS) *
王丹丹 (WANG DANDAN): "基于宏特征融合的文本分类 (Text classification based on macro-feature fusion)", 《中文信息学报》 (JOURNAL OF CHINESE INFORMATION PROCESSING) *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723198B (en) * 2019-03-18 2023-09-01 北京汇钧科技有限公司 Text emotion recognition method, device and storage medium
CN111723198A (en) * 2019-03-18 2020-09-29 北京京东尚科信息技术有限公司 Text emotion recognition method and device and storage medium
CN110135592B (en) * 2019-05-16 2023-09-19 腾讯科技(深圳)有限公司 Classification effect determining method and device, intelligent terminal and storage medium
CN110135592A (en) * 2019-05-16 2019-08-16 腾讯科技(深圳)有限公司 Classifying quality determines method, apparatus, intelligent terminal and storage medium
CN110321434A (en) * 2019-06-27 2019-10-11 厦门美域中央信息科技有限公司 A kind of file classification method based on word sense disambiguation convolutional neural networks
CN110569270A (en) * 2019-08-15 2019-12-13 中国人民解放军国防科技大学 bayesian-based LDA topic label calibration method, system and medium
CN110569270B (en) * 2019-08-15 2022-07-05 中国人民解放军国防科技大学 Bayesian-based LDA topic label calibration method, system and medium
CN110795564B (en) * 2019-11-01 2022-02-22 南京稷图数据科技有限公司 Text classification method lacking negative cases
CN110795564A (en) * 2019-11-01 2020-02-14 南京稷图数据科技有限公司 Text classification method lacking negative cases
CN110825850B (en) * 2019-11-07 2022-07-08 哈尔滨工业大学(深圳) Natural language theme classification method and device
CN110825850A (en) * 2019-11-07 2020-02-21 哈尔滨工业大学(深圳) Natural language theme classification method and device
CN111368532B (en) * 2020-03-18 2022-12-09 昆明理工大学 Topic word embedding disambiguation method and system based on LDA
CN111368532A (en) * 2020-03-18 2020-07-03 昆明理工大学 Topic word embedding disambiguation method and system based on LDA
CN112733542B (en) * 2021-01-14 2022-02-08 北京工业大学 Theme detection method and device, electronic equipment and storage medium
CN112733542A (en) * 2021-01-14 2021-04-30 北京工业大学 Theme detection method and device, electronic equipment and storage medium
CN113032573A (en) * 2021-04-30 2021-06-25 《中国学术期刊(光盘版)》电子杂志社有限公司 Large-scale text classification method and system combining theme semantics and TF-IDF algorithm
CN113032573B (en) * 2021-04-30 2024-01-23 同方知网数字出版技术股份有限公司 Large-scale text classification method and system combining topic semantics and TF-IDF algorithm

Also Published As

Publication number Publication date
CN109408641B (en) 2020-06-02

Similar Documents

Publication Publication Date Title
CN109408641A (en) It is a kind of based on have supervision topic model file classification method and system
Muzellec et al. Generalizing point embeddings using the wasserstein space of elliptical distributions
Qaisar Sentiment analysis of IMDb movie reviews using long short-term memory
CN108363804B (en) Local model weighted fusion Top-N movie recommendation method based on user clustering
Clinchant et al. Aggregating continuous word embeddings for information retrieval
CN108959305A (en) A kind of event extraction method and system based on internet big data
Kim et al. Beyond sentiment: The manifold of human emotions
Mikawa et al. A proposal of extended cosine measure for distance metric learning in text classification
CN111209402A (en) Text classification method and system integrating transfer learning and topic model
Budhiraja et al. A supervised learning approach for heading detection
CN109783633A (en) Data analysis service procedural model recommended method
Pamuji Performance of the K-Nearest Neighbors Method on Analysis of Social Media Sentiment
Park et al. Phrase embedding and clustering for sub-feature extraction from online data
Steuber et al. Topic modeling of short texts using anchor words
Hossny et al. Enhancing keyword correlation for event detection in social networks using SVD and k-means: Twitter case study
CN110069558A (en) Data analysing method and terminal device based on deep learning
CN111625578B (en) Feature extraction method suitable for time series data in cultural science and technology fusion field
Dachapally et al. In-depth question classification using convolutional neural networks
CN112463974A (en) Method and device for establishing knowledge graph
WO2022183019A9 (en) Methods for mitigation of algorithmic bias discrimination, proxy discrimination and disparate impact
Yang et al. Autonomous semantic community detection via adaptively weighted low-rank approximation
Li et al. Dataset complexity assessment based on cumulative maximum scaled area under Laplacian spectrum
CN111737469A (en) Data mining method and device, terminal equipment and readable storage medium
Putra et al. Analyzing sentiments on official online lending platform in Indonesia with a Combination of Naive Bayes and Lexicon Based Method
CN110598192A (en) Text feature reduction method based on neighborhood rough set

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant