CN109408641A - Text classification method and system based on a supervised topic model - Google Patents
- Publication number: CN109408641A (application CN201811398232.1A)
- Authority
- CN
- China
- Legal status: Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
Abstract
Present disclose provides a kind of based on the file classification method and system that have supervision topic model.Wherein, a kind of based on the file classification method for having supervision topic model, comprising: building SLDA-TC textual classification model;It during training SLDA-TC textual classification model, is sampled according to implicit theme of the SLDA-TC-Gibbs algorithm to each word, and only carries out implicit theme sampling from other training texts identical with text categories label where the word;After determining the implicit theme of each word, by counting the frequency, text-theme probability distribution, theme-Word probability distribution and theme-class probability distribution is calculated;Establish the accurate mapping between theme and classification;The SLDA-TC textual classification model that text input to be measured is generated to training is inferred to the theme of text to be measured, and then predicts the classification of text.
Description
Technical field
The present disclosure relates to the field of data classification, and in particular to a text classification method and system based on a supervised topic model.
Background technique
The statements in this section merely provide background information related to the present disclosure and do not necessarily constitute prior art.
Text representation is a key step in text mining, and the most widely used representation at present is the bag-of-words model (BOW). BOW treats a text as a set of words, assumes that each word occurs independently of the others, and ignores word order, syntax and similar information. Under BOW a text is represented as an n-dimensional vector in which each dimension corresponds to one word, usually weighted by the word's frequency; this is the widely used vector space model (VSM). Because of the complexity of natural language, such text representations suffer from the "curse of dimensionality", "sparsity" and "semantic loss". Since BOW discards word order, syntax and the like, the semantic information of words is hard to extract and quantify, and semantic representation of text remains very difficult.
The word2vec model proposed by Mikolov et al. is a method for training word vectors: using the context of a word, it maps the word to a low-dimensional real-valued vector, so that more similar words lie closer together in the vector space. Word2vec outputs one vector per word, and the vectors of all the words of a text form the text's vector representation. Word vectors trained with word2vec have been fed into deep neural networks and successfully applied to Chinese word segmentation, POS tagging, sentiment classification, syntactic dependency parsing and other tasks. Word2vec solves the "sparsity" problem, and although it can quantify word-to-word similarity, it does not solve the "semantic loss" and "curse of dimensionality" problems of text.
Topic models can be used to address the "curse of dimensionality" and "sparsity", and can extract the semantic information of words to some extent. They originate from Latent Semantic Indexing (LSI) and the probabilistic Latent Semantic Indexing (pLSI) proposed by Hofmann. Building on pLSI, Blei et al. proposed the LDA (Latent Dirichlet Allocation) topic model. In LDA a topic is a probability distribution over words; semantically similar words are linked through latent topics, so semantic information can be extracted from text and a text can be transformed from the high-dimensional word space to the low-dimensional topic space. Topic models are used, directly or in extended form, throughout natural language processing, for tasks such as clustering and classification, word sense disambiguation and sentiment analysis, as well as for image-processing tasks such as object detection and localization and image segmentation.
Transforming texts with an LDA topic model from the high-dimensional word space to the low-dimensional topic space and then classifying them directly with KNN, naive Bayes, SVM or similar algorithms performs poorly. The reason is that LDA is unsupervised learning: it does not consider the classes of the texts and makes no use of the important class-label information of the training texts. Among existing improvements, Li et al. proposed the Labeled-LDA model; the inventors found that this model trains one LDA model per document class, so the number of parameters to estimate grows several-fold, increasing the complexity of the model.
Summary of the invention
According to one aspect of one or more embodiments of the present disclosure, a text classification method based on a supervised topic model is provided, which can identify the semantic relations between topics and classes and establish an exact mapping between topics and classes.
One or more embodiments of the present disclosure provide a text classification method based on a supervised topic model, comprising:
Constructing an SLDA-TC text classification model, in which every document of the training document collection carries a class label; the parameters to be estimated in the SLDA-TC text classification model include not only the text-topic probability distribution and the topic-word probability distribution but also the topic-class probability distribution.
Training the SLDA-TC text classification model by estimating the SLDA-TC model parameters with the SLDA-TC-Gibbs algorithm. The parameter estimation proceeds as follows: the latent topic of each word is sampled, and it is sampled only from the other training texts that share the class label of the text containing the word; once the latent topic of every word is determined, the topic-word, document-topic and topic-class frequencies are counted to compute the text-topic probability distribution, the topic-word probability distribution and the topic-class probability distribution, thereby establishing an exact mapping between topics and classes.
Topic inference and classification of the text to be classified: the text is fed into the trained SLDA-TC text classification model; first a latent topic is sampled for each word of the document, then the topic probability distribution of the text is inferred, and finally its class label is output from the topic distribution of the document and the topic-class distribution of the SLDA-TC model.
In one or more embodiments, the text-topic probability distribution, the topic-word probability distribution and the topic-class probability distribution obey Dirichlet distributions.
In one or more embodiments, the SLDA-TC model used for classification is generated by iterative training; after the iterations finish, the similarity between topics is assessed with the JS divergence, and the semantic relevance between a topic and a class is assessed with the topic-class distribution parameter of SLDA-TC.
In one or more embodiments, the evaluation indicators of the classification results of the SLDA-TC text classification model include the macro-averaged precision (Macro-Precision), the macro-averaged recall (Macro-Recall) and the macro-averaged F1 value (Macro-F1).
One or more embodiments of the present disclosure further provide a text classification system comprising a text input device, a controller and a display device. The controller comprises a memory and a processor; the memory stores a computer program which, when executed by the processor, implements the following steps:
Constructing an SLDA-TC text classification model, in which every document of the training document collection carries a class label; the parameters to be estimated in the SLDA-TC text classification model include not only the text-topic probability distribution and the topic-word probability distribution but also the topic-class probability distribution.
Training the SLDA-TC text classification model by estimating the SLDA-TC model parameters with the SLDA-TC-Gibbs algorithm. The parameter estimation proceeds as follows: the latent topic of each word is sampled, and it is sampled only from the other training texts that share the class label of the text containing the word; once the latent topic of every word is determined, the topic-word, document-topic and topic-class frequencies are counted to compute the text-topic probability distribution, the topic-word probability distribution and the topic-class probability distribution, thereby establishing an exact mapping between topics and classes.
Topic inference and classification of the text to be classified: the text is fed into the trained SLDA-TC text classification model; first a latent topic is sampled for each word of the document, then the topic probability distribution of the text is inferred, and finally its class label is output from the topic distribution of the document and the topic-class distribution of the SLDA-TC model.
The beneficial effects of the present disclosure are:
By constructing and training the SLDA-TC text classification model, the text classification method and system use the text-topic probability distribution, the topic-word probability distribution and the topic-class probability distribution to extract the latent semantic mappings between words and topics, documents and topics, and topics and classes; the number of topics K need only be slightly larger than the number of classes C, which not only improves classification accuracy but also improves time efficiency.
Detailed description of the invention
The accompanying drawings, which form a part of this disclosure, are provided for further understanding of the disclosure; the illustrative embodiments of the disclosure and their descriptions explain the disclosure and do not unduly limit it.
Fig. 1 is a flow chart of an SLDA-TC text classification method of the present disclosure.
Fig. 2 shows the LDA topic model.
Fig. 3 shows the SLDA-TC text classification model.
Figs. 4(a), 4(b) and 4(c) compare the Macro-Precision, Macro-Recall and Macro-F1 of the classification results of SLDA-TC, LDA-TC and SVM on the 20news-rec data set with C=4, K=8.
Figs. 5(a), 5(b) and 5(c) compare the Macro-Precision, Macro-Recall and Macro-F1 of the classification results of SLDA-TC, LDA-TC and SVM on the sogou data set with C=5, K=8.
Figs. 6(a), 6(b) and 6(c) compare the Macro-Precision, Macro-Recall and Macro-F1 of the classification results of SLDA-TC, LDA-TC and SVM on the 20news-sci data set with C=4, K=90.
Figs. 7(a), 7(b) and 7(c) compare the Macro-Precision, Macro-Recall and Macro-F1 of the classification results of SLDA-TC, LDA-TC and SVM on the 20news-talk data set with C=3, K=90.
Specific embodiment
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the present disclosure. Unless otherwise indicated, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It should also be noted that the terminology used herein is only for describing specific embodiments and is not intended to limit the illustrative embodiments of the disclosure. As used herein, the singular forms are intended to include the plural forms as well unless the context clearly indicates otherwise; furthermore, the terms "comprising" and/or "including" indicate the presence of the stated features, steps, operations, devices, components and/or combinations thereof.
Explanation of terms:
Dirichlet distribution: a probability distribution over a set of continuous multivariate quantities; it is the multivariate generalization of the Beta distribution and is frequently used in Bayesian statistics as a prior.
Gibbs sampling: a Markov chain Monte Carlo (MCMC) algorithm used to draw an approximate sample sequence when it is difficult to sample directly from a multivariate probability distribution.
LDA-TC: the method of first extracting a given number of topics with an LDA topic model and then classifying according to the LDA model.
SVM: Support Vector Machine, a common discriminative method; in machine learning it is a supervised learning model, commonly used for pattern recognition, classification and regression analysis.
Macro-Precision: macro-averaged precision.
Macro-Recall: macro-averaged recall.
Macro-F1: macro-averaged F1 value.
Fig. 1 is a flow chart of a text classification method based on a supervised topic model according to the present disclosure.
As shown in Fig. 1, the text classification method of this embodiment comprises:
S110: construct the SLDA-TC text classification model. Unlike the unsupervised LDA model, every document of the training document collection of the SLDA-TC text classification model carries a class label; the parameters to be estimated in the SLDA-TC text classification model include the text-topic probability distribution, the topic-word probability distribution and the topic-class probability distribution.
As shown in Fig. 2, the text collection of the LDA topic model is W = {w_1, ..., w_M}, where M is the number of texts and K the number of topics. The model has two parameters, θ_m and φ_k, where θ_m is the topic probability distribution of the m-th text and φ_k the word probability distribution of topic k; w_m is the bag-of-words vector of the m-th text, N_m the length of the m-th text, w_{m,n} the n-th word of the m-th text, and z_{m,n} the topic assigned to w_{m,n}. θ_m and φ_k obey Dirichlet distributions and serve as the parameters of the multinomials that generate topics and words respectively; α and β are the priors of the corresponding Dirichlet distributions. The LDA topic model does not take the class of a document into account.
As shown in Fig. 3, unlike LDA, every text w_m of the training text collection of the SLDA-TC model has an observable class label y_m ∈ [1, C], where C is the number of classes; the class is assumed to obey a multinomial distribution correlated with the text's topic probabilities. w_{m,n} and y_m are observable, while z_{m,n} is a latent topic. Besides the parameters φ_k and θ_m, the present disclosure introduces a new parameter δ_k denoting the class probability distribution of the k-th topic. θ_m, φ_k and δ_k obey Dirichlet distributions and serve as the parameters of the multinomials that generate topics, words and classes respectively; α, β and γ are the priors of the corresponding Dirichlet distributions.
S120: SLDA-TC topic model parameter estimation — train the SLDA-TC text classification model and establish the mapping between topics and classes.
LDA estimates the parameters φ and θ with the Gibbs sampling algorithm; the SLDA-TC model must estimate φ, θ and δ. On the basis of Gibbs sampling, the present disclosure proposes the SLDA-TC-Gibbs algorithm: instead of computing θ_m, φ_k and δ_k directly, it first samples the latent topic of every word; once the latent topic of every word is determined, θ_m, φ_k and δ_k can be computed by counting frequencies.
At each step the Gibbs sampling algorithm samples the latent topic assignment z_i = k of a word w_i while keeping the other components z_{¬i} of z fixed, computed as formula (1):

p(z_i = k | z_{¬i}, w) ∝ (n_{k,¬i}^{(t)} + β_t) / Σ_{v=1}^{V} (n_{k,¬i}^{(v)} + β_v) · (n_{m,¬i}^{(k)} + α_k) / Σ_{z=1}^{K} (n_{m,¬i}^{(z)} + α_z)   (1)

At each step the SLDA-TC-Gibbs algorithm samples the latent topic assignment z_i = k of a word w_i = t while keeping the other word components z_{¬i} of z fixed and also keeping y_{¬m} fixed, computed as formula (2):

p(z_i = k | z_{¬i}, w, y_{¬m}) ∝ (n_{k,¬i}^{(t)} + β_t) / Σ_{v=1}^{V} (n_{k,¬i}^{(v)} + β_v) · (n_{m,¬i}^{(k)} + α_k) / Σ_{z=1}^{K} (n_{m,¬i}^{(z)} + α_z) · (n_{k,¬m}^{(j)} + γ_j) / Σ_{c=1}^{C} (n_{k,¬m}^{(c)} + γ_c)   (2)

Here w = (w_1, ..., w_M) are the word vectors of all documents, z = (z_1, ..., z_V) is the topic vector (V is the dictionary length), and y = (y_1, ..., y_M) is the class vector. The i-th word is w_i = t, z_i denotes the topic variable of the i-th word, and z_{¬i} denotes z with its i-th component removed. The document containing the i-th word w_i = t is the m-th document w_m, its class label is y_m = j, j ∈ [1, C], and y_{¬m} denotes y with its m-th component removed. n_k^{(v)} is the number of times topic k is assigned to word v, and β_v the Dirichlet prior of word v. n_m^{(z)} is the number of times document m is assigned topic z, and α_z the Dirichlet prior of topic z. n_{k,¬i}^{(t)} is the number of times topic k is assigned to word t with the i-th position (i.e. the word w_i = t) excluded. n_{k,¬m}^{(j)} is the number of documents of class j assigned topic k with the m-th document (whose class label is y_m = j) excluded, and γ_j the Dirichlet prior of class j.
Compared with (1), formula (2) introduces y_{¬m} on the left side and one extra factor on the right side, which constrains the sampling of the latent topics of the words in the m-th document: latent topics are sampled only from the other training documents with the same class as document m. Because documents of the same class have similar topic distributions, formula (2) is more reasonable when class labels are available. The present disclosure generates only one SLDA-TC model and needs to estimate only one set of parameters φ, θ and δ. The derivation of formula (2) is proved as follows:
Proof: given the training document collection, let w = (w_1, ..., w_M), y = (y_1, ..., y_M), z = (z_1, ..., z_V). The joint distribution of the SLDA-TC probabilistic model is formula (3):

p(w, z, y | α, β, γ) = p(w | z, β) · p(z | α) · p(y | z, γ)   (3)

Integrating out the Dirichlet-distributed parameters φ, θ and δ gives

p(w | z, β) = ∏_{k=1}^{K} Δ(n_k + β) / Δ(β)   (4)
p(z | α) = ∏_{m=1}^{M} Δ(n_m + α) / Δ(α)   (5)
p(y | z, γ) = ∏_{k=1}^{K} Δ(n_k^{(y)} + γ) / Δ(γ)   (6)

where n_k = (n_k^{(1)}, ..., n_k^{(V)}), n_m = (n_m^{(1)}, ..., n_m^{(K)}), n_k^{(y)} = (n_k^{(1)}, ..., n_k^{(C)}), Δ(x) = ∏_i Γ(x_i) / Γ(Σ_i x_i), and Γ(·) is the Gamma function.
According to the SLDA-TC-Gibbs algorithm, the topic of the word w_i = t of the m-th document is sampled with z_{¬i} and y_{¬m} held constant, so

p(z_i = k | z_{¬i}, w, y_{¬m}) = p(w, z, y) / p(w, z_{¬i}, y_{¬m})   (8)

Substituting (3)-(6) into (8) and cancelling all factors that do not involve position i or document m, the ratios of the Δ terms reduce, via Γ(x+1) = x·Γ(x), to the three count ratios of formula (2). Formula (2) is thus proved.
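The constrained sampling step of formula (2) can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the names `n_kt`, `n_mk` and `n_kj` for the count tables are hypothetical, and symmetric scalar priors are assumed. The counts passed in must already exclude the current word (and, for `n_kj`, the current document), as required by the ¬i and ¬m terms.

```python
import random

def slda_tc_conditional(t, m, j, n_kt, n_mk, n_kj, alpha, beta, gamma):
    """Unnormalized p(z_i = k | z_-i, w, y_-m) of formula (2) for every topic k.
    n_kt: topic-word counts, n_mk: document-topic counts, n_kj: topic-class counts."""
    K, V, C = len(n_kt), len(n_kt[0]), len(n_kj[0])
    probs = []
    for k in range(K):
        word_term = (n_kt[k][t] + beta) / (sum(n_kt[k]) + V * beta)
        doc_term = (n_mk[m][k] + alpha) / (sum(n_mk[m]) + K * alpha)
        cls_term = (n_kj[k][j] + gamma) / (sum(n_kj[k]) + C * gamma)
        probs.append(word_term * doc_term * cls_term)
    return probs

def sample_topic(probs, rng):
    """Draw a topic index with probability proportional to the weights."""
    r = rng.random() * sum(probs)
    acc = 0.0
    for k, p in enumerate(probs):
        acc += p
        if r <= acc:
            return k
    return len(probs) - 1
```

The third factor is what distinguishes SLDA-TC from plain LDA sampling: a topic that has rarely been assigned to documents of class j receives a small `cls_term` and is therefore unlikely to be drawn for a word in a class-j document.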
After the topic label z of every word w has been obtained, the parameters φ_k, θ_m and δ_k of the SLDA-TC model are computed as follows:

φ_{k,t} = (n_k^{(t)} + β_t) / Σ_{v=1}^{V} (n_k^{(v)} + β_v)   (10)
θ_{m,k} = (n_m^{(k)} + α_k) / Σ_{z=1}^{K} (n_m^{(z)} + α_z)   (11)
δ_{k,j} = (n_k^{(j)} + γ_j) / Σ_{c=1}^{C} (n_k^{(c)} + γ_c)   (12)

where φ_{k,t} is the probability that topic k generates word t, θ_{m,k} the probability that document m is assigned topic k, and δ_{k,j} the probability that topic k belongs to class j; n_k^{(t)} is the number of times topic k is assigned to word t, n_m^{(k)} the number of times document m is assigned topic k, and n_k^{(j)} the number of documents of class j assigned topic k, with k = 1..K, t = 1..V, m = 1..M.
The parameters φ, θ and δ of the SLDA-TC model are thus estimated.
Description of the SLDA-TC-Gibbs algorithm:
Algorithm:
Input: document vectors w, class labels y, hyperparameters α, β, γ, number of topics K, number of iterations T.
Output: topic assignment z, parameters φ, θ and δ.
Initialization: the count variables are initialized to 0.
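The training loop described above can be sketched end to end as follows. This is an illustrative reconstruction under stated assumptions, not the patent's code: the names are hypothetical, symmetric scalar priors are used, and for simplicity the topic-class counts `n_kj` are maintained per word occurrence rather than per document, which deviates slightly from the document-level count used in the disclosure.

```python
import random

def train_slda_tc(docs, labels, K, V, C, alpha=0.01, beta=0.01, gamma=0.01,
                  iters=10, seed=0):
    """Sketch of SLDA-TC-Gibbs training. docs: lists of word ids per document,
    labels: class id per document. Returns (phi, theta, delta)."""
    rng = random.Random(seed)
    M = len(docs)
    n_kt = [[0] * V for _ in range(K)]   # topic-word counts
    n_mk = [[0] * K for _ in range(M)]   # document-topic counts
    n_kj = [[0] * C for _ in range(K)]   # topic-class counts (per word here)
    z = []
    for m, doc in enumerate(docs):       # random initial topic assignments
        zm = []
        for t in doc:
            k = rng.randrange(K)
            zm.append(k)
            n_kt[k][t] += 1; n_mk[m][k] += 1; n_kj[k][labels[m]] += 1
        z.append(zm)
    for _ in range(iters):
        for m, doc in enumerate(docs):
            j = labels[m]
            for i, t in enumerate(doc):
                k = z[m][i]              # remove the current assignment
                n_kt[k][t] -= 1; n_mk[m][k] -= 1; n_kj[k][j] -= 1
                probs = []               # conditional of formula (2)
                for kk in range(K):
                    probs.append(
                        (n_kt[kk][t] + beta) / (sum(n_kt[kk]) + V * beta)
                        * (n_mk[m][kk] + alpha) / (sum(n_mk[m]) + K * alpha)
                        * (n_kj[kk][j] + gamma) / (sum(n_kj[kk]) + C * gamma))
                r = rng.random() * sum(probs)
                acc = 0.0
                for kk, p in enumerate(probs):
                    acc += p
                    if r <= acc:
                        k = kk
                        break
                z[m][i] = k              # record the new assignment
                n_kt[k][t] += 1; n_mk[m][k] += 1; n_kj[k][j] += 1
    # closed-form parameter estimates for phi, theta and delta from the counts
    phi = [[(n_kt[k][t] + beta) / (sum(n_kt[k]) + V * beta)
            for t in range(V)] for k in range(K)]
    theta = [[(n_mk[m][k] + alpha) / (sum(n_mk[m]) + K * alpha)
              for k in range(K)] for m in range(M)]
    delta = [[(n_kj[k][j] + gamma) / (sum(n_kj[k]) + C * gamma)
              for j in range(C)] for k in range(K)]
    return phi, theta, delta
```

Each row of `phi`, `theta` and `delta` is a proper probability distribution (it sums to 1 by construction), matching the roles of φ, θ and δ described above.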
After training has generated the SLDA-TC model, the latent topic of each word of a new document d is inferred with formula (13):

p(z̃_i = k | z̃_{¬i}, w̃) ∝ (n_k^{(t)} + ñ_{k,¬i}^{(t)} + β_t) / Σ_{v=1}^{V} (n_k^{(v)} + ñ_{k,¬i}^{(v)} + β_v) · (ñ_{d,¬i}^{(k)} + α_k) / Σ_{z=1}^{K} (ñ_{d,¬i}^{(z)} + α_z)   (13)

where w̃ is the word vector of the new document d, z̃ its topic vector, ñ_{k,¬i}^{(t)} the number of times topic k is assigned to word t in the new document with its i-th item (i.e. the i-th word) excluded, and ñ_{d,¬i}^{(k)} the number of times the new document d is assigned topic k with its i-th item excluded; the other symbols are as in formula (2).
Once the latent topic label of every word of the new document d has been computed with formula (13), the probability that d belongs to each topic is computed with formula (14):

θ̃_{d,k} = (ñ_d^{(k)} + α_k) / Σ_{z=1}^{K} (ñ_d^{(z)} + α_z)   (14)

This yields the topic probability distribution θ̃_d of the new document d.
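The inference step of formulas (13) and (14) can be sketched as follows, again with hypothetical names and symmetric scalar priors; `n_kt` is the topic-word count table of the trained model, which stays fixed while only the new document's own counts (the ñ counts) are resampled.

```python
import random

def infer_new_doc(new_doc, n_kt, alpha=0.01, beta=0.01, iters=20, seed=0):
    """Sample latent topics for a new document per formula (13) and return
    its topic probability distribution per formula (14)."""
    rng = random.Random(seed)
    K, V = len(n_kt), len(n_kt[0])
    nn_kt = [[0] * V for _ in range(K)]  # the new document's topic-word counts
    nn_dk = [0] * K                      # the new document's topic counts
    z = []
    for t in new_doc:                    # random initialization
        k = rng.randrange(K)
        z.append(k); nn_kt[k][t] += 1; nn_dk[k] += 1
    for _ in range(iters):
        for i, t in enumerate(new_doc):
            k = z[i]                     # remove the current assignment
            nn_kt[k][t] -= 1; nn_dk[k] -= 1
            probs = []                   # conditional of formula (13)
            for kk in range(K):
                word_term = ((n_kt[kk][t] + nn_kt[kk][t] + beta)
                             / (sum(n_kt[kk]) + sum(nn_kt[kk]) + V * beta))
                doc_term = (nn_dk[kk] + alpha) / (sum(nn_dk) + K * alpha)
                probs.append(word_term * doc_term)
            r = rng.random() * sum(probs)
            acc = 0.0
            for kk, p in enumerate(probs):
                acc += p
                if r <= acc:
                    k = kk
                    break
            z[i] = k
            nn_kt[k][t] += 1; nn_dk[k] += 1
    denom = sum(nn_dk) + K * alpha       # formula (14)
    return [(nn_dk[k] + alpha) / denom for k in range(K)]
```

With a trained model whose topic 0 strongly favors word 0, a new document made of word 0 is pulled toward topic 0.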
S130: feed the text to be classified into the trained SLDA-TC text classification model, infer the topics of the text, and then predict its class.
Given the trained SLDA-TC model, let w̃ be the word vector of the new document d, z̃ its topic vector, and ỹ_d the class prediction for the new document d. The class of the new document is computed as

ỹ_d = argmax_{j ∈ [1, C]} p(y = j | w̃, z̃)   (15)

It is assumed that the test sample set and the training sample set follow the same distribution, i.e. they are consistent in the latent topic-class distribution; therefore p(y | z̃) can be replaced by p(y | z), which is given by the parameter δ of the SLDA-TC model, and θ̃_d is the topic probability distribution of the new document d. From formulas (12), (14) and (15):

ỹ_d = argmax_{j ∈ [1, C]} Σ_{k=1}^{K} θ̃_{d,k} · δ_{k,j}   (16)
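The classification rule above — the argmax over classes of Σ_k θ̃_{d,k} · δ_{k,j} — amounts to a matrix-vector product followed by an argmax; a minimal sketch (names are illustrative):

```python
def classify(theta_d, delta):
    """Predict the class of a document from its topic distribution theta_d
    and the topic-class distribution delta: argmax_j sum_k theta_dk * delta_kj."""
    K, C = len(delta), len(delta[0])
    scores = [sum(theta_d[k] * delta[k][j] for k in range(K)) for j in range(C)]
    return scores.index(max(scores))
```

A document whose topic mass sits on a topic that maps almost entirely to one class is assigned that class.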
The similarity between topics is assessed with the JS divergence (Jensen-Shannon divergence), also called the JS distance, a variant of the KL divergence (Kullback-Leibler divergence), computed as formula (17). Unlike the KL divergence, the JS divergence is symmetric and satisfies the triangle inequality.

JS(p_i || p_j) = 0.5 · KL(p_i || (p_i + p_j)/2) + 0.5 · KL(p_j || (p_i + p_j)/2)   (17)

where p_i and p_j are the word probability distributions of topic i and topic j. The JS divergence takes values in [0, 1]; 0 means that p_i and p_j are identically distributed, and 1 means that they are completely different.
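Formula (17) can be implemented directly; using base-2 logarithms keeps the value range in [0, 1] as stated above (an implementation choice, since the formula itself does not fix the logarithm base):

```python
import math

def kl(p, q):
    """KL divergence with base-2 logarithm (terms with p_i = 0 contribute 0)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Formula (17): JS divergence between two word probability distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Two identical topic-word distributions give 0, and two distributions with disjoint support give 1, matching the interpretation used in the tables below.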
The semantic relevance between a topic and a class is measured by the parameter δ of the SLDA-TC model, computed as shown in formula (12); δ_{k,j} is the probability that topic k belongs to class j.
In one or more embodiments, the SLDA-TC model used for classification is generated by iterative training; after the iterations finish, the similarity between topics is assessed with the JS divergence, and the semantic relevance between a topic and a class is assessed with the topic-class distribution parameter of SLDA-TC. The classification results of the SLDA-TC model are evaluated with the macro-averaged precision (Macro-Precision), the macro-averaged recall (Macro-Recall) and the macro-averaged F1 value (Macro-F1).
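The three macro-averaged indicators can be computed as follows; this is a standard implementation of macro averaging (per-class scores computed first, then averaged with equal class weights), not code from the patent:

```python
def macro_scores(y_true, y_pred, classes):
    """Macro-averaged precision, recall and F1 over the given classes."""
    ps, rs, fs = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        ps.append(prec); rs.append(rec); fs.append(f1)
    n = len(classes)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n
```

Because every class contributes equally regardless of its size, macro averaging penalizes a classifier that does well only on the large classes.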
Experimental verification and analysis:
Three data subsets of the English 20newsgroup data set — rec, sci and talk — and a subset of the Sogou Chinese corpus containing the five classes IT, military, education, tourism and finance were chosen; in each data subset the ratio of training to test samples is 8:2. The data subsets are described in Table 1. The Chinese texts were segmented with jieba, English stemming used nltk.stem, and after stop-word removal, feature selection with TF-IDF retained 60% of the feature words in the experiments.
Table 1. Description of the data sets
Data set | Training texts | Classes | Features
20news-rec | 3979 | 4 | 47067
20news-sci | 2373 | 4 | 57943
20news-talk | 1676 | 3 | 40945
sogou | 2445 | 5 | 70819
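The TF-IDF feature selection step described above (retaining 60% of the feature words) can be sketched with the standard library alone. Scoring each word by its maximum TF-IDF over the documents and using a smoothed IDF are illustrative choices, not taken from the patent:

```python
import math
from collections import Counter

def tfidf_select(docs, keep_ratio=0.6):
    """Score each word by its maximum TF-IDF over the documents and keep
    the top keep_ratio fraction of the vocabulary."""
    n_docs = len(docs)
    df = Counter()                       # document frequency of each word
    for doc in docs:
        df.update(set(doc))
    scores = {}
    for doc in docs:
        tf = Counter(doc)
        for w, c in tf.items():
            s = (c / len(doc)) * math.log((1 + n_docs) / (1 + df[w]))
            scores[w] = max(scores.get(w, 0.0), s)
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    kept = ranked[:max(1, int(len(ranked) * keep_ratio))]
    return set(kept)
```

A word that appears in every document gets an IDF of 0 and is dropped first, which is the intended effect of TF-IDF selection after stop-word removal.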
To verify the validity of the proposed method, the three algorithms SLDA-TC, LDA-TC and SVM were compared experimentally. SLDA-TC is the algorithm proposed in this disclosure; LDA-TC classifies directly on a traditional LDA model; SVM is an SVM classifier that uses the K topics of an LDA model as features.
The goal of topic inference in the SLDA-TC model is to establish the mapping between topics and classes; the number of topics K is related to the number of classes C of the labeled training set, and the experiments show that K only needs to take a value slightly larger than C for SLDA-TC to reach good classification accuracy. In the experiments we analyzed the JS distances between the topics generated by the SLDA-TC topic model, the probability distributions of the top 10 feature words of each topic, and the relevance between classes and topics. Tables 2-4 describe the experimental results of the SLDA-TC topic model generated on the sogou data subset with C=5, K=8, and Tables 5-7 describe the experimental results on the 20news-talk data subset; α, β and γ of the SLDA-TC topic model were all set to 0.01.
Table 2. JS divergence between topics (SLDA-TC, sogou, C=5, K=8)
Table 3. Relevance between topics and classes (SLDA-TC, sogou, C=5, K=8)
Table 4. Probability distributions of the top 10 words of each topic (SLDA-TC, sogou, C=5, K=8)
As shown in Table 2, the JS divergence between topics 2, 5 and 7 is 0, indicating that they are identically distributed topics; Table 4 confirms that the probability distributions of the top 10 feature words of these 3 topics are also identical. The JS divergence between the other 5 topics lies between 0.45 and 0.58, indicating 5 distinct topics. Tables 3 and 4 show that topic 0 maps to class "IT", topic 1 to "tourism", topic 3 to "finance", topic 4 to "military" and topic 6 to "education", each with a relevance above 99%, while topics 2, 5 and 7 are unrelated to any class and are called "useless" topics.
Table 5. JS divergence between topics (SLDA-TC, 20news-talk, C=3, K=6)
Table 6. Relevance between classes and topics (SLDA-TC, 20news-talk, C=3, K=6)
Table 7. Probability distributions of the top 10 words of each topic (SLDA-TC, 20news-talk, C=3, K=6)
Tables 5-7 show the experimental results on 20news-talk: topics 1, 3 and 4 are identically distributed "useless" topics, while the significant topic 0 corresponds to class talk.politics.guns, topic 2 to talk.politics.misc, and topic 5 to talk.politics.mideast.
Many experiments found that when K - C ≥ 2, K - C "useless" topics with pairwise JS divergence 0 are generated; hence K only needs to be chosen slightly larger than C, so the K of the SLDA-TC topic model is easy to determine. The SLDA-TC topic model can filter out the K - C "useless" topics and establish the exact mapping between topics and classes. Moreover, since K is only slightly larger than C and far below the K of LDA, the training time of the model can be significantly reduced.
Figs. 4(a)-7(c) compare the classification results of SLDA-TC, LDA-TC and SVM on the sogou Chinese subset and the three 20newsgroup subsets, with TF-IDF feature selection retaining 60% of the feature words and the SLDA and LDA models generated iteratively for different values of the number of topics K. In Figs. 4(a)-7(c) the abscissa is the number of iterations.
(1) For SLDA-TC, the number of topics K need only be slightly larger than the number of classes C, and the classification results are better than those of LDA-TC and SVM.
As shown in Figs. 4(a)-4(c), on the 20news-rec data set with C=4, K=8, the Macro-Precision, Macro-Recall and Macro-F1 of SLDA-TC are 95.10%, 94.99% and 94.98%, while those of LDA-TC are 63.76%, 60.91% and 60.33% and those of SVM 68.82%, 68.33% and 68.08%. As K increases, LDA-TC and SVM improve: at K=80, LDA-TC reaches at most 71.85%, 71.38% and 71.41% and SVM at most 83.90%, 83.70% and 83.62%, still below SLDA-TC. In addition, the training time of the topic model at K=80 is far higher than that of the SLDA-TC model at K=6.
As shown in Figs. 5(a)-5(c), on the sogou data set with C=5, K=8, the Macro-Precision, Macro-Recall and Macro-F1 of SLDA-TC are 92.80%, 92.73% and 92.70%, while those of LDA-TC are 72.67%, 68.89% and 67.48% and those of SVM 80.69%, 80.40% and 80.28%. As K increases, the SVM indicators rise steadily; when K exceeds 60, the three indicators reach 89.26%, 89.95% and 89.24%, still below the 92.80%, 92.73% and 92.70% of SLDA-TC at K=8, and the SVM with K=60 (using LDA topics as features) pays a much higher time cost than SLDA-TC at K=8.
(2) A larger K is not better for SLDA-TC: when K is very large, the three classification indicators of SLDA-TC decline instead, which again shows that K need only be slightly larger than C.
As shown in Figs. 6(a)-6(c), on the 20news-sci data set with C=4 and K=90, the classification results of SLDA-TC are in fact very poor. The reason is that with C=4, only 4 of the 90 generated topics are related to classes; the remaining 86 are "useless" topics that do not help classification and instead interfere with it, degrading the results. The same holds on the 20news-talk data set with C=3, K=90, as shown in Figs. 7(a)-7(c). A large number of experiments show that with K taking a value slightly larger than C, the SLDA-TC algorithm obtains very high classification accuracy.
Table 8 compares the time performance and classification results of SLDA-TC with those of LDA-TC and SVM on the different data sets.
Table 8. Comparison of SLDA-TC with LDA-TC and SVM in time performance and classification results
The generation time of a topic model is proportional to the number of topics K: the larger K, the higher the time cost. For the SLDA-TC model, K need only be slightly larger than the number of classes C to obtain very good classification results, whereas LDA-TC and the SVM algorithm that uses LDA topics as features need K in the tens or even hundreds to obtain decent classification results, as can also be seen from the experimental results shown in Figs. 4(a)-7(c).
As shown in Table 8, on 20news-rec the ratio of the model-generation times of LDA and SLDA-TC is 4.86, i.e. SLDA-TC is 4.86 times faster than LDA, because their K values are 200 and 8 respectively. On the 20news-sci, 20news-talk and sogou data sets, SLDA-TC is 4.78, 5.16 and 4.90 times faster than LDA respectively. Meanwhile, in terms of the Macro-Precision, Macro-Recall and Macro-F1 indicators, the SLDA-TC algorithm clearly outperforms the LDA-TC and SVM algorithms: on the 4 data sets, SLDA-TC is 3.10%-9.30% higher than SVM and 7.10%-34.08% higher than LDA-TC.
In conclusion, the number of topics K of the SLDA-TC model need only take a value slightly larger than the number of categories C; the model can identify topics closely related to the categories, and in both classification precision and time performance it is significantly better than LDA-TC and the SVM algorithm with LDA topics as features.
Addressing the problems of the LDA model in text classification, the present disclosure proposes an SLDA-TC text classification model based on a supervised topic model, together with an SLDA-TC-Gibbs parameter estimation algorithm: each time the hidden topic variable z_i=k of a word w_i=t is sampled, the other components z_¬i of z are kept unchanged (as are the remaining observed variables), and the hidden topic is sampled only from the other training documents whose class label is the same as that of the document containing the word, because documents of the same category have similar topic distributions; a theoretical proof of this is given. The SLDA-TC model introduces a topic-class probability distribution parameter δ, and through the φ, θ and δ probability distributions extracts and maps the latent semantic information between words and topics, documents and topics, and topics and categories. In addition, the number of topics K of SLDA-TC need only take a value slightly larger than the number of categories. Experiments show that the SLDA-TC model can significantly improve classification precision and time efficiency.
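As a rough, non-authoritative illustration of the restricted sampling step described above (the count-array layout, the hyperparameters alpha, beta, gamma, and the exact form of the conditional are assumptions for illustration; the disclosure gives the formal derivation elsewhere), a single SLDA-TC-Gibbs draw might look like:

```python
import numpy as np

def sample_topic(t, label, n_tw_same, n_dt_d, n_tc, alpha, beta, gamma, rng=None):
    """Draw a hidden topic for word t in a document of class `label`.

    n_tw_same : topic-word counts taken ONLY over training documents whose
                class label equals `label`, shape (K, V) -- this restriction
                is the core of the SLDA-TC-Gibbs step described above
    n_dt_d    : topic counts of the current document, shape (K,)
    n_tc      : topic-class counts, shape (K, C)
    Counts are assumed to already exclude the word's current assignment.
    """
    rng = rng or np.random.default_rng()
    K, V = n_tw_same.shape
    C = n_tc.shape[1]
    # p(z=k | ...) proportional to (doc-topic) * (topic-word over same-label
    # docs) * (topic-class), each Dirichlet-smoothed
    p_dt = n_dt_d + alpha
    p_tw = (n_tw_same[:, t] + beta) / (n_tw_same.sum(axis=1) + V * beta)
    p_tc = (n_tc[:, label] + gamma) / (n_tc.sum(axis=1) + C * gamma)
    p = p_dt * p_tw * p_tc
    return int(rng.choice(K, p=p / p.sum()))
```

The key difference from a plain LDA Gibbs sampler is that `n_tw_same` counts topic-word co-occurrences only over training documents sharing the word's class label, so topics are pulled toward class-specific vocabulary.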
One or more embodiments of the present disclosure also provide a text classification system, including a text input device, a controller and a display device; the controller includes a memory and a processor, the memory stores a computer program, and when the program is executed by the processor the following steps, as shown in Fig. 1, can be realized:
(1) Construct the SLDA-TC text classification model. Unlike the unsupervised LDA model, each document of the training document set of the SLDA-TC text classification model has a class label. The parameters to be estimated in the SLDA-TC text classification model include not only the text-topic probability distribution and the topic-word probability distribution, but also the topic-class probability distribution; the text-topic probability distribution, the topic-word probability distribution and the topic-class probability distribution all obey the Dirichlet distribution.
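Each of the three parameter families is a draw from a Dirichlet prior. A minimal numerical illustration (the dimensions and hyperparameter values here are arbitrary toy assumptions, not values from the disclosure):

```python
import numpy as np

rng = np.random.default_rng(42)
K, V, C = 5, 100, 3                  # topics, vocabulary size, categories (toy sizes)
alpha, beta, gamma = 0.1, 0.01, 0.1  # illustrative Dirichlet hyperparameters

theta = rng.dirichlet([alpha] * K)           # text-topic distribution of one document
phi = rng.dirichlet([beta] * V, size=K)      # topic-word distributions, shape (K, V)
delta = rng.dirichlet([gamma] * C, size=K)   # topic-class distributions, shape (K, C)

# every draw is a valid probability distribution (rows sum to 1)
assert np.isclose(theta.sum(), 1.0)
```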
(2) Train the SLDA-TC text classification model and perform SLDA-TC model parameter estimation.
Specifically, during training of the SLDA-TC text classification model, the number of topics K is first set, taking a value slightly larger than the number of categories C. Then the hidden topic of each word is sampled according to the SLDA-TC-Gibbs algorithm, and the hidden topic is sampled only from the other training texts whose class label is the same as that of the text containing the word. After the hidden topic of each word is determined, the text-topic probability distribution, the topic-word probability distribution and the topic-class probability distribution are calculated by counting the topic-word, document-topic and topic-class frequencies, and an accurate mapping between topics and categories is established.
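Once every word's hidden topic is fixed, the three distributions follow from the frequency counts by Dirichlet smoothing and row normalization. A sketch under assumed array shapes (the symbols theta, phi and delta for the text-topic, topic-word and topic-class distributions follow common topic-model notation, not wording from the disclosure):

```python
import numpy as np

def estimate_parameters(n_dt, n_tw, n_tc, alpha, beta, gamma):
    """Turn raw frequency counts into the three SLDA-TC distributions.

    n_dt : document-topic counts, shape (D, K)
    n_tw : topic-word counts,     shape (K, V)
    n_tc : topic-class counts,    shape (K, C)
    Returns (theta, phi, delta): the text-topic, topic-word and topic-class
    probability distributions, Dirichlet-smoothed and row-normalized.
    """
    theta = (n_dt + alpha) / (n_dt + alpha).sum(axis=1, keepdims=True)
    phi   = (n_tw + beta)  / (n_tw + beta).sum(axis=1, keepdims=True)
    delta = (n_tc + gamma) / (n_tc + gamma).sum(axis=1, keepdims=True)
    return theta, phi, delta
```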
(3) Infer the topic of the text to be classified and classify it. The text to be classified is input into the trained SLDA-TC text classification model; first, hidden topic sampling is performed on each word of the document to be classified; then the topic probability distribution of the text is inferred; finally, the class label of the text is output according to the topic distribution of the document to be classified and the topic-class distribution of the SLDA-TC model.
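The final classification step can be read as a matrix-vector product: the inferred document-topic distribution is combined with the model's topic-class distribution and the highest-scoring class is output. A sketch (assuming `theta_d` has already been inferred by sampling over the test document):

```python
import numpy as np

def predict_label(theta_d, delta):
    """Predict the class of a test document.

    theta_d : inferred topic distribution of the document, shape (K,)
    delta   : topic-class distribution of the trained model, shape (K, C)
    The predicted label maximizes p(c|d) = sum_k theta_d[k] * delta[k, c].
    """
    class_scores = theta_d @ delta  # shape (C,)
    return int(np.argmax(class_scores))
```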
(4) Assess the SLDA-TC model and the classification results. For the SLDA-TC model generated by multiple training iterations, the similarity between topics is assessed by the JS divergence, and the semantic relevance between topics and categories is assessed by the topic-class distribution parameter of SLDA-TC; the text classification results of the SLDA-TC model are assessed by the macro-averaged classification precision (Macro-Precision), macro-averaged recall (Macro-Recall) and macro-averaged F1 value (Macro-F1) indicators.
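Both evaluation tools named here are standard; a compact sketch of the JS divergence between two topic-word distributions and of the macro-averaged indicators (this version derives Macro-F1 from the macro-averaged precision and recall, which is one common convention):

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two probability distributions."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))  # KL divergence in nats
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def macro_scores(y_true, y_pred, num_classes):
    """Macro-averaged precision, recall and F1 over all classes."""
    precisions, recalls = [], []
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    mp, mr = np.mean(precisions), np.mean(recalls)
    mf1 = 2 * mp * mr / (mp + mr) if mp + mr else 0.0
    return mp, mr, mf1
```

A JS divergence near zero flags two topics as near-duplicates; well-separated topics approach the maximum of ln 2.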
By constructing and training the SLDA-TC text classification model, the text classification method and system of the present disclosure use the text-topic probability distribution, the topic-word probability distribution and the topic-class probability distribution to extract and map the latent semantic information between words and topics, documents and topics, and topics and categories, improving text classification precision and time efficiency.
Those skilled in the art should understand that embodiments of the present disclosure may be provided as a method, a system, or a computer program product. Therefore, the present disclosure may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present disclosure may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage and optical storage) containing computer-usable program code.
The present disclosure is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to the embodiments. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be realized by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce a device for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of guiding a computer or another programmable data processing device to work in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, the instruction device realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, such that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Those of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments can be completed by instructing the relevant hardware through a computer program; the program can be stored in a computer-readable storage medium, and when executed may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.
Although the specific embodiments of the present disclosure have been described above in conjunction with the accompanying drawings, they are not a limitation of the protection scope of the present disclosure. Those skilled in the art should understand that, on the basis of the technical solution of the present disclosure, various modifications or variations that can be made without creative labor still fall within the protection scope of the present disclosure.
Claims (8)
1. A text classification method based on a supervised topic model, characterized by comprising:
constructing an SLDA-TC text classification model, wherein each document of the training document set of the SLDA-TC text classification model has a class label, and the parameters to be estimated in the SLDA-TC text classification model include not only a text-topic probability distribution and a topic-word probability distribution but also a topic-class probability distribution;
training the SLDA-TC text classification model and performing SLDA-TC model parameter estimation according to an SLDA-TC-Gibbs algorithm, wherein the process of performing SLDA-TC model parameter estimation according to the SLDA-TC-Gibbs algorithm is: sampling the hidden topic of each word, and sampling the hidden topic only from the other training texts whose class label is the same as that of the text containing the word; after the hidden topic of each word is determined, calculating the text-topic probability distribution, the topic-word probability distribution and the topic-class probability distribution by counting the topic-word, document-topic and topic-class frequencies, thereby establishing an accurate mapping between topics and categories;
inferring the topic of a text to be classified and classifying the text: inputting the text to be classified into the trained SLDA-TC text classification model, first performing hidden topic sampling on each word of the document to be classified, then inferring the topic probability distribution of the text to be classified, and outputting the class label of the text to be classified according to the topic distribution of the document to be classified and the topic-class distribution of the SLDA-TC model.
2. The text classification method based on a supervised topic model according to claim 1, characterized in that the parameters to be estimated in the SLDA-TC text classification model include not only the text-topic probability distribution and the topic-word probability distribution but also the topic-class probability distribution, and the parameters obey the Dirichlet distribution.
3. The text classification method based on a supervised topic model according to claim 1, characterized in that the SLDA-TC model used for text classification is generated through multiple training iterations; when the iterations end, the similarity between topics is assessed by the JS divergence, and the semantic relevance between topics and categories is assessed by the topic-class distribution parameter of SLDA-TC.
4. The text classification method based on a supervised topic model according to claim 3, characterized in that the text classification result evaluation indicators include the macro-averaged classification precision, the macro-averaged recall and the macro-averaged F1 value.
5. A text classification system based on a supervised topic model, comprising a text input device, a controller and a display device, the controller comprising a memory and a processor, characterized in that the memory stores a computer program which, when executed by the processor, realizes the following steps:
constructing an SLDA-TC text classification model, wherein each document of the training document set of the SLDA-TC text classification model has a class label, and the parameters to be estimated in the SLDA-TC text classification model include not only a text-topic probability distribution and a topic-word probability distribution but also a topic-class probability distribution;
training the SLDA-TC text classification model and performing SLDA-TC model parameter estimation according to an SLDA-TC-Gibbs algorithm, wherein the process of performing SLDA-TC model parameter estimation according to the SLDA-TC-Gibbs algorithm is: sampling the hidden topic of each word, and sampling the hidden topic only from the other training texts whose class label is the same as that of the text containing the word; after the hidden topic of each word is determined, calculating the text-topic probability distribution, the topic-word probability distribution and the topic-class probability distribution by counting the topic-word, document-topic and topic-class frequencies, thereby establishing an accurate mapping between topics and categories;
inferring the topic of a text to be classified and classifying the text: inputting the text to be classified into the trained SLDA-TC text classification model, first performing hidden topic sampling on each word of the document to be classified, then inferring the topic probability distribution of the text to be classified, and outputting the class label of the text to be classified according to the topic distribution of the document to be classified and the topic-class distribution of the SLDA-TC model.
6. The text classification system based on a supervised topic model according to claim 5, characterized in that the text-topic probability distribution, the topic-word probability distribution and the topic-class probability distribution obey the Dirichlet distribution.
7. The text classification system based on a supervised topic model according to claim 5, characterized in that the SLDA-TC model used for text classification is generated through multiple training iterations; when the iterations end, the similarity between topics is assessed by the JS divergence, and the semantic relevance between topics and categories is assessed by the topic-class distribution parameter of SLDA-TC.
8. The text classification system based on a supervised topic model according to claim 7, characterized in that the text classification result evaluation indicators include the macro-averaged classification precision, the macro-averaged recall and the macro-averaged F1 value.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811398232.1A CN109408641B (en) | 2018-11-22 | 2018-11-22 | Text classification method and system based on supervised topic model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811398232.1A CN109408641B (en) | 2018-11-22 | 2018-11-22 | Text classification method and system based on supervised topic model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109408641A true CN109408641A (en) | 2019-03-01 |
CN109408641B CN109408641B (en) | 2020-06-02 |
Family
ID=65474659
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811398232.1A Active CN109408641B (en) | 2018-11-22 | 2018-11-22 | Text classification method and system based on supervised topic model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109408641B (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110135592A (en) * | 2019-05-16 | 2019-08-16 | 腾讯科技(深圳)有限公司 | Classifying quality determines method, apparatus, intelligent terminal and storage medium |
CN110321434A (en) * | 2019-06-27 | 2019-10-11 | 厦门美域中央信息科技有限公司 | A kind of file classification method based on word sense disambiguation convolutional neural networks |
CN110569270A (en) * | 2019-08-15 | 2019-12-13 | 中国人民解放军国防科技大学 | bayesian-based LDA topic label calibration method, system and medium |
CN110795564A (en) * | 2019-11-01 | 2020-02-14 | 南京稷图数据科技有限公司 | Text classification method lacking negative cases |
CN110825850A (en) * | 2019-11-07 | 2020-02-21 | 哈尔滨工业大学(深圳) | Natural language theme classification method and device |
CN111368532A (en) * | 2020-03-18 | 2020-07-03 | 昆明理工大学 | Topic word embedding disambiguation method and system based on LDA |
CN111723198A (en) * | 2019-03-18 | 2020-09-29 | 北京京东尚科信息技术有限公司 | Text emotion recognition method and device and storage medium |
CN112733542A (en) * | 2021-01-14 | 2021-04-30 | 北京工业大学 | Theme detection method and device, electronic equipment and storage medium |
CN113032573A (en) * | 2021-04-30 | 2021-06-25 | 《中国学术期刊(光盘版)》电子杂志社有限公司 | Large-scale text classification method and system combining theme semantics and TF-IDF algorithm |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102662960A (en) * | 2012-03-08 | 2012-09-12 | 浙江大学 | On-line supervised theme-modeling and evolution-analyzing method |
CN103810500A (en) * | 2014-02-25 | 2014-05-21 | 北京工业大学 | Place image recognition method based on supervised learning probability topic model |
CN103903164A (en) * | 2014-03-25 | 2014-07-02 | 华南理工大学 | Semi-supervised automatic aspect extraction method and system based on domain information |
CN105069143A (en) * | 2015-08-19 | 2015-11-18 | 百度在线网络技术(北京)有限公司 | Method and device for extracting keywords from document |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102662960A (en) * | 2012-03-08 | 2012-09-12 | 浙江大学 | On-line supervised theme-modeling and evolution-analyzing method |
CN103810500A (en) * | 2014-02-25 | 2014-05-21 | 北京工业大学 | Place image recognition method based on supervised learning probability topic model |
CN103903164A (en) * | 2014-03-25 | 2014-07-02 | 华南理工大学 | Semi-supervised automatic aspect extraction method and system based on domain information |
CN105069143A (en) * | 2015-08-19 | 2015-11-18 | 百度在线网络技术(北京)有限公司 | Method and device for extracting keywords from document |
Non-Patent Citations (4)
Title |
---|
BLEI D M: "Supervised topic models", 《ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS》 * |
SHIBIN ZHOU: "Text Categorization Based on Topic Model", International Journal of Computational Intelligence Systems * |
LI WENBO: "A New Text Classification Algorithm Based on the Labeled-LDA Model", Chinese Journal of Computers * |
WANG DANDAN: "Text Classification Based on Macro-Feature Fusion", Journal of Chinese Information Processing * |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111723198B (en) * | 2019-03-18 | 2023-09-01 | 北京汇钧科技有限公司 | Text emotion recognition method, device and storage medium |
CN111723198A (en) * | 2019-03-18 | 2020-09-29 | 北京京东尚科信息技术有限公司 | Text emotion recognition method and device and storage medium |
CN110135592B (en) * | 2019-05-16 | 2023-09-19 | 腾讯科技(深圳)有限公司 | Classification effect determining method and device, intelligent terminal and storage medium |
CN110135592A (en) * | 2019-05-16 | 2019-08-16 | 腾讯科技(深圳)有限公司 | Classifying quality determines method, apparatus, intelligent terminal and storage medium |
CN110321434A (en) * | 2019-06-27 | 2019-10-11 | 厦门美域中央信息科技有限公司 | A kind of file classification method based on word sense disambiguation convolutional neural networks |
CN110569270A (en) * | 2019-08-15 | 2019-12-13 | 中国人民解放军国防科技大学 | bayesian-based LDA topic label calibration method, system and medium |
CN110569270B (en) * | 2019-08-15 | 2022-07-05 | 中国人民解放军国防科技大学 | Bayesian-based LDA topic label calibration method, system and medium |
CN110795564B (en) * | 2019-11-01 | 2022-02-22 | 南京稷图数据科技有限公司 | Text classification method lacking negative cases |
CN110795564A (en) * | 2019-11-01 | 2020-02-14 | 南京稷图数据科技有限公司 | Text classification method lacking negative cases |
CN110825850B (en) * | 2019-11-07 | 2022-07-08 | 哈尔滨工业大学(深圳) | Natural language theme classification method and device |
CN110825850A (en) * | 2019-11-07 | 2020-02-21 | 哈尔滨工业大学(深圳) | Natural language theme classification method and device |
CN111368532B (en) * | 2020-03-18 | 2022-12-09 | 昆明理工大学 | Topic word embedding disambiguation method and system based on LDA |
CN111368532A (en) * | 2020-03-18 | 2020-07-03 | 昆明理工大学 | Topic word embedding disambiguation method and system based on LDA |
CN112733542B (en) * | 2021-01-14 | 2022-02-08 | 北京工业大学 | Theme detection method and device, electronic equipment and storage medium |
CN112733542A (en) * | 2021-01-14 | 2021-04-30 | 北京工业大学 | Theme detection method and device, electronic equipment and storage medium |
CN113032573A (en) * | 2021-04-30 | 2021-06-25 | 《中国学术期刊(光盘版)》电子杂志社有限公司 | Large-scale text classification method and system combining theme semantics and TF-IDF algorithm |
CN113032573B (en) * | 2021-04-30 | 2024-01-23 | 同方知网数字出版技术股份有限公司 | Large-scale text classification method and system combining topic semantics and TF-IDF algorithm |
Also Published As
Publication number | Publication date |
---|---|
CN109408641B (en) | 2020-06-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109408641A (en) | Text classification method and system based on supervised topic model | |
Muzellec et al. | Generalizing point embeddings using the wasserstein space of elliptical distributions | |
Qaisar | Sentiment analysis of IMDb movie reviews using long short-term memory | |
CN108363804B (en) | Local model weighted fusion Top-N movie recommendation method based on user clustering | |
Clinchant et al. | Aggregating continuous word embeddings for information retrieval | |
CN108959305A (en) | A kind of event extraction method and system based on internet big data | |
Kim et al. | Beyond sentiment: The manifold of human emotions | |
Mikawa et al. | A proposal of extended cosine measure for distance metric learning in text classification | |
CN111209402A (en) | Text classification method and system integrating transfer learning and topic model | |
Budhiraja et al. | A supervised learning approach for heading detection | |
CN109783633A (en) | Data analysis service procedural model recommended method | |
Pamuji | Performance of the K-Nearest Neighbors Method on Analysis of Social Media Sentiment | |
Park et al. | Phrase embedding and clustering for sub-feature extraction from online data | |
Steuber et al. | Topic modeling of short texts using anchor words | |
Hossny et al. | Enhancing keyword correlation for event detection in social networks using SVD and k-means: Twitter case study | |
CN110069558A (en) | Data analysing method and terminal device based on deep learning | |
CN111625578B (en) | Feature extraction method suitable for time series data in cultural science and technology fusion field | |
Dachapally et al. | In-depth question classification using convolutional neural networks | |
CN112463974A (en) | Method and device for establishing knowledge graph | |
WO2022183019A9 (en) | Methods for mitigation of algorithmic bias discrimination, proxy discrimination and disparate impact | |
Yang et al. | Autonomous semantic community detection via adaptively weighted low-rank approximation | |
Li et al. | Dataset complexity assessment based on cumulative maximum scaled area under Laplacian spectrum | |
CN111737469A (en) | Data mining method and device, terminal equipment and readable storage medium | |
Putra et al. | Analyzing sentiments on official online lending platform in Indonesia with a Combination of Naive Bayes and Lexicon Based Method | |
CN110598192A (en) | Text feature reduction method based on neighborhood rough set |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |