CN108090042A - Method and apparatus for identifying text topics - Google Patents

Method and apparatus for identifying text topics

Info

Publication number
CN108090042A
CN108090042A CN201611051277.2A CN201611051277A
Authority
CN
China
Prior art keywords
keyword
text
topic
distribution
subset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611051277.2A
Other languages
Chinese (zh)
Inventor
张帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201611051277.2A priority Critical patent/CN108090042A/en
Publication of CN108090042A publication Critical patent/CN108090042A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application discloses a method and apparatus for identifying the topic of a text. One specific embodiment of the method includes: preprocessing a text to be identified to obtain a keyword set; randomly determining the topic of each keyword in the keyword set; counting the number of keywords each topic contains; repeating the following steps for each keyword in the keyword set until the result converges or a preset number of iterations is reached: decrementing by one the keyword count of the topic the keyword currently belongs to, sampling a new topic for the keyword from a probability distribution obtained by prior training, and incrementing by one the keyword count of the sampled topic; and determining the probability that each topic appears in the text to be identified from the number of keywords each topic contains and the number of keywords in the keyword set. This embodiment improves the accuracy of text topic identification.

Description

Method and apparatus for identifying text topics
Technical field
This application relates to the field of computer technology, in particular to the field of semantic analysis, and more particularly to a method and apparatus for identifying text topics.
Background technology
With the rapid development of the Internet, more and more users share information over the network. Because this information is abundant and heterogeneous, using it to support decision analysis has become an important problem. For example, in the field of e-commerce, more and more consumers shop online and review their shopping experience, and such product-review data grows explosively; it can provide rich decision references for merchants and consumers. It is therefore necessary to start from the textual features of the review data, identify useful reviews quickly and efficiently, discover consumers' opinions and attitudes, and perform sentiment analysis and prediction. Text feature extraction is the key step of this text mining.
At present, text features are typically extracted with methods from statistics or information theory, which pick out the keywords most correlated with category labels as the feature set. Most such methods are built on the bag-of-words (BOW) model: keywords are first extracted from the text, and an evaluation strategy (such as term frequency-inverse document frequency (TF-IDF), information gain, or mutual information) then selects the most valuable keywords as the feature vector. However, the feature vectors obtained this way are high-dimensional; for short texts such as reviews, the vector generated by the BOW model is very sparse, which complicates subsequent text processing. In addition, short texts such as reviews often have no clearly stated topic, which further degrades the results of BOW. For the feature extraction of short texts such as reviews, these methods therefore produce ineffective feature vectors and identify text topics with poor accuracy.
Summary of the invention
The purpose of this application is to propose an improved method and apparatus for identifying text topics, to solve the technical problems mentioned in the background section above.
In a first aspect, this application provides a method for identifying text topics, the method including: preprocessing a text to be identified to obtain a keyword set; randomly determining the topic of each keyword in the keyword set; counting the number of keywords each topic contains; repeating the following steps for each keyword in the keyword set until the result converges or a preset number of iterations is reached, where convergence means that the change in each topic's keyword distribution between repetitions is less than a predetermined threshold: decrementing by one the keyword count of the topic the keyword currently belongs to; sampling a new topic for the keyword from a probability distribution obtained by prior training, and incrementing by one the keyword count of the sampled topic; and determining the probability that each topic appears in the text to be identified from the number of keywords each topic contains and the number of keywords in the keyword set.
In some embodiments, the method further includes a step of training to obtain the probability distribution, which includes: obtaining a history text set, where the history text set includes at least one history text subset obtained by dividing the history text set according to the generation time of its texts and the number of texts; and obtaining, by training a text topic generation model, the keyword distribution of each topic in the history text subset to which the text to be identified belongs.
In some embodiments, obtaining the keyword distribution of each topic in the history text subset by training the text topic generation model includes: training the text topic generation model to obtain the keyword distribution of each topic in the history text subset with the earliest text generation time; and, based on the trained keyword distributions and the generation times of the texts, determining in order the topic distribution of the texts and the keyword distribution of each topic in every subset other than the one with the earliest text generation time.
In some embodiments, training the text topic generation model to obtain the keyword distribution of each topic in the subset with the earliest text generation time includes performing the following steps for each text in that subset, until the text is generated: for each topic, sampling a multinomial distribution from a first Dirichlet distribution as the topic's distribution over keywords; randomly sampling a value from a discrete probability distribution as the length of the text; sampling a multinomial distribution from a second Dirichlet distribution as the text's distribution over topics; and, for each keyword in the text, sampling a topic from the text's distribution over topics and then sampling a keyword from the sampled topic's distribution over keywords.
In some embodiments, the method further includes: calculating the term frequency-inverse document frequency (TF-IDF) value of each keyword in the keyword set; and, in response to the ratio of the TF-IDF value to the number of subsets the keyword appears in being less than a predetermined threshold, adding the keyword to a stop-word list; and the preprocessing consists of word segmentation and deleting stop words according to the stop-word list.
In a second aspect, this application provides an apparatus for identifying text topics, the apparatus including: a preprocessing unit configured to preprocess a text to be identified to obtain a keyword set; a topic determination unit configured to randomly determine the topic of each keyword in the keyword set; a statistics unit configured to count the number of keywords each topic contains; a sampling unit configured to repeat the following steps for each keyword in the keyword set until the result converges or a preset number of iterations is reached, where convergence means that the change in each topic's keyword distribution between repetitions is less than a predetermined threshold: decrementing by one the keyword count of the topic the keyword currently belongs to, sampling a new topic for the keyword from a probability distribution obtained by prior training, and incrementing by one the keyword count of the sampled topic; and a probability determination unit configured to determine the probability that each topic appears in the text to be identified from the number of keywords each topic contains and the number of keywords in the keyword set.
In some embodiments, the apparatus further includes a training unit, which includes: an acquisition subunit configured to obtain a history text set, where the history text set includes at least one history text subset obtained by dividing the history text set according to the generation time of its texts and the number of texts; and a training subunit configured to obtain, by training a text topic generation model, the keyword distribution of each topic in the history text subset to which the text to be identified belongs.
In some embodiments, the training subunit is further configured to: train the text topic generation model to obtain the keyword distribution of each topic in the history text subset with the earliest text generation time; and, based on the trained keyword distributions and the generation times of the texts, determine in order the topic distribution of the texts and the keyword distribution of each topic in every subset other than the one with the earliest text generation time.
In some embodiments, the training subunit is further configured to perform the following steps for each text in the subset with the earliest text generation time, until the text is generated: for each topic, sample a multinomial distribution from a first Dirichlet distribution as the topic's distribution over keywords; randomly sample a value from a discrete probability distribution as the length of the text; sample a multinomial distribution from a second Dirichlet distribution as the text's distribution over topics; and, for each keyword in the text, sample a topic from the text's distribution over topics and then sample a keyword from the sampled topic's distribution over keywords.
In some embodiments, the preprocessing consists of word segmentation and deleting stop words according to a stop-word list; and the apparatus further includes: a calculation unit configured to calculate the TF-IDF value of each keyword in the keyword set; and an adding unit configured to, in response to the ratio of the TF-IDF value to the number of subsets the keyword appears in being less than a predetermined threshold, add the keyword to the stop-word list.
The method and apparatus for identifying text topics provided by this application preprocess a text to be identified to obtain a keyword set, randomly determine the topic of each keyword in the keyword set, count the number of keywords each topic contains, and then repeat the following steps for each keyword in the keyword set until the result converges or a preset number of iterations is reached: decrement by one the keyword count of the topic the keyword currently belongs to, sample a new topic for the keyword from a probability distribution obtained by prior training, and increment by one the keyword count of the sampled topic. Finally, the probability that each topic appears in the text to be identified is determined from the number of keywords each topic contains and the number of keywords in the keyword set, improving the accuracy of text topic identification.
Description of the drawings
Other features, objects, and advantages of this application will become more apparent upon reading the detailed description of the non-limiting embodiments made with reference to the following drawings:
Fig. 1 is an exemplary system architecture diagram to which this application may be applied;
Fig. 2 is a flowchart of one embodiment of the method for identifying text topics according to this application;
Fig. 3 is a schematic diagram of an application scenario of the method for identifying text topics according to this application;
Fig. 4 is a flowchart of another embodiment of the method for identifying text topics according to this application;
Fig. 5 is a structural diagram of one embodiment of the apparatus for identifying text topics according to this application;
Fig. 6 is a structural diagram of a computer system suitable for implementing the server of the embodiments of this application.
Detailed description of the embodiments
This application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are used only to explain the related invention and do not limit it. It should also be noted that, for convenience of description, only the parts relevant to the invention are shown in the drawings.
It should be noted that in the case where there is no conflict, the feature in embodiment and embodiment in the application can phase Mutually combination.The application is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
Fig. 1 shows an exemplary system architecture 100 in which embodiments of the method or apparatus for identifying text topics of this application may be applied.
As shown in Fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 provides the medium of the transmission links between the terminal devices 101, 102, 103 and the server 105, and may include various connection types, such as wired or wireless transmission links or fiber-optic cables.
Users may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages. Various applications may be installed on the terminal devices 101, 102, 103, for example e-commerce applications, instant-messaging applications, browsers, search applications, and word-processing applications.
The terminal devices 101, 102, 103 may be various electronic devices, including but not limited to smartphones, tablet computers, e-book readers, laptop computers, and desktop computers.
The server 105 may be a background server providing support for the applications installed on the terminal devices 101, 102, 103. For example, the server 105 may obtain a text to be identified uploaded by a terminal device 101, 102, 103 and, through the preprocessing and sampling steps, determine the probability that each topic appears in the text, thereby identifying the topic of the text; it may further filter texts into categories according to their topics. The server 105 may likewise identify the topics of texts to be identified that are stored on other servers and, in response to a request from a terminal device 101, 102, 103 to view texts of a certain topic, send the texts of that topic to the terminal device 101, 102, 103.
It should be noted that the method for identifying text topics provided by the embodiments of this application is generally performed by the server 105, and correspondingly, the apparatus for identifying text topics is generally arranged in the server 105.
It should be understood that the numbers of terminal devices, networks, and servers in Fig. 1 are merely illustrative; there may be any number of terminal devices, networks, and servers as implementation requires.
Referring to Fig. 2, it illustrates a flow 200 of one embodiment of the method for identifying text topics according to this application. As noted above, the method provided by the embodiments of this application is generally performed by the server 105 in Fig. 1. The method includes the following steps:
Step 201: preprocess the text to be identified to obtain a keyword set.
In this embodiment, the electronic device on which the method for identifying text topics runs (such as the server shown in Fig. 1) may obtain the text to be identified from a terminal or another server through a wired or wireless connection, and preprocess it to obtain a keyword set. The text to be identified may be a user's review on a shopping website or a text published by a user on a social network. The preprocessing may include operations such as word segmentation and deleting stop words. The keyword set may be presented as a word sequence. Each keyword in the keyword set can be regarded as produced by the process "select a topic with some probability, then select a word from that topic with some probability."
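As a minimal sketch of this preprocessing step (a plain whitespace split stands in for a real word segmenter, which the text leaves unspecified, and the function name is illustrative):

```python
def preprocess(text, stopwords):
    """Segment the text and delete stop words, yielding the keyword
    sequence used by the later steps."""
    tokens = text.lower().split()          # stand-in for real segmentation
    return [t for t in tokens if t not in stopwords]
```

For example, `preprocess("The battery lasts all day", {"the", "all"})` yields `["battery", "lasts", "day"]`.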
Step 202: randomly determine the topic of each keyword in the keyword set.
In this embodiment, based on the keyword set obtained in step 201, the electronic device may first randomly determine the topic of each keyword in the keyword set, which amounts to sampling the topics of the text and assigning an initial value to each keyword's topic. The number of topics may be preset; as a rule for setting it, the smaller the similarity between topics, the better the topic number.
Step 203: count the number of keywords each topic contains.
In this embodiment, based on the topic determined for each keyword in step 202, the electronic device may count the number of keywords each topic contains.
Step 204: repeat the following steps for each keyword in the keyword set until the result converges or a preset number of iterations is reached: decrement by one the keyword count of the topic the keyword currently belongs to; sample a new topic for the keyword from the probability distribution obtained by prior training, and increment by one the keyword count of the sampled topic.
In this embodiment, based on the counts obtained in step 203, the electronic device may repeat the following steps for each keyword in the keyword set until the result converges or a preset number of iterations is reached: decrement by one the keyword count of the topic the keyword currently belongs to; sample a new topic for the keyword from the probability distribution obtained by prior training, and increment by one the keyword count of the sampled topic. Here convergence means that the change in each topic's keyword distribution between repetitions is less than a predetermined threshold; the change may be measured as the change in the number or in the kinds of keywords each topic contains. This step can be solved approximately by random sampling. As an example, suppose the topic distribution of a text and the keyword distribution of a topic are multinomial distributions sampled from Dirichlet distributions with parameters α and β respectively; then sampling may be performed according to the following formula:
p(z_i = k | m, α, β, t) = θ_{m,k} · φ_{k,v} ∝ ((n_{m,k} + α) / (n_m + K·α)) · ((n_{k,v} + β) / (n_k + V_t·β))
where p(z_i = k | m, α, β, t) denotes the probability that the topic of the i-th word in the text to be identified is topic k, and t denotes the time slice the text belongs to (since the history text subsets are divided by time, a text's time slice identifies its subset); θ_{m,k} denotes the probability that topic k appears in text m; φ_{k,v} denotes the probability that topic k contains the i-th word v = z_i; n_m denotes the total number of words in the text m to be identified; V_t denotes the total number of words in the text set of time slice t; n_{m,k} denotes the number of times topic k occurs in text m; n_{k,v} denotes the number of times word v appears in topic k; and n_k denotes the total number of words topic k contains. The parameters α and β may take the values α = 50/K and β = 0.01, where K is the preset total number of topics.
The sampling may use the Gibbs sampling algorithm. Gibbs sampling is one of the MCMC (Markov chain Monte Carlo) algorithms, used to construct random samples of a multivariate probability distribution — for example, to construct the joint distribution of two or more variables, compute integrals, and estimate expectations. Gibbs sampling needs a convergence process: only the samples obtained after the chain reaches its equilibrium state are samples of the target distribution, so the samples generated before convergence must all be discarded. There is as yet no mature method for deciding whether a chain has reached equilibrium; the common approach is to check whether the state is stable (for example, plot a statistic over time — if it changes little over a long period, the chain has probably equilibrated). Methods such as graphical inspection or Monte Carlo error may be employed to check convergence; in engineering practice, the number of iterations needed for Gibbs sampling to reach equilibrium is more often specified from experience and observation of the data.
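The decrement/sample/increment loop of steps 202 through 205 can be sketched as follows. This is a hedged illustration, not the patent's implementation: it assumes a collapsed-Gibbs-style conditional weight, takes the pretrained keyword distribution as a list `phi` of keyword-to-probability mappings, and every name is invented for the example.

```python
import random
from collections import Counter

def identify_topics(keywords, phi, alpha=None, max_iters=100, seed=0):
    """Sketch of steps 202-205: phi[k] maps keyword -> probability
    under topic k (obtained by prior training)."""
    rng = random.Random(seed)
    K = len(phi)
    if alpha is None:
        alpha = 50.0 / K                     # heuristic value from the text
    # step 202: randomly assign an initial topic to each keyword
    z = [rng.randrange(K) for _ in keywords]
    n_k = Counter(z)                         # step 203: keywords per topic
    for _ in range(max_iters):               # step 204
        for i, w in enumerate(keywords):
            n_k[z[i]] -= 1                   # remove word from its topic
            # conditional weight of each topic for this word
            weights = [(n_k[k] + alpha) * phi[k].get(w, 1e-12)
                       for k in range(K)]
            r, acc, new_k = rng.random() * sum(weights), 0.0, K - 1
            for k, wt in enumerate(weights):
                acc += wt
                if r <= acc:
                    new_k = k
                    break
            z[i] = new_k
            n_k[new_k] += 1                  # add word to sampled topic
    # step 205: probability that each topic appears in the text
    n_m = len(keywords)
    return [(n_k[k] + alpha) / (n_m + K * alpha) for k in range(K)]
```

With a pretrained `phi` that strongly associates the text's keywords with one topic, the returned probabilities concentrate on that topic.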
In some optional implementations of this embodiment, the method further includes a step of training to obtain the probability distribution, which includes: obtaining a history text set, where the history text set includes at least one history text subset obtained by dividing the history text set according to the generation time of its texts and the number of texts; and obtaining, by training a text topic generation model, the keyword distribution of each topic in the history text subset to which the text to be identified belongs.
In some optional implementations of this embodiment, obtaining the keyword distribution of each topic in the history text subset by training the text topic generation model includes: training the text topic generation model to obtain the keyword distribution of each topic in the history text subset with the earliest text generation time; and, based on the trained keyword distributions and the generation times of the texts, determining in order the topic distribution of the texts and the keyword distribution of each topic in every subset other than the one with the earliest text generation time.
In some optional implementations of this embodiment, training the text topic generation model to obtain the keyword distribution of each topic in the subset with the earliest text generation time includes performing the following steps for each text in that subset, until the text is generated: for each topic, sampling a multinomial distribution from a first Dirichlet distribution as the topic's distribution over keywords; randomly sampling a value from a discrete probability distribution as the length of the text; sampling a multinomial distribution from a second Dirichlet distribution as the text's distribution over topics; and, for each keyword in the text, sampling a topic from the text's distribution over topics and then sampling a keyword from the sampled topic's distribution over keywords.
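The generative story above (an LDA-style model) can be sketched with standard-library sampling, building Dirichlet draws from `random.gammavariate`. A fixed `length` parameter stands in for the unspecified discrete length distribution, and all names here are illustrative assumptions:

```python
import random

def sample_dirichlet(alphas, rng):
    # a Dirichlet draw is a normalized vector of Gamma draws
    draws = [max(rng.gammavariate(a, 1.0), 1e-300) for a in alphas]
    s = sum(draws)
    return [d / s for d in draws]

def sample_categorical(probs, rng):
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1

def generate_text(num_topics, vocab, alpha=0.5, beta=0.01, length=8, rng=None):
    """Generate one text under the LDA-style story described above."""
    rng = rng or random.Random(0)
    # first Dirichlet: each topic's distribution over keywords
    phi = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(num_topics)]
    # second Dirichlet: the text's distribution over topics
    theta = sample_dirichlet([alpha] * num_topics, rng)
    words = []
    for _ in range(length):                  # length from a fixed stand-in
        z = sample_categorical(theta, rng)   # sample a topic for this position
        words.append(vocab[sample_categorical(phi[z], rng)])  # then a keyword
    return words
```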
In some optional implementations of this embodiment, the method further includes: calculating the term frequency-inverse document frequency (TF-IDF) value of each keyword in the keyword set; and, in response to the ratio of the TF-IDF value to the number of history text subsets the keyword appears in being less than a predetermined threshold, adding the keyword to a stop-word list; and the preprocessing consists of word segmentation and deleting stop words according to the stop-word list. The predetermined threshold can be set according to actual needs, for example to 0.01, and the TF-IDF value of each keyword can be computed with the TF-IDF algorithm. TF (term frequency) equals the number of times the word appears divided by the total number of words in the document. IDF (inverse document frequency) is a measure of a word's general importance, obtained by dividing the total number of documents by the number of documents containing the word and taking the logarithm of the quotient; the IDF computation can be based on a corpus obtained in advance or on the whole history text set, with the total document count being, correspondingly, the number of documents in the corpus or the number of texts in the history text set. As an example, for a keyword w, if T_fidf(w) / N_w < ψ, then w is added to the stop-word list, where N_w denotes the number of history text subsets w appears in, T_fidf(w) denotes the TF-IDF value of w, and ψ is the predetermined threshold. The word frequency of each word may likewise be tracked continuously, and a word whose probability of occurring in texts exceeds a certain threshold is added to the stop-word dictionary.
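The stop-word rule above can be sketched as follows. This version computes TF over the whole history set and IDF over all its documents (one of the two bases the text allows), rather than per document, and the function name is illustrative:

```python
import math
from collections import Counter

def build_stoplist(subsets, psi=0.01):
    """A keyword whose TF-IDF value divided by the number of history
    subsets it appears in (N_w) falls below psi joins the stop list."""
    docs = [doc for subset in subsets for doc in subset]
    n_docs = len(docs)
    df = Counter()                         # document frequency per word
    for doc in docs:
        df.update(set(doc))
    tf = Counter()                         # raw term counts over the corpus
    total = sum(len(doc) for doc in docs)
    for doc in docs:
        tf.update(doc)
    stoplist = set()
    for w, c in tf.items():
        tfidf = (c / total) * math.log(n_docs / df[w])
        # N_w: number of history subsets containing w
        n_w = sum(1 for s in subsets if any(w in doc for doc in s))
        if tfidf / n_w < psi:
            stoplist.add(w)
    return stoplist
```

A word occurring in every document gets IDF = log(1) = 0, so its TF-IDF is 0 and it is always stop-listed, matching the intuition behind the rule.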
Step 205: determine the probability that each topic appears in the text to be identified from the number of keywords each topic contains and the number of keywords in the keyword set.
In this embodiment, the electronic device may determine the probability that each topic appears in the text to be identified from the keyword counts of each topic obtained in step 204 and the number of keywords in the keyword set. The proportion of keywords in the keyword set belonging to each topic may serve as the probability that the topic appears in the text, or a more accurate statistical method may be used for the calculation. As an example, suppose the topic distribution of a text is a multinomial distribution sampled from a Dirichlet distribution with parameter α; then the probability P(z_i = k | m) that topic k appears in text m can be calculated according to the following formula:
P(z_i = k | m) = θ_{m,k} = (n_{m,k} + α) / (n_m + K·α)
where θ_{m,k} denotes the probability that topic k appears in text m; n_m denotes the total number of words in the text m to be identified; V_t denotes the total number of words in the text set of time slice t; n_{m,k} denotes the number of times topic k occurs in text m; n_{k,v} denotes the number of times word v appears in topic k; and n_k denotes the total number of words topic k contains.
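The formula reduces to a smoothed count ratio; a minimal sketch under the notation above (the function name is illustrative):

```python
def topic_probability(n_mk, n_m, num_topics, alpha):
    """theta_{m,k} = (n_{m,k} + alpha) / (n_m + K * alpha): the smoothed
    share of text m's keywords assigned to topic k."""
    return (n_mk + alpha) / (n_m + num_topics * alpha)
```

Summing over all topics yields 1 whenever the per-topic counts sum to n_m.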
Continuing with Fig. 3, Fig. 3 is a schematic diagram of an application scenario of the method for identifying text topics according to this embodiment. In the application scenario of Fig. 3, the framework 300 for identifying text topics mainly includes two parts: training to obtain the probability distribution, and identifying the topics of reviews in real time. The training step includes: Step 301: in response to the number of untrained reviews in the history review set reaching a preset value, obtain the untrained reviews as the review set to be trained, where reviews that need real-time identification are also added to the history review set. Step 302: use the text topic generation model to obtain the keyword distribution of each topic in the review set to be trained. Step 303: determine the keyword distribution of each topic needed for real-time identification from the keyword distributions of the topics in the review sets whose review times are earlier than all reviews in the set to be trained, together with the keyword distributions of the topics in the set to be trained. As an example, when the number of untrained reviews in the history review set reaches the preset value for the first time, the text topic generation model is trained for the first time, obtaining the keyword distribution φ̂₁ of each topic; when the number reaches the preset value a second time, the model is trained again, obtaining the keyword distribution φ̂₂. Combining φ̂₁ and φ̂₂ yields the current keyword distribution of each topic of the history review set, φ = π·φ̂₂ + (1 − π)·φ̂₁, where π may take a value in the range 0 to 1; and so on, smoothing the distribution obtained in each round of training.
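The per-round smoothing can be sketched as a π-weighted blend of the previously trained and newly trained keyword distributions of a topic. The blend direction (weight π on the newer round) is an assumption, since the scenario only states that each round's result is smoothed against earlier ones:

```python
def smooth_distributions(phi_old, phi_new, pi=0.5):
    """Blend two keyword distributions of one topic: pi weights the
    newly trained round, (1 - pi) the earlier one."""
    vocab = set(phi_old) | set(phi_new)
    return {w: pi * phi_new.get(w, 0.0) + (1 - pi) * phi_old.get(w, 0.0)
            for w in vocab}
```

Because the blend is convex, the result remains a probability distribution whenever both inputs are.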
The step of identifying the theme of a comment in real time includes: Step 304, preprocessing to obtain a keyword set. Step 305, randomly determining the theme to which each keyword in the keyword set belongs. Step 306, executing, for each keyword in the keyword set: Step 3061, subtracting one from the number of keywords contained in the theme to which the keyword belongs; Step 3062, resampling the theme to which the keyword belongs, where the sampling can be carried out according to the probability distribution obtained by training in advance; since the comment to be identified in real time has also been added to the historical comment set, the pre-trained keyword distribution of each theme should also include the keywords obtained after preprocessing that comment; Step 3063, adding one to the number of keywords contained in the theme obtained by sampling. Step 307, judging whether the result converges or whether a preset number of iterations is reached; if yes, proceeding to Step 308, otherwise returning to Step 306. Step 308, calculating the probability that each theme appears in the comment to be identified.
The method provided by the above embodiment of the application preprocesses the text to be identified to obtain a keyword set, randomly determines the theme to which each keyword in the keyword set belongs, and counts the number of keywords contained in each theme; it then repeats the following steps for each keyword in the keyword set until the result converges or a preset number of iterations is reached: subtracting one from the number of keywords contained in the theme to which the keyword belongs, sampling according to the probability distribution obtained by training in advance to obtain the theme to which the keyword belongs, and adding one to the number of keywords contained in the theme obtained by sampling. Finally, the probability that each theme appears in the text to be identified is determined according to the number of keywords contained in each theme and the number of keywords in the keyword set, which improves the accuracy of text theme identification.
With further reference to Fig. 4, a flow 400 of another embodiment of the method for identifying a text theme is illustrated. The flow 400 of the method for identifying a text theme includes the following steps:
Step 401, acquiring a history text set.
In the present embodiment, the electronic device on which the method for identifying a text theme runs (such as the server shown in Fig. 1) can acquire a history text set. The history text set includes at least one history text subset, and the history text subsets are obtained by dividing the history text set according to the generation time of the texts in the history text set and the quantity of texts. The history text set is dynamic: the above electronic device can continually obtain new texts, including the text to be identified, and add them to the history text set as training corpus.
Step 402, obtaining, through text theme generation model training, the keyword distribution of each theme in the history text subset with the earliest text generation time.
In the present embodiment, based on each subset of the history text set obtained in step 401, the above electronic device can obtain, through text theme generation model training, the keyword distribution of each theme in the history text subset with the earliest text generation time. The theme distribution of the texts in a history text subset can likewise be obtained through the text theme generation model. The text theme generation model can be an LDA (Latent Dirichlet Allocation) model, an hLDA (hierarchical Latent Dirichlet Allocation) model, an HDP (Hierarchical Dirichlet Process) model, or a GSDMM (collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model). As an example, the text theme generation model adopts the LDA model. The existing history text set D is divided in time order according to the generation time of the texts into multiple subsets D = {d₁, d₂, …, dᵢ, …}, where the time span of each subset is Tᵢ = τ·T₀, τ is a time-order factor with value range τ ≥ 1, and T₀ is a reference span interval. Model training is performed on each subset dᵢ: for every text m in dᵢ, the theme vector z_m of the text is obtained through LDA, and the number of occurrences of the words in each theme and the total number of words contained are counted. The specific training process is as follows: each word in the text is mapped to a probability space to obtain an explanation vector of the short text; Gibbs sampling is then used to cluster the explanation vectors in the concept space through the LDA model, the text is modeled, and the probability distribution of "text-theme-word" is obtained. Finally, the theme vector of the text is obtained, and the themes are used as the feature vector of the text.
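The time-ordered division D = {d₁, d₂, …} with span Tᵢ = τ·T₀ can be sketched as below. The (timestamp, text) tuple layout and the function name are assumptions for illustration only.

```python
# Illustrative sketch of dividing a history text set into time-ordered
# subsets d_1, d_2, ... of span tau * T0, as described above.
def split_by_timeslice(texts, t0, tau=1.0):
    assert tau >= 1.0                      # value range of the time-order factor
    span = tau * t0
    start = min(ts for ts, _ in texts)     # earliest generation time
    subsets = {}
    for ts, text in texts:
        index = int((ts - start) // span)  # which timeslice the text falls in
        subsets.setdefault(index, []).append(text)
    return [subsets[i] for i in sorted(subsets)]

slices = split_by_timeslice([(0, "a"), (5, "b"), (12, "c")], t0=10)
```

Each returned subset then serves as one training corpus dᵢ for the per-timeslice model training described above.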
The input of the text generation algorithm based on the LDA model is the text set W obtained after preprocessing the texts in the subset, the parameters α and β of the Dirichlet distributions, and the number of themes K; the output is the probability matrix θ of texts over themes and the probability matrix φ of themes over keywords, where the values of the parameters can be α = 50/K and β = 0.01. θᵢ and φᵢ of the texts in timeslice Tᵢ can be calculated as follows. First, for each theme, a multinomial distribution is sampled from a Dirichlet distribution with parameter β as the keyword distribution of that theme, φ_k ~ Dirichlet(β). Then, for every text m, a value is sampled from a Poisson distribution as the text length, N_m ~ Poisson(ξ), and a multinomial distribution is sampled from a Dirichlet distribution with parameter α as the theme distribution of that text, θ_m ~ Dirichlet(α). Finally, for each word in the text, a theme is first sampled from the multinomial theme distribution of the text, z_{m,n} ~ Mult(θ_m), and then a word is sampled from the multinomial word distribution of that theme, w_{m,n} ~ Mult(φ_{z_{m,n}}). This random generation process is repeated until all texts in the text set are generated; at the same time, the total number N_w of occurrences of each word in the current text set can be counted.
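The generative process above can be sketched with standard-library sampling. This is an illustrative sketch only: the helper names are assumptions, the fixed text length stands in for the Poisson draw N_m ~ Poisson(ξ), and the Dirichlet draw is built from gamma variates in the usual way.

```python
import random

# Illustrative sketch of the LDA generative process described above:
# phi_k ~ Dirichlet(beta), theta_m ~ Dirichlet(alpha),
# z ~ Mult(theta_m), w ~ Mult(phi_z).
rng = random.Random(42)

def sample_dirichlet(param, dim):
    # A symmetric Dirichlet sample as normalized gamma variates.
    xs = [rng.gammavariate(param, 1.0) for _ in range(dim)]
    total = sum(xs)
    return [x / total for x in xs]

def sample_multinomial(probs):
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def generate_text(num_themes, vocab_size, length, alpha, beta):
    phi = [sample_dirichlet(beta, vocab_size) for _ in range(num_themes)]
    theta = sample_dirichlet(alpha, num_themes)   # theme distribution of the text
    words = []
    for _ in range(length):                       # length would be N_m ~ Poisson(xi)
        z = sample_multinomial(theta)             # sample a theme for this position
        words.append(sample_multinomial(phi[z]))  # sample a word from that theme
    return words

doc = generate_text(num_themes=3, vocab_size=20, length=10, alpha=1.0, beta=0.1)
```

Repeating `generate_text` for every text of the subset reproduces the "generate the full text set while counting word occurrences" loop described above.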
Step 403, based on the keyword distribution of each theme in the history text subset obtained by training, determining, in turn according to the generation time of the texts in the history text subsets, the theme distribution of texts and the keyword distribution of each theme in the other subsets except the subset with the earliest text generation time.
In the present embodiment, based on the keyword distribution of each theme in the history text subset with the earliest text generation time obtained in step 402, the above electronic device can determine, in turn according to the generation time of the texts in the history text subsets, the theme distribution of texts and the keyword distribution of each theme in the other subsets except the subset with the earliest text generation time. As an example, with the history text subset having the earliest text generation time taken as the subset corresponding to timeslice T₁, the theme distribution of texts and the keyword distribution of themes in the subset corresponding to timeslice T₁ obtained in step 402 can be combined with the theme distribution of texts and the keyword distribution of themes obtained by text theme generation model training on the subset with the second earliest text generation time, i.e., the subset corresponding to timeslice T₂, to obtain the theme distribution θ and the theme keyword distribution φ of the newest texts. And so on: θᵢ₋₁ and φᵢ₋₁ of the subset corresponding to timeslice Tᵢ₋₁ can be combined, so that the probability matrix of texts over themes is θ = πθᵢ + (1 - π)θᵢ₋₁ and the probability matrix of themes over words is φ = πφᵢ + (1 - π)φᵢ₋₁. Considering that the subset corresponding to timeslice T₁ has no corresponding earlier timeslice to refer to, π = 1 is taken for the subset corresponding to timeslice T₁. By introducing the notion of time order in this way, smoothing and dynamically updating the model parameters, the features of online texts can be obtained in real time.
Step 404, preprocessing the text to be identified to obtain a keyword set.
In the present embodiment, the electronic device on which the method for identifying a text theme runs (such as the server shown in Fig. 1) can obtain the text to be identified from a terminal or another server through a wired or wireless connection, and preprocess the text to be identified to obtain a keyword set.
Step 405, randomly determining the theme to which each keyword in the keyword set belongs.
In the present embodiment, based on the keyword set obtained in step 404, the above electronic device can first randomly determine the theme to which each keyword in the keyword set belongs, which is equivalent to sampling the themes of the text and assigning an initial value to the theme of each keyword. The number of themes can be preset; as a setting principle, a number of themes that makes the similarity between themes smaller is better.
Step 406, counting the number of keywords contained in each theme.
In the present embodiment, based on the theme to which each keyword belongs determined in step 405, the above electronic device can count the number of keywords contained in each theme.
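Steps 405 and 406 can be sketched together as below. The function name, the fixed seed, and the example keywords are assumptions for illustration.

```python
import random
from collections import Counter

# Illustrative sketch of steps 405-406: randomly assign an initial theme to
# each keyword, then count the keywords contained in each theme.
def init_assignments(keywords, num_themes, seed=0):
    rng = random.Random(seed)
    assignments = [rng.randrange(num_themes) for _ in keywords]
    theme_counts = Counter(assignments)      # keywords contained in each theme
    return assignments, theme_counts

assignments, counts = init_assignments(["good", "fast", "cheap"], num_themes=2)
```

The counts are the state that the sampling loop of step 407 subsequently decrements and increments.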
Step 407, repeating the following steps for each keyword in the keyword set until the result converges or a preset number of iterations is reached: subtracting one from the number of keywords contained in the theme to which the keyword belongs; sampling according to the probability distribution obtained by training in advance to obtain the theme to which the keyword belongs; and adding one to the number of keywords contained in the theme obtained by sampling.
In the present embodiment, based on the number of keywords contained in each theme counted in step 406, the above electronic device can repeat the following steps for each keyword in the keyword set until the result converges or a preset number of iterations is reached: subtracting one from the number of keywords contained in the theme to which the keyword belongs; sampling according to the probability distribution obtained by training in advance to obtain the theme to which the keyword belongs; and adding one to the number of keywords contained in the theme obtained by sampling. Here, convergence of the result means that the variation of the keyword distribution of each theme obtained by repeating the above steps is less than a predetermined threshold; the variation of the keyword distribution of each theme can be the variation of the number of keywords contained in each theme or the variation of their kinds. This step can also be carried out as an approximate solution by a stochastic simulation (Monte Carlo) method.
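One pass of the loop in step 407 can be sketched as follows. This is an illustrative sketch, not the patent's code: `phi` stands for the pre-trained keyword distribution of each theme, and sampling proportionally to P(keyword | theme) is a simplification of the full conditional a collapsed Gibbs sampler would use.

```python
import random

# Illustrative sketch of step 407: for each keyword, remove it from its
# current theme's count, resample a theme according to the pre-trained
# distribution phi, and add it to the sampled theme's count.
def gibbs_pass(keywords, assignments, theme_counts, phi, rng):
    for i, kw in enumerate(keywords):
        theme_counts[assignments[i]] -= 1          # subtract one from old theme
        weights = [p.get(kw, 1e-12) for p in phi]  # pre-trained P(keyword | theme)
        assignments[i] = rng.choices(range(len(phi)), weights=weights)[0]
        theme_counts[assignments[i]] += 1          # add one to sampled theme

rng = random.Random(0)
phi = [{"good": 0.9, "slow": 0.1}, {"good": 0.1, "slow": 0.9}]
keywords = ["good", "good", "slow"]
assignments, theme_counts = [0, 1, 0], [2, 1]
for _ in range(20):                                # until convergence or limit
    gibbs_pass(keywords, assignments, theme_counts, phi, rng)
```

The decrement/sample/increment structure keeps the counts consistent with the assignments after every pass, which is the invariant the convergence check in step 407 relies on.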
Step 408, determining the probability that each theme appears in the text to be identified according to the number of keywords contained in each theme and the number of keywords in the keyword set.
In the present embodiment, the above electronic device can determine the probability that each theme appears in the text to be identified according to the number of keywords contained in each theme obtained in step 407 and the number of keywords in the keyword set. The proportion of keywords in the keyword set belonging to each theme can be taken as the probability that the theme appears in the text to be identified, or other more accurate statistical methods can be used for the calculation.
It can be seen from Fig. 4 that, unlike the embodiment corresponding to Fig. 2, the present embodiment adds the step of training to obtain probability distributions. Feature extraction is performed on short texts such as comments by calculating the probability of each theme and using it as the final feature vector of the short text, which solves the problems that the features of short texts are high-dimensional and sparse and that the features are indefinite.
With further reference to Fig. 5, as an implementation of the methods shown in the above figures, the application provides an embodiment of a device for identifying a text theme. The device embodiment corresponds to the method embodiment shown in Fig. 2, and the device can specifically be applied in various electronic devices.
As shown in Fig. 5, the device 500 for identifying a text theme of the present embodiment includes: a preprocessing unit 501, a theme determination unit 502, a statistics unit 503, a sampling unit 504 and a probability determination unit 505. The preprocessing unit 501 is configured to preprocess a text to be identified to obtain a keyword set; the theme determination unit 502 is configured to randomly determine the theme to which each keyword in the keyword set belongs; the statistics unit 503 is configured to count the number of keywords contained in each theme; the sampling unit 504 is configured to repeat the following steps for each keyword in the keyword set until the result converges or a preset number of iterations is reached, wherein convergence of the result means that the variation of the keyword distribution of each theme obtained by repeating the following steps is less than a predetermined threshold: subtracting one from the number of keywords contained in the theme to which the keyword belongs; sampling according to the probability distribution obtained by training in advance to obtain the theme to which the keyword belongs; adding one to the number of keywords contained in the theme obtained by sampling; and the probability determination unit 505 is configured to determine the probability that each theme appears in the text to be identified according to the number of keywords contained in each theme and the number of keywords in the keyword set.
In the present embodiment, for the specific processing of the preprocessing unit 501, the theme determination unit 502, the statistics unit 503, the sampling unit 504 and the probability determination unit 505 in the device 500 for identifying a text theme, reference may be made to the related descriptions of the implementations of step 201, step 202, step 203, step 204 and step 205 in the embodiment corresponding to Fig. 2, and details are not described here again.
In some optional implementations of the present embodiment, the device further includes a training unit (not shown). The training unit includes: an obtaining subunit (not shown), configured to obtain a history text set, wherein the history text set includes at least one history text subset, and the history text subsets are obtained by dividing the history text set according to the generation time of the texts in the history text set and the quantity of texts; and a training subunit (not shown), configured to obtain, through text theme generation model training, the keyword distribution of each theme in the history text subset where the text to be identified is located.
In some optional implementations of the present embodiment, the training subunit is further configured to: obtain, through text theme generation model training, the keyword distribution of each theme in the history text subset with the earliest text generation time; and, based on the keyword distribution of each theme in the history text subset obtained by training, determine, in turn according to the generation time of the texts in the history text subsets, the theme distribution of texts and the keyword distribution of each theme in the other subsets except the subset with the earliest text generation time.
In some optional implementations of the present embodiment, the training subunit is further configured to: for each text in the subset with the earliest text generation time, perform the following steps until the text is generated: for each theme, sampling a multinomial distribution from a first Dirichlet distribution as the distribution of the theme over keywords; randomly sampling a value from a discrete probability distribution as the length of the text; sampling a multinomial distribution from a second Dirichlet distribution as the distribution of the text over themes; and, for each keyword in the text, sampling a theme from the distribution of the text over themes and then sampling a keyword from the distribution over keywords of the sampled theme.
In some optional implementations of the present embodiment, the preprocessing is word segmentation and deletion of stop words according to a stop-word list; and the device further includes: a calculation unit (not shown), configured to calculate the term frequency-inverse document frequency value of each keyword in the keyword set; and an adding unit (not shown), configured to add the keyword to the stop-word list in response to the ratio of the term frequency-inverse document frequency value to the number of subsets in which the keyword appears being less than a predetermined threshold.
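The tf-idf stop-word filter can be sketched as follows. This is an illustrative sketch under assumed names and an assumed example threshold; the exact tf-idf formula and threshold are not specified by the text.

```python
import math

# Illustrative sketch of the optional filter described above: a keyword whose
# tf-idf value, taken relative to the number of subsets it appears in, falls
# below a threshold is added to the stop-word list.
def tfidf(term, doc, subsets):
    tf = doc.count(term) / len(doc)
    df = sum(1 for s in subsets if term in s)  # subsets containing the term
    idf = math.log(len(subsets) / (1 + df))    # smoothed inverse frequency
    return tf * idf, df

def maybe_stop_word(term, doc, subsets, stop_words, threshold=0.05):
    value, df = tfidf(term, doc, subsets)
    if df and value / df < threshold:          # low ratio -> uninformative word
        stop_words.add(term)

stop_words = set()
subsets = [["the", "phone"], ["the", "screen"], ["the", "battery"]]
maybe_stop_word("the", ["the", "phone", "the"], subsets, stop_words)
maybe_stop_word("phone", ["the", "phone", "the"], subsets, stop_words)
```

A word appearing in every subset ("the") gets a low or negative tf-idf and is filtered out, while a subset-specific word ("phone") is kept.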
Referring now to Fig. 6, a structural schematic diagram of a computer system 600 suitable for implementing the server of the embodiments of the application is illustrated.
As shown in Fig. 6, the computer system 600 includes a central processing unit (CPU) 601, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage portion 608 into a random access memory (RAM) 603. Various programs and data required for the operation of the system 600 are also stored in the RAM 603. The CPU 601, the ROM 602 and the RAM 603 are connected with each other through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, etc.; an output portion 607 including a cathode ray tube (CRT), a liquid crystal display (LCD), etc., and a loudspeaker, etc.; a storage portion 608 including a hard disk, etc.; and a communication portion 609 including a network interface card such as a LAN card or a modem. The communication portion 609 performs communication processing via a network such as the Internet. A driver 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the driver 610 as needed, so that a computer program read from it can be installed into the storage portion 608 as needed.
In particular, according to the embodiments of the present disclosure, the process described above with reference to the flow chart may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product, which includes a computer program tangibly embodied on a machine-readable medium, and the computer program includes program code for executing the method shown in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication portion 609 and/or installed from the removable medium 611. When the computer program is executed by the central processing unit (CPU) 601, the above functions defined in the method of the application are performed.
The flow charts and block diagrams in the accompanying drawings illustrate the possible architectures, functions and operations of the systems, methods and computer program products according to the various embodiments of the application. In this regard, each box in a flow chart or block diagram can represent a module, a program segment, or a part of code, and the module, program segment, or part of code includes one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions marked in the boxes may also occur in an order different from that marked in the accompanying drawings. For example, two boxes represented in succession can actually be executed substantially in parallel, and they can also sometimes be executed in the opposite order, depending on the functions involved. It should also be noted that each box in the block diagrams and/or flow charts, and combinations of boxes in the block diagrams and/or flow charts, can be implemented by a dedicated hardware-based system that executes the specified functions or operations, or can be implemented by a combination of dedicated hardware and computer instructions.
The units involved in the embodiments of the application can be implemented by means of software or by means of hardware. The described units can also be set in a processor; for example, it can be described as: a processor includes a preprocessing unit, a theme determination unit, a statistics unit, a sampling unit and a probability determination unit. The names of these units do not, under certain circumstances, constitute a limitation on the units themselves; for example, the preprocessing unit can also be described as "a unit that preprocesses a text to be identified to obtain a keyword set".
As another aspect, the application also provides a non-volatile computer storage medium. The non-volatile computer storage medium can be the non-volatile computer storage medium included in the device described in the above embodiments, or it can exist independently without being assembled into a terminal. The above non-volatile computer storage medium stores one or more programs, and when the one or more programs are executed by a device, the device: preprocesses a text to be identified to obtain a keyword set; randomly determines the theme to which each keyword in the keyword set belongs; counts the number of keywords contained in each theme; repeats the following steps for each keyword in the keyword set until the result converges or a preset number of iterations is reached, wherein convergence of the result means that the variation of the keyword distribution of each theme obtained by repeating the following steps is less than a predetermined threshold: subtracting one from the number of keywords contained in the theme to which the keyword belongs; sampling according to the probability distribution obtained by training in advance to obtain the theme to which the keyword belongs; adding one to the number of keywords contained in the theme obtained by sampling; and determines the probability that each theme appears in the text to be identified according to the number of keywords contained in each theme and the number of keywords in the keyword set.
The above description is only a preferred embodiment of the application and an explanation of the applied technical principles. Those skilled in the art should understand that the scope of the invention involved in the application is not limited to technical solutions formed by the particular combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the inventive concept, for example, technical solutions formed by mutually replacing the above features with (but not limited to) technical features with similar functions disclosed in the application.

Claims (10)

  1. A method for identifying a text theme, characterized in that the method includes:
    preprocessing a text to be identified to obtain a keyword set;
    randomly determining the theme to which each keyword in the keyword set belongs;
    counting the number of keywords contained in each theme;
    repeating the following steps for each keyword in the keyword set until the result converges or a preset number of iterations is reached, wherein the convergence of the result includes that the variation of the keyword distribution of each theme obtained by repeating the following steps is less than a predetermined threshold: subtracting one from the number of keywords contained in the theme to which the keyword belongs; sampling according to a probability distribution obtained by training in advance to obtain the theme to which the keyword belongs; and adding one to the number of keywords contained in the theme obtained by sampling;
    determining the probability that each theme appears in the text to be identified according to the number of keywords contained in each theme and the number of keywords in the keyword set.
  2. The method according to claim 1, characterized in that the method further includes a step of training to obtain a probability distribution, wherein the step of training to obtain a probability distribution includes:
    obtaining a history text set, wherein the history text set includes at least one history text subset, and the history text subsets are obtained by dividing the history text set according to the generation time of the texts in the history text set and the quantity of texts;
    obtaining, through text theme generation model training, the keyword distribution of each theme in the history text subset where the text to be identified is located.
  3. The method according to claim 2, characterized in that the obtaining, through text theme generation model training, the keyword distribution of each theme in the history text subset where the text to be identified is located includes:
    obtaining, through text theme generation model training, the keyword distribution of each theme in the history text subset with the earliest text generation time;
    based on the keyword distribution of each theme in the history text subset obtained by training, determining, in turn according to the generation time of the texts in the history text subsets, the theme distribution of texts and the keyword distribution of each theme in the other subsets except the subset with the earliest text generation time.
  4. The method according to claim 3, characterized in that the obtaining, through text theme generation model training, the keyword distribution of each theme in the subset with the earliest text generation time includes:
    for each text in the subset with the earliest text generation time, performing the following steps until the text is generated:
    for each theme, sampling a multinomial distribution from a first Dirichlet distribution as the distribution of the theme over keywords;
    randomly sampling a value from a discrete probability distribution as the length of the text;
    sampling a multinomial distribution from a second Dirichlet distribution as the distribution of the text over themes;
    for each keyword in the text, sampling a theme from the distribution of the text over themes, and then sampling a keyword from the distribution over keywords of the sampled theme.
  5. The method according to any one of claims 2-4, characterized in that the method further includes:
    calculating the term frequency-inverse document frequency value of each keyword in the keyword set;
    in response to the ratio of the term frequency-inverse document frequency value to the number of subsets in which the keyword appears being less than a predetermined threshold, adding the keyword to a stop-word list; and
    the preprocessing is word segmentation and deletion of stop words according to the stop-word list.
  6. A device for identifying a text theme, characterized in that the device includes:
    a preprocessing unit, configured to preprocess a text to be identified to obtain a keyword set;
    a theme determination unit, configured to randomly determine the theme to which each keyword in the keyword set belongs;
    a statistics unit, configured to count the number of keywords contained in each theme;
    a sampling unit, configured to repeat the following steps for each keyword in the keyword set until the result converges or a preset number of iterations is reached, wherein the convergence of the result includes that the variation of the keyword distribution of each theme obtained by repeating the following steps is less than a predetermined threshold: subtracting one from the number of keywords contained in the theme to which the keyword belongs; sampling according to a probability distribution obtained by training in advance to obtain the theme to which the keyword belongs; and adding one to the number of keywords contained in the theme obtained by sampling;
    a probability determination unit, configured to determine the probability that each theme appears in the text to be identified according to the number of keywords contained in each theme and the number of keywords in the keyword set.
  7. The device according to claim 6, characterized in that the device further includes a training unit, and the training unit includes:
    an obtaining subunit, configured to obtain a history text set, wherein the history text set includes at least one history text subset, and the history text subsets are obtained by dividing the history text set according to the generation time of the texts in the history text set and the quantity of texts;
    a training subunit, configured to obtain, through text theme generation model training, the keyword distribution of each theme in the history text subset where the text to be identified is located.
  8. The device according to claim 7, characterized in that the training subunit is further configured to:
    obtain, through text theme generation model training, the keyword distribution of each theme in the history text subset with the earliest text generation time;
    based on the keyword distribution of each theme in the history text subset obtained by training, determine, in turn according to the generation time of the texts in the history text subsets, the theme distribution of texts and the keyword distribution of each theme in the other subsets except the subset with the earliest text generation time.
  9. The device according to claim 8, characterized in that the training subunit is further configured to:
    for each text in the subset with the earliest text generation time, perform the following steps until the text is generated:
    for each theme, sampling a multinomial distribution from a first Dirichlet distribution as the distribution of the theme over keywords;
    randomly sampling a value from a discrete probability distribution as the length of the text;
    sampling a multinomial distribution from a second Dirichlet distribution as the distribution of the text over themes;
    for each keyword in the text, sampling a theme from the distribution of the text over themes, and then sampling a keyword from the distribution over keywords of the sampled theme.
  10. The device according to any one of claims 7-9, characterized in that the preprocessing is word segmentation and deletion of stop words according to a stop-word list; and
    the device further includes:
    a calculation unit, configured to calculate the term frequency-inverse document frequency value of each keyword in the keyword set;
    an adding unit, configured to add the keyword to the stop-word list in response to the ratio of the term frequency-inverse document frequency value to the number of subsets in which the keyword appears being less than a predetermined threshold.
CN201611051277.2A 2016-11-23 2016-11-23 For identifying the method and apparatus of text subject Pending CN108090042A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611051277.2A CN108090042A (en) 2016-11-23 2016-11-23 For identifying the method and apparatus of text subject

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611051277.2A CN108090042A (en) 2016-11-23 2016-11-23 Method and apparatus for identifying text subject

Publications (1)

Publication Number Publication Date
CN108090042A true CN108090042A (en) 2018-05-29

Family

ID=62171770

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611051277.2A Pending CN108090042A (en) 2016-11-23 2016-11-23 Method and apparatus for identifying text subject

Country Status (1)

Country Link
CN (1) CN108090042A (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103345474A (en) * 2013-07-25 2013-10-09 苏州大学 Online tracking method for document theme
CN103870563A (en) * 2014-03-07 2014-06-18 北京奇虎科技有限公司 Method and device for determining subject distribution of given text
US20150317303A1 (en) * 2014-04-30 2015-11-05 Linkedin Corporation Topic mining using natural language processing techniques
CN105786791A (en) * 2014-12-23 2016-07-20 深圳市腾讯计算机系统有限公司 Data topic acquisition method and apparatus
CN105975499A (en) * 2016-04-27 2016-09-28 深圳大学 Text subject detection method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CUI, Kai: "Research and Implementation of Topic Evolution Based on LDA", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299364A (en) * 2018-09-26 2019-02-01 贵州大学 A short-text dynamic clustering method with new-topic tendency
CN110083774A (en) * 2019-05-10 2019-08-02 腾讯科技(深圳)有限公司 Method and device for determining application recommendation list, computer equipment and storage medium
CN110083774B (en) * 2019-05-10 2023-11-03 腾讯科技(深圳)有限公司 Method and device for determining application recommendation list, computer equipment and storage medium
CN111222319B (en) * 2019-11-14 2021-09-14 电子科技大学 Document information extraction method based on HDP model
CN111222319A (en) * 2019-11-14 2020-06-02 电子科技大学 Document information extraction method based on novel HDP model
CN111339296A (en) * 2020-02-20 2020-06-26 电子科技大学 Document theme extraction method based on introduction of adaptive window in HDP model
CN111339296B (en) * 2020-02-20 2023-03-28 电子科技大学 Document theme extraction method based on introduction of adaptive window in HDP model
CN111221880B (en) * 2020-04-23 2021-01-22 北京瑞莱智慧科技有限公司 Feature combination method, device, medium, and electronic apparatus
CN111221880A (en) * 2020-04-23 2020-06-02 北京瑞莱智慧科技有限公司 Feature combination method, device, medium, and electronic apparatus
CN112668306A (en) * 2020-12-22 2021-04-16 延边大学 Language processing method and system based on statement discrimination recognition and reinforcement learning action design
CN112668306B (en) * 2020-12-22 2021-07-27 延边大学 Language processing method and system based on statement discrimination recognition and reinforcement learning action design
CN113326385A (en) * 2021-08-04 2021-08-31 北京达佳互联信息技术有限公司 Target multimedia resource acquisition method and device, electronic equipment and storage medium
CN113326385B (en) * 2021-08-04 2021-12-07 北京达佳互联信息技术有限公司 Target multimedia resource acquisition method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN108090042A (en) Method and apparatus for identifying text subject
CN108804512B (en) Text classification model generation device and method and computer readable storage medium
US20160140106A1 (en) Phrase-based data classification system
Zhu et al. Popularity modeling for mobile apps: A sequential approach
CN110390408B (en) Transaction object prediction method and device
US20140129510A1 (en) Parameter Inference Method, Calculation Apparatus, and System Based on Latent Dirichlet Allocation Model
CN106845999A (en) Risk user identification method, device and server
CN109740167B (en) Method and apparatus for generating information
CN107871166A (en) For the characteristic processing method and characteristics processing system of machine learning
CN110287409B (en) Webpage type identification method and device
CN107818491A (en) Electronic device, product recommendation method based on user's Internet data, and storage medium
CN108021651A (en) Network public opinion risk assessment method and device
WO2019085332A1 (en) Financial data analysis method, application server, and computer readable storage medium
Antonyuk et al. Medical news aggregation and ranking of taking into account the user needs
CN110858226A (en) Conversation management method and device
CN106803092B (en) Method and device for determining standard problem data
Pathak et al. Adaptive framework for deep learning based dynamic and temporal topic modeling from big data
CN109190123A (en) Method and apparatus for output information
CN111221881A (en) User characteristic data synthesis method and device and electronic equipment
CN114117048A (en) Text classification method and device, computer equipment and storage medium
CN111930944B (en) File label classification method and device
CN113763031A (en) Commodity recommendation method and device, electronic equipment and storage medium
JP2014160345A (en) Browsing action predicting device, browsing action learning device, browsing action predicting method, and browsing action learning method and program
CN105808744A (en) Information prediction method and device
CN111859238A (en) Method and device for predicting data change frequency based on model and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 1256148

Country of ref document: HK

RJ01 Rejection of invention patent application after publication

Application publication date: 20180529