CN108090042A - Method and apparatus for identifying text topics - Google Patents
Method and apparatus for identifying text topics
- Publication number
- CN108090042A CN108090042A CN201611051277.2A CN201611051277A CN108090042A CN 108090042 A CN108090042 A CN 108090042A CN 201611051277 A CN201611051277 A CN 201611051277A CN 108090042 A CN108090042 A CN 108090042A
- Authority
- CN
- China
- Prior art keywords
- keyword
- text
- topic
- distribution
- subset
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
This application discloses a method and apparatus for identifying text topics. One specific embodiment of the method includes: preprocessing a text to be identified to obtain a keyword set; randomly determining the topic to which each keyword in the keyword set belongs; counting the number of keywords contained in each topic; repeating the following steps for each keyword in the keyword set until the result converges or a preset number of iterations is reached: decrementing by one the count of keywords contained in the topic to which the keyword belongs; sampling a new topic for the keyword according to a probability distribution obtained by prior training, and incrementing by one the count of keywords contained in the sampled topic; and determining the probability that each topic appears in the text to be identified according to the number of keywords contained in each topic and the number of keywords in the keyword set. This embodiment improves the accuracy of text topic identification.
Description
Technical field
This application relates to the field of computer technology, in particular to the field of semantic analysis, and more particularly to a method and apparatus for identifying text topics.
Background technology
With the rapid development of the Internet, more and more users share information over the network. Because this information is large and heterogeneous, how to use it for decision analysis has become an important problem. For example, in the field of e-commerce, more and more consumers shop online and evaluate their shopping experience, so product review data grows explosively, and these review data can provide rich decision references for merchants and consumers. It is therefore necessary to start from the textual features of review data, identify the usefulness of reviews quickly and efficiently, and discover consumers' opinions and attitudes for market sentiment analysis and prediction; text feature extraction is the key step of such text mining.
At present, text features are typically selected with statistical or information-theoretic methods, which pick out the keywords most correlated with category labels as the feature set. Most of these methods are built on the bag-of-words (BOW) model: keywords are first extracted from the text, and then an evaluation strategy (such as term frequency-inverse document frequency (TF-IDF), information gain, or mutual information) selects the most valuable keywords as the feature vector. However, the feature vectors obtained in this way are high-dimensional; for short texts such as reviews, the vector generated by the BOW model is very sparse, which increases the difficulty of subsequent text processing. In addition, short texts such as reviews often lack a clearly defined topic, which also degrades the BOW model's performance. Therefore, for feature extraction on short texts such as reviews, such methods produce ineffective feature vectors and identify text topics with poor accuracy.
Summary of the invention
The purpose of this application is to propose an improved method and apparatus for identifying text topics, so as to solve the technical problems mentioned in the background section above.
In a first aspect, this application provides a method for identifying text topics, the method including: preprocessing a text to be identified to obtain a keyword set; randomly determining the topic to which each keyword in the keyword set belongs; counting the number of keywords contained in each topic; repeating the following steps for each keyword in the keyword set until the result converges or a preset number of iterations is reached, where convergence means that the variation of the keyword distribution of each topic obtained by repeating the following steps is less than a predetermined threshold: decrementing by one the count of keywords contained in the topic to which the keyword belongs; sampling a new topic for the keyword according to a probability distribution obtained by prior training, and incrementing by one the count of keywords contained in the sampled topic; and determining the probability that each topic appears in the text to be identified according to the number of keywords contained in each topic and the number of keywords in the keyword set.
In some embodiments, the method further includes a step of training to obtain the probability distribution, where the step includes: obtaining a historical text collection, the historical text collection including at least one historical text subset, each subset obtained by dividing the historical text collection according to the generation time and the number of its texts; and obtaining, by training a text topic generation model, the keyword distribution of each topic in the historical text subset where the text to be identified resides.
In some embodiments, obtaining the keyword distribution of each topic in the historical text subset by training the text topic generation model includes: obtaining, by training the text topic generation model, the keyword distribution of each topic in the subset with the earliest text generation time; and, based on the trained keyword distribution of each topic in that subset and following the generation time of the texts, determining in turn the topic distribution of the texts and the keyword distribution of each topic in the other subsets.
In some embodiments, obtaining the keyword distribution of each topic in the subset with the earliest text generation time by training the text topic generation model includes performing the following steps for each text in that subset until the text is generated: for each topic, sampling a multinomial distribution from a first Dirichlet distribution as the topic's distribution over keywords; randomly sampling a value from a discrete probability distribution as the length of the text; sampling a multinomial distribution from a second Dirichlet distribution as the text's distribution over topics; and, for each keyword in the text, sampling a topic from the text's distribution over topics, and then sampling a keyword from the sampled topic's distribution over keywords.
In some embodiments, the method further includes: computing the term frequency-inverse document frequency (TF-IDF) value of each keyword in the keyword set; and, in response to the ratio of the TF-IDF value to the number of subsets in which the keyword appears being less than a predetermined threshold, adding the keyword to a stop-word list; and the preprocessing consists of word segmentation and deleting stop words according to the stop-word list.
In a second aspect, this application provides an apparatus for identifying text topics, the apparatus including: a preprocessing unit configured to preprocess a text to be identified to obtain a keyword set; a topic determination unit configured to randomly determine the topic to which each keyword in the keyword set belongs; a counting unit configured to count the number of keywords contained in each topic; a sampling unit configured to repeat the following steps for each keyword in the keyword set until the result converges or a preset number of iterations is reached, where convergence means that the variation of the keyword distribution of each topic obtained by repeating the following steps is less than a predetermined threshold: decrementing by one the count of keywords contained in the topic to which the keyword belongs; sampling a new topic for the keyword according to a probability distribution obtained by prior training, and incrementing by one the count of keywords contained in the sampled topic; and a probability determination unit configured to determine the probability that each topic appears in the text to be identified according to the number of keywords contained in each topic and the number of keywords in the keyword set.
In some embodiments, the apparatus further includes a training unit, the training unit including: an acquisition subunit configured to obtain a historical text collection, the historical text collection including at least one historical text subset, each subset obtained by dividing the historical text collection according to the generation time and the number of its texts; and a training subunit configured to obtain, by training a text topic generation model, the keyword distribution of each topic in the historical text subset where the text to be identified resides.
In some embodiments, the training subunit is further configured to: obtain, by training the text topic generation model, the keyword distribution of each topic in the historical text subset with the earliest text generation time; and, based on the trained keyword distribution of each topic in that subset and following the generation time of the texts, determine in turn the topic distribution of the texts and the keyword distribution of each topic in the other subsets.
In some embodiments, the training subunit is further configured to perform the following steps for each text in the subset with the earliest text generation time, until the text is generated: for each topic, sampling a multinomial distribution from a first Dirichlet distribution as the topic's distribution over keywords; randomly sampling a value from a discrete probability distribution as the length of the text; sampling a multinomial distribution from a second Dirichlet distribution as the text's distribution over topics; and, for each keyword in the text, sampling a topic from the text's distribution over topics and then sampling a keyword from the sampled topic's distribution over keywords.
In some embodiments, the preprocessing consists of word segmentation and deleting stop words according to a stop-word list; and the apparatus further includes: a computation unit configured to compute the TF-IDF value of each keyword in the keyword set; and an adding unit configured to add the keyword to the stop-word list in response to the ratio of the TF-IDF value to the number of subsets in which the keyword appears being less than a predetermined threshold.
The method and apparatus for identifying text topics provided by this application preprocess a text to be identified to obtain a keyword set, randomly determine the topic to which each keyword in the keyword set belongs, count the number of keywords contained in each topic, and then repeat the following steps for each keyword in the keyword set until the result converges or a preset number of iterations is reached: decrementing by one the count of keywords contained in the topic to which the keyword belongs, sampling a new topic for the keyword according to the probability distribution obtained by prior training, and incrementing by one the count of keywords contained in the sampled topic; finally, the probability that each topic appears in the text to be identified is determined according to the number of keywords contained in each topic and the number of keywords in the keyword set, which improves the accuracy of text topic identification.
Description of the drawings
Other features, objects and advantages of this application will become more apparent upon reading the detailed description of non-limiting embodiments made with reference to the following drawings:
Fig. 1 is an exemplary system architecture diagram to which this application can be applied;
Fig. 2 is a flowchart of one embodiment of the method for identifying text topics according to this application;
Fig. 3 is a schematic diagram of an application scenario of the method for identifying text topics according to this application;
Fig. 4 is a flowchart of another embodiment of the method for identifying text topics according to this application;
Fig. 5 is a structural diagram of one embodiment of the apparatus for identifying text topics according to this application;
Fig. 6 is a structural diagram of a computer system suitable for implementing the server of the embodiments of this application.
Detailed description of the embodiments
This application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the related invention and are not a limitation of the invention. It should also be noted that, for convenience of description, only the parts related to the invention are shown in the drawings.
It should be noted that, in the absence of conflict, the embodiments in this application and the features in the embodiments may be combined with each other. This application is described in detail below with reference to the drawings and in conjunction with the embodiments.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the method or apparatus for identifying text topics of this application may be applied.
As shown in Fig. 1, the system architecture 100 may include terminal devices 101, 102 and 103, a network 104 and a server 105. The network 104 is the medium that provides communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless transmission links or fiber optic cables.
A user may use the terminal devices 101, 102, 103 to interact with the server 105 over the network 104 to receive or send messages. Various applications may be installed on the terminal devices 101, 102, 103, such as e-commerce applications, instant messaging applications, browser applications, search applications and word processing applications.
The terminal devices 101, 102, 103 may be various electronic devices, including but not limited to smart phones, tablet computers, e-book readers, laptop computers and desktop computers.
The server 105 may be a background server that provides support for the applications installed on the terminal devices 101, 102, 103. For example, the server 105 may obtain a text to be identified uploaded by a terminal device 101, 102 or 103 and, through the preprocessing and sampling steps, determine the probability that each topic appears in the text to be identified, thereby identifying the topic of the text; it may further classify and filter texts by topic. The server 105 may likewise identify the topics of texts stored on other servers and, in response to a request from a terminal device 101, 102 or 103 for texts of a certain topic, send the texts of that topic to the terminal device.
It should be noted that the method for identifying text topics provided by the embodiments of this application is generally performed by the server 105, and accordingly the apparatus for identifying text topics is generally disposed in the server 105.
It should be understood that the numbers of terminal devices, networks and servers in Fig. 1 are merely illustrative. There may be any number of terminal devices, networks and servers according to implementation needs.
Referring to Fig. 2, there is shown a flow 200 of one embodiment of the method for identifying text topics according to this application. It should be noted that the method for identifying text topics provided by this embodiment is generally performed by the server 105 in Fig. 1. The method includes the following steps:
Step 201: preprocess a text to be identified to obtain a keyword set.
In this embodiment, the electronic device on which the method for identifying text topics runs (for example, the server shown in Fig. 1) may obtain the text to be identified from a terminal or another server through a wired or wireless connection, and preprocess the text to be identified to obtain a keyword set. The text to be identified may be a user's review text on a shopping website or a text posted by a user on a social network. The preprocessing may include operations such as word segmentation and stop-word deletion. The keyword set may be presented as a word sequence. Each keyword in the keyword set can be regarded as obtained through the process of "selecting a topic with some probability, and then selecting a word from that topic with some probability".
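The preprocessing described in step 201 can be sketched as follows. The stop-word list and the whitespace tokenizer below are illustrative assumptions; real review text (especially Chinese) would need a proper word segmenter rather than a split on spaces.

```python
# Minimal sketch of step 201: tokenize and drop stop words.
STOP_WORDS = {"the", "a", "is", "and", "of"}  # assumed stop-word list

def preprocess(text):
    """Return the keyword sequence for one text to be identified."""
    tokens = text.lower().split()  # stand-in for real word segmentation
    return [t for t in tokens if t not in STOP_WORDS]

keywords = preprocess("The battery life of the phone is great and the screen is sharp")
print(keywords)  # ['battery', 'life', 'phone', 'great', 'screen', 'sharp']
```

The resulting keyword sequence is the input to the topic-assignment and sampling steps that follow.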
Step 202: randomly determine the topic to which each keyword in the keyword set belongs.
In this embodiment, based on the keyword set obtained in step 201, the electronic device may first randomly determine the topic to which each keyword in the keyword set belongs, which amounts to sampling topics for the text and assigning an initial value to the topic of each keyword. The number of topics may be preset; a guiding principle is that the smaller the similarity between topics, the better the chosen number of topics.
Step 203: count the number of keywords contained in each topic.
In this embodiment, based on the topic to which each keyword belongs as determined in step 202, the electronic device may count the number of keywords contained in each topic.
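Steps 202 and 203 together can be sketched as below; the function name and the fixed seed are illustrative, and K (the preset number of topics) is a parameter as the description above states.

```python
# Sketch of steps 202-203: random topic initialization and per-topic counts.
import random
from collections import Counter

def init_topics(keywords, K, seed=0):
    rng = random.Random(seed)
    assignments = [rng.randrange(K) for _ in keywords]  # step 202: random topic per keyword
    topic_counts = Counter(assignments)                 # step 203: keywords per topic
    return assignments, topic_counts

assignments, topic_counts = init_topics(["battery", "life", "screen", "price"], K=3)
print(sum(topic_counts.values()))  # 4: every keyword is counted under exactly one topic
```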
Step 204: repeat the following steps for each keyword in the keyword set until the result converges or a preset number of iterations is reached: decrement by one the count of keywords contained in the topic to which the keyword belongs; sample a new topic for the keyword according to a probability distribution obtained by prior training, and increment by one the count of keywords contained in the sampled topic.
In this embodiment, based on the per-topic keyword counts obtained in step 203, the electronic device may repeat the following steps for each keyword in the keyword set until the result converges or a preset number of iterations is reached: decrement by one the count of keywords contained in the topic to which the keyword belongs; sample a new topic for the keyword according to the probability distribution obtained by prior training, and increment by one the count of keywords contained in the sampled topic. Here, convergence means that the variation of the keyword distribution of each topic obtained by repeating these steps is less than a predetermined threshold; the variation of a topic's keyword distribution may be the variation of the number or of the kinds of keywords the topic contains. This step may perform approximate inference by a random sampling method. As an example, suppose the topic distribution of a text and the keyword distribution of each topic are multinomial distributions generated by sampling from Dirichlet distributions with parameters α and β respectively; then sampling may be carried out according to the following formula:
p(z_i = k | m, α, β, t) ∝ ((n_{m,k} + α) / (n_m + K·α)) · ((n_{k,v} + β) / (n_k + V_t·β))
where p(z_i = k | m, α, β, t) denotes the probability that the topic of the i-th word in the text to be identified is topic k; t denotes the time slice in which the text resides (since the historical text subsets are divided by time, the time slice of a text can be identified from the subset it belongs to); θ_{m,k} denotes the probability that topic k appears in text m; φ_{k,v} denotes the probability that topic k contains the word z_i; n_m denotes the total number of words in the text m to be identified; V_t denotes the total number of words in the text collection of time slice t; n_{m,k} denotes the number of times topic k appears in text m; n_{k,v} denotes the number of times the word v appears in topic k; n_k denotes the total number of words contained in topic k; and the values of the parameters α and β may be α = 50/K and β = 0.01, where K is the preset total number of topics.
The sampling may use the Gibbs sampling algorithm. Gibbs sampling is one of the MCMC (Markov Chain Monte Carlo) algorithms and is used to construct random samples of a multivariate probability distribution, for example to construct the joint distribution of two or more variables, to compute integrals, and to compute expectations. Gibbs sampling needs a process to converge: only the samples obtained once the chain reaches its equilibrium state are samples of the target distribution, so the samples generated before convergence must all be discarded. There is not yet a mature method for deciding whether a chain has reached equilibrium; a common approach is to check whether the state is stable (for example, by plotting a statistic over time: if it varies little over a longer period, the chain has probably equilibrated). Methods such as graphical diagnostics or Monte Carlo error may be used to check convergence; in engineering practice, the number of iterations needed for Gibbs sampling to reach equilibrium is more often specified from experience and from observation of the data.
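The step-204 loop can be sketched as a collapsed Gibbs sampling pass. All names and the toy counts below are illustrative; the corpus-wide counts `n_kv` and `n_k` are assumed to come from the prior training and are held fixed here, whereas a full implementation would maintain them too.

```python
# Sketch of step 204: resample each keyword's topic, keeping counts consistent.
import random

def resample(keywords, z, n_mk, n_kv, n_k, K, V, alpha, beta, iters, seed=0):
    """z[i]: current topic of keyword i; n_mk[k]: keywords of this text in topic k."""
    rng = random.Random(seed)
    for _ in range(iters):
        for i, v in enumerate(keywords):
            n_mk[z[i]] -= 1                       # remove keyword from its current topic
            # weight for topic k: (n_mk[k] + α) · (n_kv[k][v] + β) / (n_k[k] + V·β)
            w = [(n_mk[k] + alpha) * (n_kv[k].get(v, 0) + beta) / (n_k[k] + V * beta)
                 for k in range(K)]
            z[i] = rng.choices(range(K), weights=w)[0]
            n_mk[z[i]] += 1                       # add keyword to the sampled topic
    return z, n_mk

# Toy run: 2 topics, tiny pretrained counts (all illustrative).
kw = ["battery", "screen", "battery"]
z, n_mk = resample(kw, z=[0, 0, 1], n_mk=[2, 1],
                   n_kv=[{"battery": 3, "screen": 1}, {"battery": 1, "screen": 4}],
                   n_k=[4, 5], K=2, V=5, alpha=0.5, beta=0.1, iters=10)
print(sum(n_mk) == len(kw))  # True: counts stay consistent with assignments
```

Convergence checking is omitted; per the description, a fixed iteration count chosen from experience is the common practice.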
In some optional implementations of this embodiment, the method further includes a step of training to obtain the probability distribution, where the step includes: obtaining a historical text collection, the historical text collection including at least one historical text subset, each subset obtained by dividing the historical text collection according to the generation time and the number of its texts; and obtaining, by training a text topic generation model, the keyword distribution of each topic in the historical text subset where the text to be identified resides.
In some optional implementations of this embodiment, obtaining the keyword distribution of each topic in the historical text subset by training the text topic generation model includes: obtaining, by training the text topic generation model, the keyword distribution of each topic in the historical text subset with the earliest text generation time; and, based on the trained keyword distribution of each topic in that subset and following the generation time of the texts, determining in turn the topic distribution of the texts and the keyword distribution of each topic in the other subsets.
In some optional implementations of this embodiment, obtaining the keyword distribution of each topic in the subset with the earliest text generation time by training the text topic generation model includes performing the following steps for each text in that subset, until the text is generated: for each topic, sampling a multinomial distribution from a first Dirichlet distribution as the topic's distribution over keywords; randomly sampling a value from a discrete probability distribution as the length of the text; sampling a multinomial distribution from a second Dirichlet distribution as the text's distribution over topics; and, for each keyword in the text, sampling a topic from the text's distribution over topics and then sampling a keyword from the sampled topic's distribution over keywords.
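The generative process above (an LDA-style model) can be sketched as follows. The vocabulary, the number of topics, and the Poisson prior on document length are assumptions; the patent only says the length comes from "a discrete probability distribution".

```python
# Sketch of the generative process: two Dirichlet draws, then per-word sampling.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["battery", "screen", "price", "ship", "size"]
K, alpha, beta = 2, 0.5, 0.1

phi = rng.dirichlet([beta] * len(vocab), size=K)  # first Dirichlet: topic-keyword dists
n = 1 + rng.poisson(5)                            # discrete length distribution (assumed)
theta = rng.dirichlet([alpha] * K)                # second Dirichlet: text-topic dist
doc = []
for _ in range(n):
    k = rng.choice(K, p=theta)                    # sample a topic for this position
    doc.append(vocab[rng.choice(len(vocab), p=phi[k])])  # then a keyword from topic k
print(len(doc) == n)  # True
```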
In some optional implementations of this embodiment, the method further includes: computing the term frequency-inverse document frequency (TF-IDF) value of each keyword in the keyword set; and, in response to the ratio of the TF-IDF value to the number of historical text subsets in which the keyword appears being less than a predetermined threshold, adding the keyword to the stop-word list; and the preprocessing consists of word segmentation and deleting stop words according to the stop-word list. The predetermined threshold may be set according to actual needs, for example to 0.01, and the TF-IDF value of each keyword may be computed by the TF-IDF algorithm. TF (term frequency) equals the number of occurrences of the word divided by the total number of words in the document. IDF (inverse document frequency) is a measure of the general importance of a word: the IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing the word and taking the logarithm of the quotient. The IDF computation may be based on a corpus obtained in advance or on the entire historical text collection; the corresponding total number of documents is then the number of documents in the corpus or the number of texts in the historical text collection. As an example, for a keyword w, if T_fidf(w) / N_w < ψ, then the keyword w is added to the stop-word list, where N_w denotes the number of historical text subsets in which the word w appears, T_fidf(w) denotes the TF-IDF value of the word w, and ψ is the predetermined threshold. The word frequency of each word may likewise be counted continuously, and a word whose probability of occurring in texts exceeds a certain threshold may be added to the stop-word dictionary.
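The stop-word rule above can be sketched as follows. The toy corpus, the helper names, and the natural-log convention for IDF are assumptions made for illustration; the patent does not fix a log base.

```python
# Sketch of the TF-IDF-based stop-word rule: add w when T_fidf(w) / N_w < ψ.
import math

def tfidf(word, doc, docs):
    tf = doc.count(word) / len(doc)          # occurrences / total words in the document
    df = sum(1 for d in docs if word in d)   # number of documents containing the word
    idf = math.log(len(docs) / df)           # log(total documents / df)
    return tf * idf

def is_stop_word(word, doc, docs, n_subsets_with_word, psi=0.01):
    """n_subsets_with_word plays the role of N_w in the description above."""
    return tfidf(word, doc, docs) / n_subsets_with_word < psi
```

A word that appears in many historical text subsets (large N_w) but carries little weight (small TF-IDF) is thus filtered out during preprocessing.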
Step 205: determine the probability that each topic appears in the text to be identified according to the number of keywords contained in each topic and the number of keywords in the keyword set.
In this embodiment, the electronic device may determine the probability that each topic appears in the text to be identified according to the number of keywords contained in each topic obtained in step 204 and the number of keywords in the keyword set. The proportion of keywords in the keyword set belonging to each topic may be taken as the probability that the topic appears in the text to be identified, or a more precise statistical method may be used. As an example, suppose the topic distribution of the text is a multinomial distribution generated by sampling from a Dirichlet distribution with parameter α; then the probability P(z_i = k | m) that topic k appears in text m may be computed according to the following formula:
P(z_i = k | m) = θ_{m,k} = (n_{m,k} + α) / (n_m + K·α)
where θ_{m,k} denotes the probability that topic k appears in text m; n_m denotes the total number of words in the text m to be identified; V_t denotes the total number of words in the text collection of time slice t; n_{m,k} denotes the number of times topic k appears in text m; n_{k,v} denotes the number of times the word v appears in topic k; and n_k denotes the total number of words contained in topic k.
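Step 205 can be sketched directly from the smoothed estimate above; the toy counts are illustrative.

```python
# Sketch of step 205: θ_{m,k} = (n_mk[k] + α) / (n_m + K·α) for one text.
def topic_probabilities(n_mk, alpha):
    K = len(n_mk)
    n_m = sum(n_mk)  # total keywords in the text
    return [(n_mk[k] + alpha) / (n_m + K * alpha) for k in range(K)]

probs = topic_probabilities([6, 3, 1], alpha=0.5)
print(abs(sum(probs) - 1.0) < 1e-9)  # True: the estimates form a distribution
```

With α > 0 every topic receives nonzero probability, so topics not sampled for any keyword of a short text are still smoothed rather than ruled out.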
With continued reference to Fig. 3, Fig. 3 is a schematic diagram of an application scenario of the method for identifying text topics according to this embodiment. In the application scenario of Fig. 3, the framework 300 for identifying text topics mainly includes two parts: training to obtain the probability distribution, and identifying the topics of reviews in real time. The step of training to obtain the probability distribution includes: Step 301, in response to the number of untrained reviews in the historical review collection reaching a preset value, obtaining the untrained reviews as the review set to be trained, where reviews to be identified in real time are also added to the historical review collection. Step 302, obtaining the keyword distribution of each topic in the review set to be trained by means of the text topic generation model. Step 303, determining the keyword distribution of each topic needed for real-time identification from the keyword distribution of each topic in the review sets whose review times are earlier than all the reviews in the review set to be trained, together with the keyword distribution of each topic in the review set to be trained. As an example, when the number of untrained reviews in the historical review collection reaches the preset value for the first time, the keyword distribution φ₁ of each topic is obtained by the text topic generation model for the first time; when the number of untrained reviews reaches the preset value for the second time, the keyword distribution φ₂ of each topic is obtained by the text topic generation model; combining φ₁ and φ₂ yields the current keyword distribution of each topic for the historical review collection, for example as π·φ₁ + (1 − π)·φ₂, where π may take a value in the range 0 to 1; and so on, smoothing the distribution obtained from each round of training.
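The step-303 smoothing can be sketched as a convex combination of the previous and newly trained topic-keyword distributions. The combination with π ∈ [0, 1] is the reading suggested by the description; the exact weighting in the patent may differ.

```python
# Sketch of step 303: blend the previous round's distribution with the new one.
import numpy as np

def smooth(phi_prev, phi_new, pi=0.5):
    """Return π·phi_prev + (1 − π)·phi_new; one row per topic, one column per keyword."""
    return pi * np.asarray(phi_prev) + (1 - pi) * np.asarray(phi_new)

phi = smooth([[0.8, 0.2]], [[0.4, 0.6]], pi=0.25)
print(phi)  # [[0.5 0.5]] : rows stay valid probability distributions
```

Because each input row sums to 1, each smoothed row also sums to 1, so each round of training only nudges the distribution rather than replacing it.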
The step of identifying review topics in real time includes: Step 304, preprocessing to obtain a keyword set. Step 305, randomly determining the topic to which each keyword in the keyword set belongs. Step 306, performing, for each keyword in the keyword set: step 3061, decrementing by one the count of keywords contained in the topic to which the keyword belongs; step 3062, redetermining the topic to which the keyword belongs by sampling, where the sampling may be carried out according to the probability distribution trained in advance (since a review to be identified in real time is also added to the historical review collection, the pre-trained keyword distribution of each topic should also include the keywords obtained after preprocessing that review); step 3063, incrementing by one the count of keywords contained in the sampled topic. Step 307, judging whether the result converges or whether the preset number of iterations is reached; if yes, proceeding to step 308, otherwise returning to step 306. Step 308, computing the probability that each topic appears in the review to be identified.
The method provided by the above embodiment of this application preprocesses a text to be identified to obtain a keyword set, randomly determines the topic to which each keyword in the keyword set belongs, counts the number of keywords contained in each topic, and then repeats the following steps for each keyword in the keyword set until the result converges or a preset number of iterations is reached: decrementing by one the count of keywords contained in the topic to which the keyword belongs, sampling a new topic for the keyword according to the probability distribution obtained by prior training, and incrementing by one the count of keywords contained in the sampled topic; finally, the probability that each topic appears in the text to be identified is determined according to the number of keywords contained in each topic and the number of keywords in the keyword set, which improves the accuracy of text topic identification.
With further reference to Fig. 4, a flow 400 of another embodiment of the method for identifying a text topic is illustrated. The flow 400 of this method for identifying a text topic comprises the following steps:
Step 401, history text set is obtained.
In the present embodiment, the electronic device on which the method for identifying a text topic runs (such as the server shown in Fig. 1) may obtain a history text set. The history text set includes at least one history text subset, and each history text subset is obtained by dividing the history text set according to the generation time of the texts in the history text set and the quantity of texts. The history text set is dynamic: the above electronic device can continually obtain new texts, including the text to be identified, and add them to the history text set as training corpus.
Step 402, obtaining, by training a text topic generation model, the keyword distribution of each theme in the history text subset whose text generation time is earliest.
In the present embodiment, based on each subset of the history text set obtained in step 401, the above electronic device may train a text topic generation model to obtain the keyword distribution of each theme in the history text subset whose text generation time is earliest. The theme distribution of the texts in a history text subset can likewise be obtained through the text topic generation model. The text topic generation model may be an LDA (Latent Dirichlet Allocation) model, an hLDA (hierarchical Latent Dirichlet Allocation) model, an HDP (Hierarchical Dirichlet Process) model, or a GSDMM (collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model). As an example, with LDA as the text topic generation model, an existing history text set D is divided sequentially by text generation time into multiple subsets D = {d1, d2, ..., di, ...}, where the time span of each subset is Ti = τ * T0, τ being the temporal factor with value range τ ≥ 1 and T0 the base span interval. Model training is performed on each subset di: for every text m in di, the topic vector zm of the text is obtained through LDA, and the number of occurrences of each word under each theme as well as the total number of words included are counted. The specific training process is as follows: each word in the text is mapped to a probability space to obtain an interpretation vector of the short text; then Gibbs sampling is used to cluster the interpretation vectors through the LDA model on the concept space and model the texts, obtaining the "text-theme-word" probability distribution. Finally, the topic vector of the text is obtained and used as the feature vector of that text.
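The sequential division D = {d1, d2, ...} with slice span Ti = τ * T0 described above can be sketched as follows; the helper name, the representation of documents as (creation time, text) pairs, and the base span of 7 days are illustrative assumptions, not details fixed by the embodiment:

```python
from datetime import datetime, timedelta

def split_into_time_slices(docs, tau=1.0, base_span_days=7):
    """Divide a corpus into consecutive subsets d1, d2, ... by creation time.

    Each slice covers a span Ti = tau * T0, where T0 is the base span
    interval and tau >= 1 is the temporal factor.  `docs` is a list of
    (created_at: datetime, text: str) pairs.
    """
    if not docs:
        return []
    docs = sorted(docs, key=lambda d: d[0])
    span = timedelta(days=tau * base_span_days)
    slices, current = [], []
    slice_end = docs[0][0] + span
    for created_at, text in docs:
        while created_at >= slice_end:   # close the current slice, open the next
            slices.append(current)
            current = []
            slice_end += span
        current.append(text)
    slices.append(current)
    return slices
```

Each returned slice would then be handed to the topic model as one training subset di.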
The input of the LDA-based text generation algorithm is the text set W obtained after preprocessing the texts in a subset, the parameters α and β of the Dirichlet distributions, and the number of topics K; the output is the probability matrix θ of texts over themes and the probability matrix φ of themes over keywords. The parameter values may be α = 50/K and β = 0.01. The θi and φi of the texts in time slice Ti can be computed as follows. First, for each theme, a multinomial distribution is sampled from a Dirichlet distribution with parameter β as the keyword distribution of that theme, φk ~ Dirichlet(β). Then, for every text m, a value is sampled from a Poisson distribution as the text length, Nm ~ Poisson(ξ), and a multinomial distribution is sampled from a Dirichlet distribution with parameter α as the theme distribution of that text, θm ~ Dirichlet(α). Finally, for each word in the text, a theme is first sampled from the multinomial theme distribution of the text, zm,n ~ Mult(θm), and then a word is sampled from the multinomial word distribution of that theme, wm,n ~ Mult(φzm,n). This random generation process is repeated until all texts in the text set are generated, while counting the total number Nw of times each word appears in the current text set.
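The generative story just described (φk ~ Dirichlet(β), Nm ~ Poisson(ξ), θm ~ Dirichlet(α), zm,n ~ Mult(θm), wm,n ~ Mult(φz)) can be illustrated with a minimal sketch. The function names and the small default parameters are illustrative assumptions; the Dirichlet draw uses the standard gamma-normalization construction and the Poisson draw uses Knuth's method:

```python
import math
import random

def sample_dirichlet(alpha, k):
    """Draw a k-dim multinomial parameter from a symmetric Dirichlet(alpha)."""
    g = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    s = sum(g)
    return [x / s for x in g]

def sample_categorical(probs):
    """Draw one index from a multinomial/categorical distribution."""
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def generate_corpus(num_docs, vocab_size, K, alpha=0.1, beta=0.01, xi=8):
    """Run the LDA generative process described above."""
    # phi_k ~ Dirichlet(beta): keyword distribution of each theme
    phi = [sample_dirichlet(beta, vocab_size) for _ in range(K)]
    corpus = []
    for _ in range(num_docs):
        # N_m ~ Poisson(xi): text length, via Knuth's method (xi small)
        n, p, limit = 0, 1.0, math.exp(-xi)
        while True:
            p *= random.random()
            if p <= limit:
                break
            n += 1
        # theta_m ~ Dirichlet(alpha): theme distribution of the text
        theta = sample_dirichlet(alpha, K)
        doc = []
        for _ in range(n):
            z = sample_categorical(theta)        # z ~ Mult(theta_m)
            doc.append(sample_categorical(phi[z]))  # w ~ Mult(phi_z)
        corpus.append(doc)
    return corpus
```

Training inverts this process: given only the words, collapsed Gibbs sampling recovers estimates of θ and φ.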
Step 403, based on the keyword distribution of each theme in the history text subset obtained by training, determining successively, according to the generation time of the texts in the history text subsets, the theme distribution of texts and the keyword distribution of each theme in the other subsets except the subset whose text generation time is earliest.
In the present embodiment, based on the keyword distribution of each theme in the history text subset whose text generation time is earliest obtained in step 402, the above electronic device may, according to the generation time of texts in the history text subsets, determine successively the theme distribution of texts and the keyword distribution of each theme in the other subsets. As an example, take the history text subset whose text generation time is earliest as the subset corresponding to time slice T1. The theme distribution of texts and the keyword distribution of themes in the subset corresponding to T1, already obtained in step 402, can be combined with those obtained by training the text topic generation model on the subset whose text generation time is second earliest, i.e. the subset corresponding to time slice T2, yielding the theme distribution θ and the keyword distribution φ of the newest texts. By analogy, the θi-1 and φi-1 of the subset corresponding to time slice Ti-1 can be combined with the θi and φi of the current slice: the probability matrix of texts over themes is θ = πθi + (1 - π)θi-1, and the probability matrix of themes over words is φ = πφi + (1 - π)φi-1. Considering that the subset corresponding to time slice T1 has no preceding subset to refer to, π = 1 is taken for the subset corresponding to T1. By introducing the notion of a time sequence, applying smoothing, and dynamically updating the model parameters in this way, the features of online texts can be obtained in real time.
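The smoothing update θ = πθi + (1 - π)θi-1 (and likewise for φ) amounts to an element-wise blend of the current slice's probability matrices with the previous slice's. A minimal sketch, assuming the matrices are given as lists of rows and that π = 1 is expressed by passing no predecessor:

```python
def smooth_parameters(theta_i, phi_i, theta_prev=None, phi_prev=None, pi=0.5):
    """Blend the current time slice's parameters with the previous slice's.

    theta = pi * theta_i + (1 - pi) * theta_{i-1}
    phi   = pi * phi_i   + (1 - pi) * phi_{i-1}
    For the first slice T1 there is no predecessor, so pi = 1 is used,
    i.e. the current parameters are returned unchanged.
    """
    if theta_prev is None or phi_prev is None:
        return theta_i, phi_i          # slice T1: nothing to look back to

    def blend(cur, prev):
        return [[pi * c + (1 - pi) * p for c, p in zip(row_c, row_p)]
                for row_c, row_p in zip(cur, prev)]

    return blend(theta_i, theta_prev), blend(phi_i, phi_prev)
```

Since each row of θ and φ is a probability distribution, the convex combination keeps every row summing to one.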
Step 404, preprocessing the text to be identified to obtain a keyword set.
In the present embodiment, the electronic device on which the method for identifying a text topic runs (such as the server shown in Fig. 1) may obtain the text to be identified from a terminal or other servers through a wired or wireless connection, and preprocess the text to be identified to obtain a keyword set.
Step 405, randomly determining the theme to which each keyword in the keyword set belongs.
In the present embodiment, based on the keyword set obtained in step 404, the above electronic device may first randomly determine the theme to which each keyword in the keyword set belongs, which is equivalent to sampling for the themes of the text and assigning an initial value to the theme of each keyword. The number of themes may be preset; the setting principle is that a number of themes is better when the similarity between the themes is smaller.
Step 406, counting the number of keywords included in each theme.
In the present embodiment, based on the theme to which each keyword determined in step 405 belongs, the above electronic device may count the number of keywords included in each theme.
Step 407, repeating the following steps for each keyword in the keyword set until the result converges or the preset number of iterations is reached: subtracting one from the number of keywords included in the theme to which the keyword belongs; sampling according to the probability distribution obtained by training in advance to obtain the theme to which the keyword belongs; and adding one to the number of keywords included in the theme obtained by sampling.
In the present embodiment, based on the number of keywords included in each theme counted in step 406, the above electronic device may repeat the above steps for each keyword in the keyword set until the result converges or the preset number of iterations is reached. Convergence of the result means that the variation of the keyword distribution of each theme obtained by repeating the above steps is less than a predetermined threshold; the variation of the keyword distribution of each theme may be the variation of the number of keywords included in each theme or the variation of their kinds. This step may be solved approximately by a stochastic simulation (Monte Carlo) method.
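The decrement-sample-increment loop of step 407 can be sketched as follows. This is a minimal illustration, assuming the pre-trained probability distribution is supplied as a fixed topic-to-word matrix `phi` and that the sampling weight for topic k is proportional to (n_k + α) · phi[k][w]; the exact sampling formula is not fixed by the embodiment:

```python
import random

def infer_topics(doc, phi, K, alpha=0.1, iters=50, seed=None):
    """Infer topic proportions for one new document by collapsed Gibbs
    sampling, keeping the trained topic->word distributions `phi` fixed.

    doc: list of word ids; phi[k][w]: P(word w | topic k).
    """
    rng = random.Random(seed)
    z = [rng.randrange(K) for _ in doc]   # random initial topic per keyword
    counts = [0] * K                      # keywords currently in each topic
    for t in z:
        counts[t] += 1
    for _ in range(iters):
        for i, w in enumerate(doc):
            counts[z[i]] -= 1             # remove the keyword from its topic
            # resample the keyword's topic from the trained distribution
            weights = [(counts[k] + alpha) * phi[k][w] for k in range(K)]
            r, acc, new_t = rng.random() * sum(weights), 0.0, K - 1
            for k, wt in enumerate(weights):
                acc += wt
                if r < acc:
                    new_t = k
                    break
            z[i] = new_t
            counts[new_t] += 1            # add it back under the new topic
    n = len(doc)
    # smoothed ratio: probability each topic appears in the document
    return [(counts[k] + alpha) / (n + K * alpha) for k in range(K)]
```

Convergence could equally be detected by comparing successive `counts` vectors against a threshold instead of running a fixed number of iterations.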
Step 408, determining the probability that each theme appears in the text to be identified according to the number of keywords included in each theme and the number of keywords in the keyword set.
In the present embodiment, the above electronic device may determine the probability that each theme appears in the text to be identified according to the number of keywords included in each theme obtained in step 407 and the number of keywords in the keyword set. The proportion of keywords in the keyword set belonging to each theme may be taken as the probability that the theme appears in the text to be identified, or other more accurate statistical methods may be used for the calculation.
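The simple ratio estimate just mentioned, P(theme k | text) = (keywords assigned to theme k) / (total keywords), is a one-liner; the function name is an illustrative assumption:

```python
def topic_probabilities(topic_counts):
    """Estimate P(theme k appears in the text) as the share of the
    text's keywords assigned to theme k (the ratio described above)."""
    total = sum(topic_counts)
    if total == 0:
        return [0.0] * len(topic_counts)
    return [c / total for c in topic_counts]
```

For example, counts of [3, 1] over two themes give probabilities [0.75, 0.25].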
As can be seen from Fig. 4, unlike the embodiment corresponding to Fig. 2, the present embodiment adds the step of training to obtain the probability distributions, so as to extract features from short texts such as comments: the probability of each theme is calculated and used as the final feature vector of the short text, solving the problems that the features of short texts are high-dimensional and sparse and that the features are indefinite.
With further reference to Fig. 5, as an implementation of the methods shown in the above figures, the present application provides an embodiment of a device for identifying a text topic. This device embodiment corresponds to the method embodiment shown in Fig. 2, and the device may specifically be applied in various electronic devices.
As shown in Fig. 5, the device 500 for identifying a text topic of the present embodiment includes: a pretreatment unit 501, a theme determination unit 502, a statistic unit 503, a sampling unit 504, and a probability determining unit 505. The pretreatment unit 501 is configured to preprocess the text to be identified to obtain a keyword set; the theme determination unit 502 is configured to randomly determine the theme to which each keyword in the keyword set belongs; the statistic unit 503 is configured to count the number of keywords included in each theme; the sampling unit 504 is configured to repeat the following steps for each keyword in the keyword set until the result converges or the preset number of iterations is reached, where convergence means that the variation of the keyword distribution of each theme obtained by repeating the following steps is less than a predetermined threshold: subtracting one from the number of keywords included in the theme to which the keyword belongs; sampling according to the probability distribution obtained by training in advance to obtain the theme to which the keyword belongs; adding one to the number of keywords included in the theme obtained by sampling; and the probability determining unit 505 is configured to determine the probability that each theme appears in the text to be identified according to the number of keywords included in each theme and the number of keywords in the keyword set.
In the present embodiment, for the specific processing of the pretreatment unit 501, the theme determination unit 502, the statistic unit 503, the sampling unit 504, and the probability determining unit 505 in the device 500 for identifying a text topic, reference may be made to the related descriptions of the implementations of step 201, step 202, step 203, step 204, and step 205 in the embodiment corresponding to Fig. 2, and details are not repeated here.
In some optional implementations of the present embodiment, the device further includes a training unit (not shown). The training unit includes: an obtaining subunit (not shown), configured to obtain a history text set, wherein the history text set includes at least one history text subset, each history text subset being obtained by dividing the history text set according to the generation time of the texts in the history text set and the quantity of texts; and a training subunit (not shown), configured to obtain, by training a text topic generation model, the keyword distribution of each theme in the history text subset where the text to be identified is located.
In some optional implementations of the present embodiment, the training subunit is further configured to: obtain, by training the text topic generation model, the keyword distribution of each theme in the history text subset whose text generation time is earliest; and based on the keyword distribution of each theme in the history text subset obtained by training, determine successively, according to the generation time of the texts in the history text subsets, the theme distribution of texts and the keyword distribution of each theme in the other subsets except the subset whose text generation time is earliest.
In some optional implementations of the present embodiment, the training subunit is further configured to: for each text in the subset whose text generation time is earliest, perform the following steps until the text is generated: for each theme, sample a multinomial distribution from a first Dirichlet distribution as the distribution of the theme over keywords; randomly sample a value from a discrete probability distribution as the length of the text; sample a multinomial distribution from a second Dirichlet distribution as the distribution of the text over themes; and for each keyword in the text, sample a theme from the distribution of the text over themes, then sample a keyword from the distribution over keywords of the sampled theme.
In some optional implementations of the present embodiment, the preprocessing includes word segmentation and deleting stop words according to a stop-word list; and the device further includes: a computing unit (not shown), configured to compute the term frequency-inverse document frequency (TF-IDF) value of each keyword in the keyword set; and an adding unit (not shown), configured to add the keyword to the stop-word list in response to the ratio of its TF-IDF value to the number of subsets in which the keyword appears being less than a predetermined threshold.
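The optional stop-word augmentation just described can be sketched as follows. This is a minimal illustration under stated assumptions: raw-count term frequency, a natural-log IDF, and document frequency standing in for the "number of subsets in which the keyword appears"; the embodiment does not fix these exact formulas:

```python
import math

def augment_stoplist(docs, stoplist, threshold=0.01):
    """Grow the stop-word list: a keyword whose ratio of TF-IDF value
    to the number of documents it appears in falls below `threshold`
    is treated as uninformative and added to the list.

    docs: list of keyword lists; stoplist: a mutable set of words.
    """
    n_docs = len(docs)
    df, tf = {}, {}
    for doc in docs:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1      # document frequency
        for w in doc:
            tf[w] = tf.get(w, 0) + 1      # raw term frequency
    total_terms = sum(tf.values())
    for w, f in tf.items():
        tfidf = (f / total_terms) * math.log(n_docs / df[w])
        if tfidf / df[w] < threshold:
            stoplist.add(w)
    return stoplist
```

A word that appears in every document gets IDF = log(1) = 0, so its TF-IDF is zero and it is always added, which matches the intent of filtering ubiquitous, uninformative words.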
Referring now to Fig. 6, a structural diagram of a computer system 600 suitable for implementing the server of the embodiments of the present application is illustrated.
As shown in Fig. 6, the computer system 600 includes a central processing unit (CPU) 601, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage portion 608 into a random access memory (RAM) 603. Various programs and data required for the operation of the system 600 are also stored in the RAM 603. The CPU 601, the ROM 602, and the RAM 603 are connected with each other through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a cathode ray tube (CRT), a liquid crystal display (LCD), a loudspeaker, and the like; a storage portion 608 including a hard disk and the like; and a communication portion 609 including a network interface card such as a LAN card or a modem. The communication portion 609 performs communication processing via a network such as the Internet. A driver 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the driver 610 as needed, so that a computer program read from it is installed into the storage portion 608 as needed.
In particular, according to an embodiment of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product, which includes a computer program tangibly embodied on a machine-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication portion 609, and/or installed from the removable medium 611. When the computer program is executed by the central processing unit (CPU) 601, the above-mentioned functions defined in the method of the present application are performed.
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions, and operations of the systems, methods, and computer program products according to various embodiments of the present application. In this regard, each box in a flowchart or block diagram may represent a module, a program segment, or a part of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the boxes may occur in an order different from that marked in the drawings. For example, two boxes represented in succession may actually be executed substantially in parallel, and sometimes they may be executed in the reverse order, depending on the functions involved. It should also be noted that each box in the block diagrams and/or flowcharts, and combinations of boxes in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or by hardware. The described units may also be provided in a processor; for example, they may be described as: a processor including a pretreatment unit, a theme determination unit, a statistic unit, a sampling unit, and a probability determining unit. The names of these units do not in certain cases constitute a limitation on the units themselves; for example, the pretreatment unit may also be described as "a unit that preprocesses the text to be identified to obtain a keyword set".
As another aspect, the present application further provides a non-volatile computer storage medium, which may be the non-volatile computer storage medium included in the device of the above embodiments, or may exist alone without being assembled into a terminal. The above non-volatile computer storage medium stores one or more programs; when the one or more programs are executed by a device, the device is caused to: preprocess the text to be identified to obtain a keyword set; randomly determine the theme to which each keyword in the keyword set belongs; count the number of keywords included in each theme; repeat the following steps for each keyword in the keyword set until the result converges or the preset number of iterations is reached, where convergence means that the variation of the keyword distribution of each theme obtained by repeating the following steps is less than a predetermined threshold: subtract one from the number of keywords included in the theme to which the keyword belongs; sample according to the probability distribution obtained by training in advance to obtain the theme to which the keyword belongs; add one to the number of keywords included in the theme obtained by sampling; and determine the probability that each theme appears in the text to be identified according to the number of keywords included in each theme and the number of keywords in the keyword set.
The above description is only a preferred embodiment of the present application and an explanation of the applied technical principles. Those skilled in the art should appreciate that the scope of the invention involved in the present application is not limited to the technical solutions formed by the particular combinations of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the inventive concept, for example, technical solutions formed by mutually replacing the above features with (but not limited to) technical features with similar functions disclosed in the present application.
Claims (10)
- 1. A method for identifying a text topic, characterized in that the method comprises: preprocessing a text to be identified to obtain a keyword set; randomly determining the theme to which each keyword in the keyword set belongs; counting the number of keywords included in each theme; repeating the following steps for each keyword in the keyword set until the result converges or a preset number of iterations is reached, wherein convergence of the result means that the variation of the keyword distribution of each theme obtained by repeating the following steps is less than a predetermined threshold: subtracting one from the number of keywords included in the theme to which the keyword belongs; sampling according to a probability distribution obtained by training in advance to obtain the theme to which the keyword belongs; adding one to the number of keywords included in the theme obtained by sampling; and determining the probability that each theme appears in the text to be identified according to the number of keywords included in each theme and the number of keywords in the keyword set.
- 2. The method according to claim 1, characterized in that the method further comprises a step of training to obtain the probability distribution, wherein the step of training to obtain the probability distribution comprises: obtaining a history text set, wherein the history text set includes at least one history text subset, each history text subset being obtained by dividing the history text set according to the generation time of the texts in the history text set and the quantity of texts; and obtaining, by training a text topic generation model, the keyword distribution of each theme in the history text subset where the text to be identified is located.
- 3. The method according to claim 2, characterized in that the obtaining, by training the text topic generation model, the keyword distribution of each theme in the history text subset where the text to be identified is located comprises: obtaining, by training the text topic generation model, the keyword distribution of each theme in the history text subset whose text generation time is earliest; and based on the keyword distribution of each theme in the history text subset obtained by training, determining successively, according to the generation time of the texts in the history text subsets, the theme distribution of texts and the keyword distribution of each theme in the other subsets except the subset whose text generation time is earliest.
- 4. The method according to claim 3, characterized in that the obtaining, by training the text topic generation model, the keyword distribution of each theme in the subset whose text generation time is earliest comprises: for each text in the subset whose text generation time is earliest, performing the following steps until the text is generated: for each theme, sampling a multinomial distribution from a first Dirichlet distribution as the distribution of the theme over keywords; randomly sampling a value from a discrete probability distribution as the length of the text; sampling a multinomial distribution from a second Dirichlet distribution as the distribution of the text over themes; and for each keyword in the text, sampling a theme from the distribution of the text over themes, then sampling a keyword from the distribution over keywords of the sampled theme.
- 5. The method according to any one of claims 2-4, characterized in that the method further comprises: computing the term frequency-inverse document frequency (TF-IDF) value of each keyword in the keyword set; and in response to the ratio of the TF-IDF value to the number of subsets in which the keyword appears being less than a predetermined threshold, adding the keyword to a stop-word list; and in that the preprocessing includes word segmentation and deleting stop words according to the stop-word list.
- 6. A device for identifying a text topic, characterized in that the device comprises: a pretreatment unit, configured to preprocess a text to be identified to obtain a keyword set; a theme determination unit, configured to randomly determine the theme to which each keyword in the keyword set belongs; a statistic unit, configured to count the number of keywords included in each theme; a sampling unit, configured to repeat the following steps for each keyword in the keyword set until the result converges or a preset number of iterations is reached, wherein convergence of the result means that the variation of the keyword distribution of each theme obtained by repeating the following steps is less than a predetermined threshold: subtracting one from the number of keywords included in the theme to which the keyword belongs; sampling according to a probability distribution obtained by training in advance to obtain the theme to which the keyword belongs; adding one to the number of keywords included in the theme obtained by sampling; and a probability determining unit, configured to determine the probability that each theme appears in the text to be identified according to the number of keywords included in each theme and the number of keywords in the keyword set.
- 7. The device according to claim 6, characterized in that the device further comprises a training unit, the training unit comprising: an obtaining subunit, configured to obtain a history text set, wherein the history text set includes at least one history text subset, each history text subset being obtained by dividing the history text set according to the generation time of the texts in the history text set and the quantity of texts; and a training subunit, configured to obtain, by training a text topic generation model, the keyword distribution of each theme in the history text subset where the text to be identified is located.
- 8. The device according to claim 7, characterized in that the training subunit is further configured to: obtain, by training the text topic generation model, the keyword distribution of each theme in the history text subset whose text generation time is earliest; and based on the keyword distribution of each theme in the history text subset obtained by training, determine successively, according to the generation time of the texts in the history text subsets, the theme distribution of texts and the keyword distribution of each theme in the other subsets except the subset whose text generation time is earliest.
- 9. The device according to claim 8, characterized in that the training subunit is further configured to: for each text in the subset whose text generation time is earliest, perform the following steps until the text is generated: for each theme, sample a multinomial distribution from a first Dirichlet distribution as the distribution of the theme over keywords; randomly sample a value from a discrete probability distribution as the length of the text; sample a multinomial distribution from a second Dirichlet distribution as the distribution of the text over themes; and for each keyword in the text, sample a theme from the distribution of the text over themes, then sample a keyword from the distribution over keywords of the sampled theme.
- 10. The device according to any one of claims 7-9, characterized in that the preprocessing includes word segmentation and deleting stop words according to a stop-word list; and the device further comprises: a computing unit, configured to compute the term frequency-inverse document frequency (TF-IDF) value of each keyword in the keyword set; and an adding unit, configured to add the keyword to the stop-word list in response to the ratio of its TF-IDF value to the number of subsets in which the keyword appears being less than a predetermined threshold.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611051277.2A CN108090042A (en) | 2016-11-23 | 2016-11-23 | For identifying the method and apparatus of text subject |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611051277.2A CN108090042A (en) | 2016-11-23 | 2016-11-23 | For identifying the method and apparatus of text subject |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108090042A true CN108090042A (en) | 2018-05-29 |
Family
ID=62171770
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611051277.2A Pending CN108090042A (en) | 2016-11-23 | 2016-11-23 | For identifying the method and apparatus of text subject |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108090042A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103345474A (en) * | 2013-07-25 | 2013-10-09 | 苏州大学 | Online tracking method for document theme |
CN103870563A (en) * | 2014-03-07 | 2014-06-18 | 北京奇虎科技有限公司 | Method and device for determining subject distribution of given text |
US20150317303A1 (en) * | 2014-04-30 | 2015-11-05 | Linkedin Corporation | Topic mining using natural language processing techniques |
CN105786791A (en) * | 2014-12-23 | 2016-07-20 | 深圳市腾讯计算机系统有限公司 | Data topic acquisition method and apparatus |
CN105975499A (en) * | 2016-04-27 | 2016-09-28 | 深圳大学 | Text subject detection method and system |
Non-Patent Citations (1)
Title |
---|
Cui Kai: "Research and Implementation of Topic Evolution Based on LDA", China Master's Theses Full-text Database, Information Science and Technology Series * |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109299364A (en) * | 2018-09-26 | 2019-02-01 | 贵州大学 | A kind of short text dynamic cluster method with new theme skewed popularity |
CN110083774A (en) * | 2019-05-10 | 2019-08-02 | 腾讯科技(深圳)有限公司 | Using determination method, apparatus, computer equipment and the storage medium of recommendation list |
CN110083774B (en) * | 2019-05-10 | 2023-11-03 | 腾讯科技(深圳)有限公司 | Method and device for determining application recommendation list, computer equipment and storage medium |
CN111222319B (en) * | 2019-11-14 | 2021-09-14 | 电子科技大学 | Document information extraction method based on HDP model |
CN111222319A (en) * | 2019-11-14 | 2020-06-02 | 电子科技大学 | Document information extraction method based on novel HDP model |
CN111339296A (en) * | 2020-02-20 | 2020-06-26 | 电子科技大学 | Document theme extraction method based on introduction of adaptive window in HDP model |
CN111339296B (en) * | 2020-02-20 | 2023-03-28 | 电子科技大学 | Document theme extraction method based on introduction of adaptive window in HDP model |
CN111221880B (en) * | 2020-04-23 | 2021-01-22 | 北京瑞莱智慧科技有限公司 | Feature combination method, device, medium, and electronic apparatus |
CN111221880A (en) * | 2020-04-23 | 2020-06-02 | 北京瑞莱智慧科技有限公司 | Feature combination method, device, medium, and electronic apparatus |
CN112668306A (en) * | 2020-12-22 | 2021-04-16 | 延边大学 | Language processing method and system based on statement discrimination recognition and reinforcement learning action design |
CN112668306B (en) * | 2020-12-22 | 2021-07-27 | 延边大学 | Language processing method and system based on statement discrimination recognition and reinforcement learning action design |
CN113326385A (en) * | 2021-08-04 | 2021-08-31 | 北京达佳互联信息技术有限公司 | Target multimedia resource acquisition method and device, electronic equipment and storage medium |
CN113326385B (en) * | 2021-08-04 | 2021-12-07 | 北京达佳互联信息技术有限公司 | Target multimedia resource acquisition method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108090042A (en) | For identifying the method and apparatus of text subject | |
CN108804512B (en) | Text classification model generation device and method and computer readable storage medium | |
US20160140106A1 (en) | Phrase-based data classification system | |
Zhu et al. | Popularity modeling for mobile apps: A sequential approach | |
CN110390408B (en) | Transaction object prediction method and device | |
US20140129510A1 (en) | Parameter Inference Method, Calculation Apparatus, and System Based on Latent Dirichlet Allocation Model | |
CN106845999A (en) | Risk subscribers recognition methods, device and server | |
CN109740167B (en) | Method and apparatus for generating information | |
CN107871166A (en) | For the characteristic processing method and characteristics processing system of machine learning | |
CN110287409B (en) | Webpage type identification method and device | |
CN107818491A (en) | Electronic installation, Products Show method and storage medium based on user's Internet data | |
CN108021651A (en) | Network public opinion risk assessment method and device | |
WO2019085332A1 (en) | Financial data analysis method, application server, and computer readable storage medium | |
Antonyuk et al. | Medical news aggregation and ranking of taking into account the user needs | |
CN110858226A (en) | Conversation management method and device | |
CN106803092B (en) | Method and device for determining standard problem data | |
Pathak et al. | Adaptive framework for deep learning based dynamic and temporal topic modeling from big data | |
CN109190123A (en) | Method and apparatus for output information | |
CN111221881A (en) | User characteristic data synthesis method and device and electronic equipment | |
CN114117048A (en) | Text classification method and device, computer equipment and storage medium | |
CN111930944B (en) | File label classification method and device | |
CN113763031A (en) | Commodity recommendation method and device, electronic equipment and storage medium | |
JP2014160345A (en) | Browsing action predicting device, browsing action learning device, browsing action predicting method, and browsing action learning method and program | |
CN105808744A (en) | Information prediction method and device | |
CN111859238A (en) | Method and device for predicting data change frequency based on model and computer equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 1256148; Country of ref document: HK |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180529 |