CN110390092A - Document subject matter determines method and relevant device - Google Patents
- Publication number: CN110390092A (application CN201810350016.3A)
- Authority: CN (China)
- Prior art keywords: document, neural network, intermediate parameters, network model, parameter
- Legal status: Pending (an assumption, not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
Abstract
This application provides a method for determining the topics of a document. In this method, a target document is input into a pre-trained neural network model to obtain first intermediate parameters; the first intermediate parameters are passed through a probability distribution model to obtain second intermediate parameters; and the second intermediate parameters are then fed into a stick-breaking algorithm to obtain the topic weights of the target document. In addition, the application also provides a document-topic determining apparatus and a storage medium, so as to guarantee the application and realization of the method in practice.
Description
Technical field
This application relates to the field of text analysis, and more specifically to a document-topic determination method and related devices.
Background technique
A document is a carrier of information, and the main content of the information it carries can be determined by analysis; this main content may be called the topic of the document. A document's topics are reflected in the frequencies of the words that make up the document. For example, if a document discusses economics, its topic may be identified as "economy", and words such as "currency", "finance", "cost" and "income" will occur with high frequency. Likewise, if a document discusses a war, its topic may be identified as "war", and words such as "weapon", "destruction", "aircraft" and "tank" will occur with high frequency.
The identified topics are of great significance for document analysis and related tasks. A technical solution is therefore needed for determining the topics contained in a document.
Summary of the invention
In view of this, the application provides a document-topic determination method for determining the topics contained in a document.
To achieve this object, the technical solutions provided by the application are as follows.
In a first aspect, the application provides a method for determining document topics, comprising:
obtaining a target document and a neural network model, the neural network model being used to obtain a preset number of first intermediate parameters under the constraint of its model parameters;
inputting the target document into the neural network model to obtain the first intermediate parameters;
inputting the first intermediate parameters into a probability distribution model to obtain the probability density function of second intermediate parameters;
sampling from the probability density function of the second intermediate parameters to obtain target second intermediate parameters; and
inputting the target second intermediate parameters into a stick-breaking algorithm to obtain the topic weights of the target document.
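The five steps of the first aspect can be sketched end to end as follows. Everything concrete here is an illustrative assumption rather than the patent's actual model: a random single-layer stand-in for the neural network, toy sizes K and V, and a Kumaraswamy distribution as the probability distribution model (the concrete example given in the detailed description).

```python
import numpy as np

rng = np.random.default_rng(0)

K = 5   # assumed maximum number of topics (the method's parameter K)
V = 8   # assumed vocabulary size

# Step 1: obtain a neural network model g; here a random single-layer
# stand-in whose outputs are made strictly positive with a softplus.
W_a = rng.normal(size=(V, K - 1))
W_b = rng.normal(size=(V, K - 1))
softplus = lambda x: np.log1p(np.exp(x))

def g(doc_bow):
    """Map a bag-of-words vector to the two kinds of first intermediate parameters."""
    return softplus(doc_bow @ W_a) + 1e-3, softplus(doc_bow @ W_b) + 1e-3

# Step 2: input the target document (toy word counts) into the network.
doc = rng.integers(0, 3, size=V).astype(float)
a, b = g(doc)

# Steps 3-4: (a, b) parameterize Kumaraswamy densities on (0, 1); sampling
# through the inverse CDF yields the target second intermediate parameters.
u = rng.uniform(size=K - 1)
v = (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)

# Step 5: append the element 1 and apply stick-breaking to get K topic weights.
v_full = np.append(v, 1.0)
remaining = np.concatenate(([1.0], np.cumprod(1.0 - v_full[:-1])))
pi = v_full * remaining

print(np.round(pi, 3), float(pi.sum()))
```

By construction the K weights are nonnegative and sum to 1, which is what lets them serve directly as topic weights.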
In a second aspect, the application provides an apparatus for determining document topics, comprising:
a document and model obtaining unit, configured to obtain a target document and a neural network model, the neural network model being used to obtain a preset number of first intermediate parameters under the constraint of its model parameters;
a first intermediate parameter obtaining unit, configured to input the target document into the neural network model to obtain the first intermediate parameters;
a probability density function obtaining unit, configured to input the first intermediate parameters into a probability distribution model to obtain the probability density function of second intermediate parameters;
a second intermediate parameter obtaining unit, configured to sample from the probability density function of the second intermediate parameters to obtain target second intermediate parameters; and
a topic weight determination unit, configured to input the target second intermediate parameters into a stick-breaking algorithm to obtain the topic weights of the target document.
In a third aspect, the application provides a document-topic determining device, comprising a memory and a processor; by running a software program stored in the memory and calling data stored in the memory, the processor performs at least the following steps:
obtaining a target document and a neural network model, the neural network model being used to obtain a preset number of first intermediate parameters under the constraint of its model parameters;
inputting the target document into the neural network model to obtain the first intermediate parameters;
inputting the first intermediate parameters into a probability distribution model to obtain the probability density function of second intermediate parameters;
sampling from the probability density function of the second intermediate parameters to obtain target second intermediate parameters; and
inputting the target second intermediate parameters into a stick-breaking algorithm to obtain the topic weights of the target document.
In a fourth aspect, the application provides a readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the above method for determining document topics is performed.
As can be seen from the above technical solutions, the application provides a method for determining document topics in which a target document is input into a pre-trained neural network model to obtain first intermediate parameters, the first intermediate parameters yield second intermediate parameters through a probability distribution model, and the second intermediate parameters in turn yield the topic weights of the target document through a stick-breaking algorithm.
Detailed description of the invention
To explain the technical solutions in the embodiments of the application or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are merely embodiments of the application, and those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a document-topic determination method provided by the application;
Fig. 2 is a schematic structural diagram of a neural network provided by the application;
Fig. 3 is a flowchart of training a topic model provided by the application;
Fig. 4 is another flowchart of training a topic model provided by the application;
Fig. 5 is a schematic structural diagram of a document-topic determining apparatus provided by the application;
Fig. 6 is another schematic structural diagram of a document-topic determining apparatus provided by the application;
Fig. 7 is a hardware structure diagram of a document-topic determining device provided by the application.
Specific embodiment
The technical solutions in the embodiments of the application are described below clearly and completely with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the application, not all of them. Based on the embodiments of the application, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the application.
A document can take many forms, such as a news item or an article. The topics of a document can be used to represent the semantic features it expresses. A topic model can be used to determine the topics of a document; a topic model is a mathematical-statistical model that can discover abstract topics in documents.
Normally, a document may have more than one topic. Existing topic models fix the number of topics internally and analyze every input document according to that number: no matter what the input document is, the number of output topics is fixed, and the quality of the resulting topics is low.
In this regard, the application provides a method for determining document topics that can adaptively learn the topics a document contains: the number of topics is not limited in advance but is learned from the content of the document itself.
Referring to Fig. 1, which illustrates the flow of a document-topic determination method, the method specifically includes steps S101 to S105.
S101: Obtain a target document and a pre-trained neural network model.
Here, the target document, i.e., the document to be identified, can be any document containing multiple words. The neural network model (which may simply be called a neural network) is trained in advance; it contains model parameters (which may simply be called parameters), and the values of the model parameters are obtained by training on document samples.
The neural network in the application can be a multilayer fully connected neural network, or a neural network of another form, such as a multilayer neural network, a single-layer neural network, or a convolutional neural network. The number of neurons per layer may be the same or different across layers. Taking a multilayer fully connected network as an example, one possible structure is shown in Fig. 2, where each circle denotes a neuron and each arrow connecting two circles denotes a connection weight between neurons.
S102: Input the target document into the neural network model to obtain the first intermediate parameters.
The neural network model also contains a parameter K, which denotes the maximum number of topics the model can output. It should be noted that this maximum is a relatively infinite value, i.e., relatively infinite compared with the number of topics a document is believed to actually contain. For example, if a document is believed to contain no more than 10 topics, then 100 times that number can be taken as the relatively infinite value, and the maximum number of topics can be set to 1000. Of course, the maximum number of topics can also be set to other values; the application is not limited in this respect.
The first intermediate parameters are the outputs of the neural network model, and the number of outputs of each kind of first intermediate parameter is K minus 1. For example, the neural network model can be denoted g(w1:N; ψ), where ψ denotes the parameters of the neural network model; inputting a target document w1:N containing N words into the model yields K−1 first intermediate parameters.
It should be noted that there may be several kinds of first intermediate parameters, each kind with K−1 outputs. The kinds of first intermediate parameters are related to the structure of the neural network model, and different neural network models may output different kinds of first intermediate parameters.
S103: Input the first intermediate parameters into a probability distribution model to obtain the probability density function of the second intermediate parameters.
Here, the probability distribution model is a probability distribution. One concrete example is the Kumaraswamy distribution, a two-parameter continuous distribution taking values in the range [0, 1]; under different parameter values, its density function takes different shapes. Note that the distribution has two parameters, and both can be output by the neural network model. In other words, the neural network model can output two kinds of first intermediate parameters that serve as the two parameters of the Kumaraswamy distribution; with them, the Kumaraswamy distribution yields a probability density function. This is the probability density function of another intermediate parameter which, to distinguish it from the first intermediate parameters above, is called the second intermediate parameter.
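As a quick numerical illustration of the Kumaraswamy(a, b) distribution described above, the sketch below evaluates its density a·b·v^(a−1)·(1−v^a)^(b−1) for a few arbitrary parameter pairs and checks by a midpoint sum that each is a proper density on (0, 1); the parameter values are illustrative assumptions only.

```python
def kumaraswamy_pdf(v, a, b):
    """Density of the two-parameter Kumaraswamy(a, b) distribution on (0, 1)."""
    return a * b * v ** (a - 1.0) * (1.0 - v ** a) ** (b - 1.0)

# Midpoint-rule check that the density integrates to 1; different parameter
# values give visibly different density shapes (uniform, unimodal, increasing).
n = 200_000
for a, b in [(1.0, 1.0), (2.0, 3.0), (5.0, 1.0)]:
    mass = sum(kumaraswamy_pdf((i + 0.5) / n, a, b) for i in range(n)) / n
    print(f"a={a}, b={b}: total mass ~ {mass:.4f}")
```

With a = b = 1 the density reduces to the uniform density on (0, 1), which is a convenient sanity check.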
S104: Sample from the probability density function of the second intermediate parameters to obtain the target second intermediate parameters.
A probability density function specifies a probability distribution from which samples can be drawn; the sampling here is random sampling, and the sampled values are the target second intermediate parameters. Note that the number of samples equals the number of first intermediate parameters, i.e., K minus 1. The sampled target second intermediate parameters can form a vector of dimension K−1.
S105: Input the target second intermediate parameters into the stick-breaking algorithm to obtain the topic weights of the target document.
Since the application sets the maximum number of topics to K, the weights of K topics are needed, but the target second intermediate parameter vector contains only K−1 elements. An element with probability value 1 is therefore appended as the last element of the vector, yielding a K-dimensional vector. Inputting this K-dimensional vector into the stick-breaking algorithm produces K values, which are the weights of the K topics in the target document.
It should be noted that although the application obtains the weights of K topics, where K is a relatively infinite value compared with the actual number of topics in a document, the neural network obtained by the application is a trained network that explains documents well: the intermediate parameters it outputs drive the weights of some topics to be comparatively small, close to 0. The application can take the weight values that satisfy a certain condition as the final topic weights of the target document. Since different target documents input into the network yield different first intermediate parameters, the resulting topic weights differ as well, and so do the number and values of the selected final topic weights. Thus the application can adaptively learn the number of topics and the topic weights of each target document.
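The selection of final topic weights described above can be sketched as follows. The patent only says the retained weights "satisfy a certain condition"; the fixed threshold used here, and the renormalization of the survivors, are illustrative assumptions.

```python
def effective_topics(topic_weights, threshold=0.01):
    """Keep topics whose weight passes the (assumed) threshold condition and
    renormalize the survivors so they still sum to 1."""
    kept = [k for k, w in enumerate(topic_weights) if w >= threshold]
    total = sum(topic_weights[k] for k in kept)
    return kept, [topic_weights[k] / total for k in kept]

# Toy model output: K = 8 weights with most mass on a few topics, the rest
# driven close to 0 by the trained network.
pi = [0.45, 0.002, 0.30, 0.001, 0.20, 0.003, 0.004, 0.04]
idx, w = effective_topics(pi)
print(idx, [round(x, 3) for x in w])
```

A different target document would produce different weights, and hence a different number of surviving topics, which is the adaptive behavior the passage describes.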
As can be seen from the above technical solutions, the application provides a method for determining document topics in which a target document is input into a pre-trained neural network model to obtain first intermediate parameters, the first intermediate parameters yield second intermediate parameters through a probability distribution model, and the second intermediate parameters in turn yield the topic weights of the target document through a stick-breaking algorithm. After the topic weights are determined, the topic corresponding to the largest topic weight can be determined as the target topic expressed by the target document, and documents can then be pushed, according to the target topic, to users interested in that topic.
The above document-topic determination method can be regarded as realized by a topic model; that is, the topic model is used to determine how many topics a document contains and the weight of each topic. The topic model contains unknown parameters that must be trained on document samples to determine their values. This process is called the training process of the topic model and is described in detail below.
The topic model assumes that a document is generated as follows. K topics are preset; each topic contains multiple words, and each topic has a weight value indicating the proportion of the topic in the document. Topics are randomly sampled according to the weight values. Suppose the first randomly sampled topic is topic K1; some words are then sampled from the words corresponding to topic K1. Then a second topic K2 is randomly sampled, and some words are sampled from the words corresponding to topic K2. Repeating this yields multiple words, and combining them yields a document. If a document is generated in this way and the war topic accounts for 90% of the generated document while the economy topic accounts for 10%, then the sampled words corresponding to war will naturally outnumber those corresponding to economy.
This generating process can be represented by a document generation model, which can be expressed by the following formulas (1) and (2):

w̄ = σ(Φᵀπ)  (1)

wi ~ Multinomial(w̄), i = 1, …, N  (2)

where w̄ denotes the probability distribution of each word in the document to be generated; it is a vector whose dimension is the number of words, each word having its own probability. Φ denotes the topics (the matrix whose rows are the topic vectors φ), π denotes the topic weights, and σ is the softmax function, which normalizes the weighted values to lie between 0 and 1 and sum to 1.
Regarding the parameter φ: a vocabulary (also called a dictionary) can be preset, and each topic corresponds to one φ. φ is a multidimensional vector whose dimension is the number of words in the vocabulary, and the value of each element of the vector denotes the probability that the corresponding vocabulary word appears in the document.
Multinomial denotes a multinomial distribution, and wi denotes a word sampled from the per-word probability distribution w̄ of the document to be generated; these sampled words form the document. For example, if the vocabulary contains 20,000 words, then w̄ is a 20,000-dimensional vector whose elements are probability values greater than 0 and less than 1 that sum to 1, e.g. (0.01, 0.2, 0.025, 0.071, …); each probability value corresponds to one vocabulary word, and the multinomial distribution samples words according to these probability values. The sampled words form a document.
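The generative story above can be simulated directly. The eight-word vocabulary, the two topic distributions, and the 90%/10% topic weights below are toy assumptions chosen to mirror the war/economy example in the text.

```python
import random

random.seed(42)

vocab = ["weapon", "destruction", "aircraft", "tank",
         "currency", "finance", "cost", "income"]

# Each topic's phi is a probability distribution over the whole vocabulary.
phi_war  = [0.30, 0.25, 0.20, 0.15, 0.03, 0.03, 0.02, 0.02]
phi_econ = [0.02, 0.02, 0.03, 0.03, 0.30, 0.25, 0.20, 0.15]

# Topic weights: 90% war, 10% economy, as in the example above.
pi = [0.9, 0.1]
word_dist = [pi[0] * pw + pi[1] * pe for pw, pe in zip(phi_war, phi_econ)]

# Multinomial sampling in the spirit of formula (2): draw 1000 words.
words = random.choices(vocab, weights=word_dist, k=1000)
counts = {w: words.count(w) for w in vocab}
print(counts)
```

As the passage predicts, war-related words dominate the generated document because the war topic carries most of the weight.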
Formula (1) of the document generation model contains the topics φ and the topic weights π. One way to set the topic weights π is to randomly sample the weight of each topic from a GEM distribution. The reason for sampling from a GEM distribution is that, within one document, the weights of the topics must each be greater than 0 and sum to 1, and the probability values sampled from a GEM distribution satisfy this requirement.
It should be noted that the application models the generating process of documents with a GEM distribution. The document generation model obtained in this way places no restriction on the number of topics, so the model can produce topic weights for any number of topics, and a suitable number of topics can subsequently be selected according to the characteristics of the training data set. The document generation model constructed by the application describes the generating process of documents and contains unknown parameters; after the unknown parameters are determined, inputting a document into the model yields a high probability for that document while also satisfying other constraints.
It should be noted that, although topic weights can be randomly sampled from the above distribution, weights obtained by random sampling are generated directly from the prior, and there is no guarantee that they are accurate. To solve this problem, a neural network can be used to operate on document samples and determine the topic weights π.
The values of the topics φ need to be trained, the topic weights π are obtained from the output of the neural network, and the parameters of the neural network likewise need to be trained. The following mainly introduces how the parameters of the neural network and the parameters of the document generation model are trained; this process is called the training process of the topic model. As shown in Fig. 3, the training process of the topic model can include the following steps S301 to S307.
S301: Input a document sample into the neural network to obtain the first intermediate parameters.
Specifically, the input document is a document sample, and inputting it into the neural network produces the first intermediate parameters; this process can be expressed by the following formula (3):

[a1, …, aK−1; b1, …, bK−1] = g(w1:N; ψ)  (3)

where g(w1:N; ψ) denotes the neural network; ψ denotes the model parameters of the neural network; w1:N denotes the bag-of-words model of the document sample, with N the number of words in the sample; a and b are the first intermediate parameters; and K denotes the maximum number of topics in the topic model, whose value is preset. As can be seen, the neural network model obtains a preset number of first intermediate parameters under the constraint of its model parameters.
It should be noted that the model parameters of this neural network take preset values, so the network can be called the initial neural network.
S302: After obtaining the first intermediate parameters, use the first intermediate parameters and the probability distribution model to obtain the second intermediate parameters.
After obtaining the first intermediate parameters a and b, the probability distribution model is first used to obtain the probability density function of the second intermediate parameters. The probability distribution model can be, but is not limited to, the Kumaraswamy distribution.
Taking the Kumaraswamy distribution as an example, the following formula (4) gives the probability density of the second intermediate parameters v:

qψ(vk | w1:N) = ak bk vk^(ak−1) (1 − vk^ak)^(bk−1)  (4)

where v is a vector, qψ(v | w1:N) denotes the probability density function of the vector v, w1:N denotes the bag-of-words model of the document sample with N the number of words in the sample, and a and b are the first intermediate parameters.
After the probability density function of the second intermediate parameters is obtained, the second intermediate parameters are sampled from it. Specifically, since the probability density function describes the probability distribution of the second intermediate parameters, they can be obtained from it by random sampling. This process can be expressed as the following formula (5):
v ~ qψ(v | w1:N)  (5)
Multiple second intermediate parameters are obtained by sampling; their number is the maximum number of topics in the model minus 1, i.e., K−1. These K−1 second intermediate parameters can form a (K−1)-dimensional vector v = [v1, v2, …, vK−1].
S303: Input the second intermediate parameters into the stick-breaking algorithm to obtain the topic weight values of the document sample.
A further second intermediate parameter vK = 1 is appended to the (K−1)-dimensional second intermediate parameters to obtain the K-dimensional vector v = [v1, v2, …, vK]; the K-dimensional vector is then input into the formula of the stick-breaking algorithm to obtain the topic weight values π of the document sample. This process can be expressed as the following formula (6):

π = StickBreaking(v)  (6)

The stick-breaking algorithm can be described in more detail as the following formula (7):

πk = vk · (1 − v1)(1 − v2)⋯(1 − vk−1), k = 1, …, K  (7)

The π obtained from formulas (6) and (7) is the topic weight values of the input document.
S304: Input the topic weight values of the document sample into the document generation model to obtain the occurrence probability of each word in the document sample, and calculate the generating probability of the document sample from the occurrence probabilities.
Specifically, the document generation model is the model represented by formulas (1) and (2) above, in which the parameter φ takes given values and the number of topics is replaced by K instead of infinity. The parameter φ is a multidimensional vector whose dimension equals the number of words in the preset vocabulary.
At the first round of training, the value of the parameter φ is an initialization, and φ can be initialized in various ways. One way is to initialize every element of φ to 0; another is to sample from a Gaussian distribution (e.g., with mean 0 and variance 0.05) and use the sampled values as the elements of φ. Of course, other initializations are possible, such as sampling from another distribution like the uniform distribution. Note that the early stage of training proceeds after φ is initialized, and during training the value of φ is adjusted according to the training results.
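The two initialization routes mentioned above can be sketched as follows; the topic cap K and vocabulary size V are assumed values, and the Gaussian uses the mean 0 and variance 0.05 given as an example in the text.

```python
import random

random.seed(1)
K, V = 10, 2000   # assumed topic cap and vocabulary size

# First way: every element of phi initialized to 0.
phi_zeros = [[0.0] * V for _ in range(K)]

# Second way: draws from a Gaussian with mean 0 and variance 0.05
# (standard deviation sqrt(0.05)).
std = 0.05 ** 0.5
phi_gauss = [[random.gauss(0.0, std) for _ in range(V)] for _ in range(K)]

flat = [x for row in phi_gauss for x in row]
var = sum(x * x for x in flat) / len(flat)
print(round(var, 4))   # empirical variance, close to 0.05
```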
Once the value of φ is set, inputting the value of π into formula (1) yields the per-word probability distribution, which is then input into formula (2) to obtain the probability of each word appearing in the document sample. After the probability of each word is obtained, multiplying the probabilities of the words gives the generating probability of the document sample.
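The computation just described can be sketched as follows, working in the log domain so that the product of many per-word probabilities does not underflow. The sizes, the random Φ, the fixed π, and the toy word counts are illustrative assumptions, and the multinomial coefficient is omitted since it does not depend on the parameters.

```python
import math
import random

random.seed(7)

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def doc_log_prob(counts, Phi, pi):
    """log of the document generating probability under formulas (1)-(2):
    per-word probabilities come from softmax(Phi^T pi); multiplying one
    probability factor per word becomes a sum of count * log(prob)."""
    V = len(counts)
    scores = [sum(Phi[k][j] * pi[k] for k in range(len(pi))) for j in range(V)]
    word_probs = softmax(scores)                          # formula (1)
    return sum(c * math.log(p) for c, p in zip(counts, word_probs))

K, V = 4, 6
Phi = [[random.gauss(0.0, 1.0) for _ in range(V)] for _ in range(K)]
pi = [0.7, 0.2, 0.05, 0.05]
counts = [3, 0, 1, 2, 0, 4]
lp = doc_log_prob(counts, Phi, pi)
print(round(lp, 3))
```

The higher this log-probability, the better the current parameters explain the document sample, which is exactly the quantity the training objective below pushes upward.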
It should be noted that after the stick-breaking method obtains the topic weight value of every topic of the document sample, the distribution of topic weight values is input into the document generation model. The meaning of stick-breaking is that a stick can be broken into countless segments, so the weights of countless topics are available, and a suitable number of topics can subsequently be selected from among them.
It should be noted that the generating probability obtained above is based on the document generation model and the neural network model. Different parameters of the document generation model and the neural network yield different generating probabilities; the higher the generating probability of a document, the better the document generation model and the neural network explain the document, and the more accurate the determination of its topics. Therefore, in order to accurately determine the topic weights of documents in subsequent document analysis, accurate values of the parameters in the neural network must be determined.
To judge the accuracy of the parameters in the neural network and in the document generation model, an objective function can be constructed. Step S304 yields the generating probability of a document sample, and it will be appreciated that the larger the generating probability, the more accurate the parameters of the document generation model and the neural network. Therefore, after a lower bound on the generating probability is obtained, the parameters can be adjusted continually to increase it, and when its value meets a certain condition, the corresponding parameters are taken as the trained parameters. An objective function can thus be set that expresses the relationship between the topic-model parameters (the parameters of the neural network and of the document generation model) and the document generating probability.
S305: Judge whether the generating probability of the document sample satisfies the optimization target corresponding to the objective function; if not, perform step S306; if so, perform step S307.
Whether the parameters of the document generation model and the neural network are accurate can be judged through the optimization target corresponding to the objective function. Specifically, after the generating probability of a document sample is obtained, it can be substituted into the objective function represented by the following formula (8), and whether the objective function meets its optimization target verifies whether the parameters in the document generation model and the neural network are accurate:

L(w1:N | Φ, ψ) = E qψ(v | w1:N)[log p(w1:N | π, Φ)] − KL(qψ(v | w1:N) ‖ p(v | α))  (8)
where L(w1:N | Φ, ψ) denotes the objective function; w1:N denotes the bag-of-words model of the document sample, with N the number of words in the sample; Φ denotes the parameters of the document generation model; ψ denotes the parameters of the neural network; E denotes expectation; qψ(v | w1:N) denotes the probability density function of the vector v; p(w1:N | π, Φ) denotes the generating probability of the document sample obtained from the document generation model and the neural network; and KL is the relative entropy, which expresses the distance between two probability distributions (a probability distribution here being a probability density function), namely the distance between the probability density function qψ(v | w1:N) and the probability density function p(v | α), where α is the concentration parameter of the GEM distribution.
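Both terms of formula (8) can be estimated by straightforward Monte Carlo under stated assumptions: the sizes, the fixed (a, b) standing in for the encoder output, α, and the toy word counts below are all illustrative, the prior is taken as independent Beta(1, α) factors over the K−1 free components of v, and the KL term is estimated by sampling rather than in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, S = 4, 6, 2000     # topic cap, vocabulary size, Monte Carlo samples
alpha = 1.5              # concentration parameter of the Beta(1, alpha) prior

a = rng.uniform(0.8, 2.0, K - 1)   # stand-ins for the encoder outputs a, b
b = rng.uniform(0.8, 2.0, K - 1)
Phi = rng.normal(size=(K, V))      # document generation model parameters
counts = np.array([3.0, 0.0, 1.0, 2.0, 0.0, 4.0])   # toy bag of words

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Sample v ~ q_psi(v | w) via the Kumaraswamy inverse CDF (formula (5)).
u = rng.uniform(1e-9, 1.0 - 1e-9, (S, K - 1))
v = (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)

# Stick-breaking (formulas (6)-(7)) with v_K = 1 appended.
v_full = np.concatenate([v, np.ones((S, 1))], axis=1)
rest = np.concatenate([np.ones((S, 1)),
                       np.cumprod(1.0 - v_full[:, :-1], axis=1)], axis=1)
pi = v_full * rest

# First right-hand term of formula (8): E_q[log p(w | pi, Phi)].
recon = float(np.mean([np.sum(counts * np.log(softmax(Phi.T @ p))) for p in pi]))

# Second term: KL(q || p) estimated by Monte Carlo as E_q[log q(v) - log p(v)].
log_q = np.sum(np.log(a * b) + (a - 1.0) * np.log(v)
               + (b - 1.0) * np.log1p(-v ** a), axis=1)
log_p = np.sum(np.log(alpha) + (alpha - 1.0) * np.log1p(-v), axis=1)
kl = float(np.mean(log_q - log_p))

elbo = recon - kl
print(round(recon, 2), round(kl, 3), round(elbo, 2))
```

In an actual trainer, a and b would come from the neural network, and a closed-form or low-variance approximation of the Kumaraswamy-to-Beta KL term would typically replace the plain Monte Carlo estimate used here.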
It should be noted that the negative of the expectation term, −E qψ(v | w1:N)[log p(w1:N | π, Φ)], is the negative log-likelihood, which specifically represents the reconstruction error. The probability density function p(v | α) may be called the prior probability and qψ(v | w1:N) the posterior probability, and the KL divergence (Kullback-Leibler divergence) between the posterior qψ(v | w1:N) and the prior p(v | α) may also be called the regularization term. Regarding the prior p(v | α): it is related to the GEM distribution, whose parameter is α, and p(v | α) is the product of K probabilities sampled from the Beta(1, α) distribution. It should be noted that the objective function need not include the regularization term; the regularization term further improves the accuracy with which the objective function computes the topic-model parameters.
The right side of the objective function formula contains two terms: the first term represents the reconstruction error, and the second term represents the distance between the probability distribution output by the neural network and the prior distribution. From formula (8) it can be seen that the topic model parameters Φ and ψ on the left side of the formula affect the values of the two terms on the right side. The larger the value obtained by subtracting the two terms on the right side, the more accurate the topic model parameters. The objective function has a corresponding optimization target, namely that the difference of the two terms on the right side can no longer be made larger. If this target is not met, i.e. the difference can still increase, the neural network still needs to be trained: step S306 is executed to adjust the topic model parameters, and the process returns to step S301 to train the neural network again with a new document sample; if the optimization target is met, step S307 is executed. Alternatively, the optimization target may be a preset number of training iterations: if the number of iterations reaches the preset number, step S307 is executed; if it does not, step S306 is executed to adjust the topic model parameters, and the process returns to step S301 to train the neural network again with a new document sample.
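The stopping rule just described — keep training while the objective can still increase, or stop once a preset iteration count is reached — can be sketched as follows. This is a toy illustration, not the patent's actual model; the names and the example objective are made up:

```python
def train_until_converged(step_fn, max_steps=100, tol=1e-6):
    """Repeat parameter updates while the objective keeps increasing;
    stop when the improvement falls below tol or max_steps is reached."""
    best = float("-inf")
    for step in range(max_steps):
        value = step_fn(step)
        if value - best < tol:  # the objective can no longer become larger
            return step, best
        best = value
    return max_steps, best

# Toy objective that rises and then plateaus.
steps, value = train_until_converged(lambda s: 1.0 - 0.5 ** s)
print(value > 0.99)  # True
```

Either exit condition plays the role of the patent's "optimization target": convergence of the objective, or an iteration budget.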
S306: adjust the parameters of the neural network and the parameters of the document generation model, and return to step S301.
Here, the topic model parameters may be adjusted by the back-propagation algorithm, which can compute how the topic model parameters should change so that the difference of the two terms on the right side of the formula becomes larger. The topic model parameters are then adjusted in the direction that makes this difference larger; in other words, the topic model parameters are adjusted so that the value obtained by subtracting the two terms on the right side of the formula increases. It should be noted that the topic model includes both the neural network and the document generation model, so adjusting the topic model parameters means adjusting the parameters of the neural network and the parameters of the document generation model.
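The parameter update in step S306 — moving each parameter in the direction that makes the objective larger — is gradient ascent. A one-parameter numeric sketch with a toy objective (not the patent's model; names and values are illustrative):

```python
def gradient_ascent_step(objective, theta, lr=0.1, eps=1e-6):
    """Nudge theta in the direction that increases the objective,
    using a central finite-difference estimate of the gradient."""
    grad = (objective(theta + eps) - objective(theta - eps)) / (2 * eps)
    return theta + lr * grad

theta = 0.0
for _ in range(200):
    theta = gradient_ascent_step(lambda t: -(t - 3.0) ** 2, theta)
print(round(theta, 2))  # approaches the maximizer 3.0
```

In practice the gradient would be computed by back-propagation through the neural network rather than by finite differences, but the direction of the update is the same.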
S307: determine the neural network model from the neural network parameters that satisfy the objective function, and determine the document generation model from the document generation parameters that satisfy the objective function.
Here, if after a certain adjustment of the topic model parameters the value of the right side of the objective function formula no longer becomes larger, training is stopped, and the neural network parameters and document generation parameters at that point are recorded. The neural network parameters are used to construct the neural network model, and the document generation parameters are used to construct the document generation model. The trained neural network model can then be used in the document topic determination process, to determine the weights of document topics according to the procedure shown in FIG. 1.
The above training process can be illustrated with FIG. 4. As shown in FIG. 4, a document sample is input into the neural network model to obtain K first intermediate parameters v; from each first intermediate parameter v the corresponding topic weight π is obtained through the probability distribution model. The K topic weights π serve as the input of the document generation model, from which the probability distribution of each word in the vocabulary is obtained; the generating probability of the document sample can then be computed from this probability distribution. Finally, the objective function is used to adjust the parameters of the neural network and of the document generation model so that the generating probability of the document sample is maximized, and the neural network parameters and document generation parameters corresponding to the maximum generating probability are recorded. The neural network parameters are used to determine the neural network model, and the document generation parameters are used to determine the document generation model.
The training process of the present application allows the topic model to automatically select the number of topics according to the input documents, so the topic model can determine the number of topics adaptively and more accurately, and its performance is better. In addition, by introducing a neural network, the present application makes the optimization process of the topic model more flexible.
To facilitate understanding of the technical solution provided by the present application, a specific example is given below.
Suppose the words contained in a training document are {A, A, B, B, B, C, D, E, E}, and a preset dictionary contains six words in total: A, B, C, D, E, F. According to the number of times each dictionary word occurs in the document, the document can be expressed as the vector [2, 3, 1, 1, 2, 0], where the value at the i-th position of the vector indicates how many times the word at the i-th position of the dictionary occurs in the document. For example, the value 2 at the first position of the vector indicates that the word A, at the first position of the dictionary, occurs twice in the document.
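The bag-of-words vector in this example can be reproduced with a short sketch (the function name is illustrative, not from the patent):

```python
def bag_of_words(document, dictionary):
    """Count how often each dictionary word occurs in the document."""
    return [document.count(word) for word in dictionary]

vec = bag_of_words(["A", "A", "B", "B", "B", "C", "D", "E", "E"],
                   ["A", "B", "C", "D", "E", "F"])
print(vec)  # [2, 3, 1, 1, 2, 0]
```

The word F never occurs, so its position holds 0, matching the example vector above.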
The vector is then input into the neural network model, whose parameter values are initially set at random. After processing by this initially configured neural network model, a series of first intermediate parameters a_1, ..., a_{K-1}, b_1, ..., b_{K-1} is obtained. These first intermediate parameters are input into the probability distribution model to obtain the second intermediate parameters v_i; the second intermediate parameters v_i are input into the folding rod (stick-breaking) algorithm to obtain the topic weight values π_i of the document sample; and the topic weight values π_i of the document sample are input into the document generation model containing the topic parameters φ_i, yielding the probability distribution of the document under the current neural network model parameters and topic parameters.
The generating probability of the document is obtained from that probability distribution, and the objective function corresponding to formula (8), together with its optimization target, is obtained. Whether the parameters of the current neural network model and the topic parameters are accurate is judged by whether the generating probability of the document meets the optimization target corresponding to the objective function. If they are not accurate, the parameters of the neural network model and the topic parameters need to be optimized using gradient descent. The optimization target may be a condition on the number of training iterations, or the condition that the value of the objective function no longer decreases noticeably. If the optimization target is met, training stops, and the obtained values of the neural network model parameters and the topic parameters are used to determine the neural network model and the document generation model.
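The generating probability computed from the topic weights and the per-topic word distributions can be sketched as a mixture over topics. This is one illustrative reading of the patent's document generation model, with made-up numbers:

```python
import math

def log_generating_probability(bow, pi, phi):
    """log p(document) under a mixture-of-topics word model:
    p(word w) = sum_k pi[k] * phi[k][w]; the document log-probability
    is the count-weighted sum of per-word log-probabilities."""
    log_p = 0.0
    for w, count in enumerate(bow):
        if count == 0:
            continue
        p_word = sum(pi[k] * phi[k][w] for k in range(len(pi)))
        log_p += count * math.log(p_word)
    return log_p

# Two hypothetical topics over a three-word vocabulary.
phi = [[0.7, 0.2, 0.1],
       [0.1, 0.3, 0.6]]
pi = [0.5, 0.5]
print(log_generating_probability([2, 1, 0], pi, phi))  # about -3.22
```

During training this quantity is what the objective's reconstruction term pushes upward; at inference time only the topic weights π are needed.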
After the neural network model and the document generation model are obtained, the topic weights of an unknown document can be determined. For example, suppose the vector of an unknown document is expressed as [1, 1, 2, 3, 2, 0]. The vector is input into the trained neural network model to obtain the first intermediate parameters; these first intermediate parameters are input into the probability distribution model to obtain the second intermediate parameters; finally, the second intermediate parameters v_i are input into the folding rod algorithm to obtain the topic weight values of the document.
The present application also provides a document topic determining device. As shown in FIG. 5, the device specifically includes: a document and model obtaining unit 501, a first intermediate parameters obtaining unit 502, a probability density function obtaining unit 503, a second intermediate parameters obtaining unit 504, and a topic weights determination unit 505.
The document and model obtaining unit 501 is configured to obtain a destination document and a neural network model, the neural network model being used to obtain a preset quantity of first intermediate parameters under the restriction of model parameters;
the first intermediate parameters obtaining unit 502 is configured to input the destination document into the neural network model to obtain first intermediate parameters;
the probability density function obtaining unit 503 is configured to input the first intermediate parameters into a probability distribution model to obtain a probability density function of second intermediate parameters;
the second intermediate parameters obtaining unit 504 is configured to sample target second intermediate parameters from the probability density function of the second intermediate parameters;
the topic weights determination unit 505 is configured to input the target second intermediate parameters into the folding rod algorithm to obtain the topic weights of the destination document.
See FIG. 6; the present application also provides another document topic determining device. As shown in FIG. 6, on the basis of the device shown in FIG. 5, the document topic determining device may further include a neural network model training unit 506.
The neural network model training unit 506 is configured to train the neural network model; specifically, the neural network model training unit 506 is configured to:
input a document sample into the neural network model to obtain first intermediate parameters of the neural network model;
input the first intermediate parameters of the neural network model into the probability distribution model to obtain second intermediate parameters of the neural network model;
input the second intermediate parameters of the neural network model into the folding rod algorithm to obtain topic weight values of the document sample;
input the topic weight values of the document sample into the document generation model to obtain the occurrence probability of each word in the document sample, and calculate the generating probability of the document sample according to the occurrence probabilities;
judge whether the generating probability of the document sample meets the optimization target corresponding to the objective function;
if the generating probability of the document sample does not meet the optimization target, adjust the model parameters of the neural network model and the parameters of the document generation model, and return to inputting a new document sample into the neural network model with adjusted model parameters to obtain first intermediate parameters of the neural network model with adjusted model parameters;
if the generating probability of the document sample meets the optimization target, take the neural network model with adjusted model parameters as the trained neural network model.
In one example, the formula corresponding to the objective function is:

L(W_{1:N} | Φ, ψ) = E_{q_ψ(v|w_{1:N})}[log p(W_{1:N} | π, Φ)] − KL(q_ψ(v | w_{1:N}) ‖ p(v | α))

where L(W_{1:N} | Φ, ψ) denotes the objective function; W_{1:N} denotes the bag of words of the document sample, N being the number of words in the sample; Φ denotes the parameters of the document generation model; ψ denotes the parameters of the neural network; E denotes expectation; q_ψ(v | w_{1:N}) denotes the probability density function of the vector v; p(W_{1:N} | π, Φ) denotes the generating probability of the document sample obtained from the document generation model and the neural network; KL is the relative entropy, representing the distance between the probability density function q_ψ(v | w_{1:N}) and the probability density function p(v | α), α being the concentration parameter of the GEM distribution.
Correspondingly, the goal satisfaction judgment sub-unit is specifically configured to: judge whether the value obtained after the generating probability of the document sample is input into the formula corresponding to the objective function meets a preset training termination condition.
In one example, when adjusting the model parameters of the neural network model and the parameters of the document generation model, the neural network model training unit is specifically configured to use the back-propagation algorithm to adjust the model parameters of the neural network model and the parameters of the document generation model.
In one example, when inputting the target second intermediate parameters into the folding rod algorithm to obtain the topic weights of the destination document, the topic weights determination unit is specifically configured to: input the target second intermediate parameters into the folding rod algorithm to obtain each alternative topic weight; and take the topic weights meeting a preset condition as the topic weights of the destination document.
See FIG. 7, a schematic diagram of the hardware structure of a document topic determination device provided by the present application. As shown in FIG. 7, the device may include: a processor 701, a memory 702, and a communication bus 703, wherein the processor 701 and the memory 702 communicate with each other through the communication bus 703.
The processor 701 is configured to execute a program; the program may include program code, and the program code includes operation instructions for the processor. Specifically, the program may be used to:
obtain a destination document and a neural network model, the neural network model being used to obtain a preset quantity of first intermediate parameters under the restriction of model parameters;
input the destination document into the neural network model to obtain first intermediate parameters;
input the first intermediate parameters into a probability distribution model to obtain a probability density function of second intermediate parameters;
sample target second intermediate parameters from the probability density function of the second intermediate parameters;
input the target second intermediate parameters into the folding rod algorithm to obtain topic weights of the destination document.
The processor 701 may be a central processing unit (CPU) or an application-specific integrated circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application.
The memory 702 is configured to store the program; the memory 702 may include a high-speed RAM memory, and may also include a non-volatile memory, for example at least one disk storage.
The present application also provides a readable storage medium storing a plurality of instructions; the instructions are adapted to be loaded by a processor to execute the steps of the document topic determination method described above.
It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may refer to one another.
It should also be noted that, herein, relational terms such as first and second are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. In the absence of further restrictions, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes the element.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (12)
1. A method for determining a document topic, characterized by comprising:
obtaining a destination document and a neural network model, the neural network model being used to obtain a preset quantity of first intermediate parameters under the restriction of model parameters;
inputting the destination document into the neural network model to obtain first intermediate parameters;
inputting the first intermediate parameters into a probability distribution model to obtain a probability density function of second intermediate parameters;
sampling target second intermediate parameters from the probability density function of the second intermediate parameters;
inputting the target second intermediate parameters into the folding rod algorithm to obtain topic weights of the destination document.
2. The method for determining a document topic according to claim 1, characterized in that the training process of the neural network model comprises:
inputting a document sample into the neural network model to obtain first intermediate parameters of the neural network model;
inputting the first intermediate parameters of the neural network model into the probability distribution model to obtain second intermediate parameters of the neural network model;
inputting the second intermediate parameters of the neural network model into the folding rod algorithm to obtain topic weight values of the document sample;
inputting the topic weight values of the document sample into a document generation model to obtain the occurrence probability of each word in the document sample, and calculating the generating probability of the document sample according to the occurrence probabilities;
judging whether the generating probability of the document sample meets an optimization target corresponding to an objective function;
if the generating probability of the document sample does not meet the optimization target, adjusting the model parameters of the neural network model and the parameters of the document generation model, and returning to inputting a new document sample into the neural network model with adjusted model parameters to obtain first intermediate parameters of the neural network model with adjusted model parameters;
if the generating probability of the document sample meets the optimization target, taking the neural network model with adjusted model parameters as the trained neural network model.
3. The method for determining a document topic according to claim 2, characterized in that the formula corresponding to the objective function is:

L(W_{1:N} | Φ, ψ) = E_{q_ψ(v|w_{1:N})}[log p(W_{1:N} | π, Φ)] − KL(q_ψ(v | w_{1:N}) ‖ p(v | α))

where L(W_{1:N} | Φ, ψ) denotes the objective function; W_{1:N} denotes the bag of words of the document sample, N being the number of words in the document sample; Φ denotes the parameters of the document generation model; ψ denotes the parameters of the neural network; E denotes expectation; q_ψ(v | w_{1:N}) denotes the probability density function of the vector v; p(W_{1:N} | π, Φ) denotes the generating probability of the document sample obtained from the document generation model and the neural network; KL is the relative entropy, representing the distance between the probability density function q_ψ(v | w_{1:N}) and the probability density function p(v | α), α being the concentration parameter of the GEM distribution;
correspondingly, judging whether the generating probability of the document sample meets the optimization target comprises:
judging whether the value obtained after the generating probability of the document sample is input into the formula corresponding to the objective function meets a preset training termination condition.
4. The method for determining a document topic according to claim 3, characterized in that adjusting the model parameters of the neural network model and the parameters of the document generation model comprises:
using the back-propagation algorithm to adjust the model parameters of the neural network model and the parameters of the document generation model.
5. The method for determining a document topic according to claim 1, characterized in that inputting the target second intermediate parameters into the folding rod algorithm to obtain the topic weights of the destination document comprises:
inputting the target second intermediate parameters into the folding rod algorithm to obtain each alternative topic weight;
taking the topic weights meeting a preset condition as the topic weights of the destination document.
6. A device for determining a document topic, characterized by comprising:
a document and model obtaining unit, configured to obtain a destination document and a neural network model, the neural network model being used to obtain a preset quantity of first intermediate parameters under the restriction of model parameters;
a first intermediate parameters obtaining unit, configured to input the destination document into the neural network model to obtain first intermediate parameters;
a probability density function obtaining unit, configured to input the first intermediate parameters into a probability distribution model to obtain a probability density function of second intermediate parameters;
a second intermediate parameters obtaining unit, configured to sample target second intermediate parameters from the probability density function of the second intermediate parameters;
a topic weights determination unit, configured to input the target second intermediate parameters into the folding rod algorithm to obtain the topic weights of the destination document.
7. The device for determining a document topic according to claim 6, characterized by further comprising:
a neural network model training unit, configured to train the neural network model;
wherein the neural network model training unit is specifically configured to:
input a document sample into the neural network model to obtain first intermediate parameters of the neural network model;
input the first intermediate parameters of the neural network model into the probability distribution model to obtain second intermediate parameters of the neural network model;
input the second intermediate parameters of the neural network model into the folding rod algorithm to obtain topic weight values of the document sample;
input the topic weight values of the document sample into a document generation model to obtain the occurrence probability of each word in the document sample, and calculate the generating probability of the document sample according to the occurrence probabilities;
judge whether the generating probability of the document sample meets an optimization target corresponding to an objective function;
if the generating probability of the document sample does not meet the optimization target, adjust the model parameters of the neural network model and the parameters of the document generation model, and return to inputting a new document sample into the neural network model with adjusted model parameters to obtain first intermediate parameters of the neural network model with adjusted model parameters;
if the generating probability of the document sample meets the optimization target, take the neural network model with adjusted model parameters as the trained neural network model.
8. The device for determining a document topic according to claim 7, characterized in that the formula corresponding to the objective function is:

L(W_{1:N} | Φ, ψ) = E_{q_ψ(v|w_{1:N})}[log p(W_{1:N} | π, Φ)] − KL(q_ψ(v | w_{1:N}) ‖ p(v | α))

where L(W_{1:N} | Φ, ψ) denotes the objective function; W_{1:N} denotes the bag of words of the document sample, N being the number of words in the document sample; Φ denotes the parameters of the document generation model; ψ denotes the parameters of the neural network; E denotes expectation; q_ψ(v | w_{1:N}) denotes the probability density function of the vector v; p(W_{1:N} | π, Φ) denotes the generating probability of the document sample obtained from the document generation model and the neural network; KL is the relative entropy, representing the distance between the probability density function q_ψ(v | w_{1:N}) and the probability density function p(v | α), α being the concentration parameter of the GEM distribution;
correspondingly, the goal satisfaction judgment sub-unit is specifically configured to: judge whether the value obtained after the generating probability of the document sample is input into the formula corresponding to the objective function meets a preset training termination condition.
9. The device for determining a document topic according to claim 8, characterized in that, when adjusting the model parameters of the neural network model and the parameters of the document generation model, the neural network model training unit is specifically configured to:
use the back-propagation algorithm to adjust the model parameters of the neural network model and the parameters of the document generation model.
10. The device for determining a document topic according to claim 6, characterized in that, when inputting the target second intermediate parameters into the folding rod algorithm to obtain the topic weights of the destination document, the topic weights determination unit is specifically configured to:
input the target second intermediate parameters into the folding rod algorithm to obtain each alternative topic weight; and take the topic weights meeting a preset condition as the topic weights of the destination document.
11. A document topic determination device, characterized by comprising: a memory and a processor; the processor executes at least the following steps by running a software program stored in the memory and calling data stored in the memory:
obtaining a destination document and a neural network model, the neural network model being used to obtain a preset quantity of first intermediate parameters under the restriction of model parameters;
inputting the destination document into the neural network model to obtain first intermediate parameters;
inputting the first intermediate parameters into a probability distribution model to obtain a probability density function of second intermediate parameters;
sampling target second intermediate parameters from the probability density function of the second intermediate parameters;
inputting the target second intermediate parameters into the folding rod algorithm to obtain topic weights of the destination document.
12. A readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, the method for determining a document topic according to any one of claims 1 to 5 is executed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810350016.3A CN110390092A (en) | 2018-04-18 | 2018-04-18 | Document subject matter determines method and relevant device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110390092A true CN110390092A (en) | 2019-10-29 |
Family
ID=68283322
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810350016.3A Pending CN110390092A (en) | 2018-04-18 | 2018-04-18 | Document subject matter determines method and relevant device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110390092A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103942340A (en) * | 2014-05-09 | 2014-07-23 | 电子科技大学 | Microblog user interest recognizing method based on text mining |
CN105868178A (en) * | 2016-03-28 | 2016-08-17 | 浙江大学 | Multi-document automatic abstract generation method based on phrase subject modeling |
CN107193892A (en) * | 2017-05-02 | 2017-09-22 | 东软集团股份有限公司 | A kind of document subject matter determines method and device |
CN107423439A (en) * | 2017-08-04 | 2017-12-01 | 逸途(北京)科技有限公司 | A kind of Chinese charater problem mapping method based on LDA |
Non-Patent Citations (1)
Title |
---|
XUEFEI NING等: "A BAYESIAN NONPARAMETRIC TOPIC MODEL WITH VARIATIONAL AUTO-ENCODERS", 《HTTPS://OPENREVIEW.NET/FORUM?ID=SKXQZNGC-》 * |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| TA01 | Transfer of patent application right | ||
Effective date of registration: 20221202 Address after: 1402, Floor 14, Block A, Haina Baichuan Headquarters Building, No. 6, Baoxing Road, Haibin Community, Xin'an Street, Bao'an District, Shenzhen, Guangdong 518000 Applicant after: Shenzhen Yayue Technology Co.,Ltd. Address before: 518000 Tencent Building, No. 1 High-tech Zone, Nanshan District, Shenzhen City, Guangdong Province, 35 Floors Applicant before: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd. |