CN110390092A - Document topic determination method and related device - Google Patents

Document topic determination method and related device

Info

Publication number
CN110390092A
Authority
CN
China
Prior art keywords
document
neural network
intermediate parameters
network model
parameter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810350016.3A
Other languages
Chinese (zh)
Inventor
郑胤
黄俊洲
Current Assignee
Shenzhen Yayue Technology Co., Ltd.
Original Assignee
Tencent Technology (Shenzhen) Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Priority to CN201810350016.3A
Publication of CN110390092A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/205 - Parsing
    • G06F 40/216 - Parsing using statistical methods

Abstract

The present application provides a method for determining the topics of a document. In the method, a target document is input into a pre-trained neural network model to obtain first intermediate parameters; the first intermediate parameters are passed through a probability distribution model to obtain second intermediate parameters; and the second intermediate parameters are then fed into a stick-breaking algorithm to obtain the topic weights of the target document. The present application also provides a document topic determination apparatus and a storage medium, to guarantee the application and implementation of the method in practice.

Description

Document topic determination method and related device
Technical field
The present application relates to the field of text analysis, and more specifically to a document topic determination method and related devices.
Background art
A document is a carrier of information. The main content of the information it carries can be determined by analysis, and this main content may be called the topic of the document. The topics of a document are reflected in the frequencies of the words that make up the document. For example, if a document discusses the economy, its topic may be determined to be "economy", and words such as "currency", "finance", "cost" and "income" will appear with high frequency. As another example, if a document discusses a war, its topic may be determined to be "war", and words such as "weapon", "destruction", "aircraft" and "tank" will likewise appear frequently.
The identified topics are of great significance for document analysis and related tasks, so a technical solution is needed for determining the topics contained in a document.
Summary of the invention
In view of this, the present application provides a document topic determination method for determining the topics contained in a document.
To achieve this object, the technical solutions provided by the present application are as follows:
In a first aspect, the present application provides a method for determining the topics of a document, comprising:
obtaining a target document and a neural network model, the neural network model being configured to obtain a preset number of first intermediate parameters under the constraint of its model parameters;
inputting the target document into the neural network model to obtain the first intermediate parameters;
inputting the first intermediate parameters into a probability distribution model to obtain a probability density function of second intermediate parameters;
sampling target second intermediate parameters from the probability density function of the second intermediate parameters; and
inputting the target second intermediate parameters into a stick-breaking algorithm to obtain the topic weights of the target document.
In a second aspect, the present application provides an apparatus for determining the topics of a document, comprising:
a document and model obtaining unit, configured to obtain a target document and a neural network model, the neural network model being configured to obtain a preset number of first intermediate parameters under the constraint of its model parameters;
a first intermediate parameter obtaining unit, configured to input the target document into the neural network model to obtain first intermediate parameters;
a probability density function obtaining unit, configured to input the first intermediate parameters into a probability distribution model to obtain a probability density function of second intermediate parameters;
a second intermediate parameter obtaining unit, configured to sample target second intermediate parameters from the probability density function of the second intermediate parameters; and
a topic weight determination unit, configured to input the target second intermediate parameters into a stick-breaking algorithm to obtain the topic weights of the target document.
In a third aspect, the present application provides a document topic determination device, comprising a memory and a processor, the processor executing a software program stored in the memory and calling data stored in the memory to perform at least the following steps:
obtaining a target document and a neural network model, the neural network model being configured to obtain a preset number of first intermediate parameters under the constraint of its model parameters;
inputting the target document into the neural network model to obtain the first intermediate parameters;
inputting the first intermediate parameters into a probability distribution model to obtain a probability density function of second intermediate parameters;
sampling target second intermediate parameters from the probability density function of the second intermediate parameters; and
inputting the target second intermediate parameters into a stick-breaking algorithm to obtain the topic weights of the target document.
In a fourth aspect, the present application provides a readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the above document topic determination method is performed.
It can be seen from the above technical solutions that the present application provides a method for determining the topics of a document. The method inputs a target document into a pre-trained neural network model to obtain first intermediate parameters; the first intermediate parameters yield second intermediate parameters through a probability distribution model; and the second intermediate parameters then yield the topic weights of the target document through a stick-breaking algorithm.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a document topic determination method provided by the present application;
Fig. 2 is a schematic structural diagram of a neural network provided by the present application;
Fig. 3 is a flowchart of training a topic model provided by the present application;
Fig. 4 is another flowchart of training a topic model provided by the present application;
Fig. 5 is a schematic structural diagram of a document topic determination apparatus provided by the present application;
Fig. 6 is another schematic structural diagram of a document topic determination apparatus provided by the present application;
Fig. 7 is a hardware structure diagram of a document topic determination device provided by the present application.
Detailed description of embodiments
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present application.
A document can take many forms, such as a news item or an article. The topics of a document can be used to represent the semantic features the document expresses. A topic model can be used to determine the topics of a document; a topic model is a mathematical statistical model that can discover abstract topics in documents.
Normally, the topics of a document are not necessarily unique. Existing topic models have a fixed topic number set internally and must analyze every input document according to that number: no matter what the input document is, the number of output topics is fixed, and the quality of the resulting topics is low.
To address this, the present application provides a method for determining document topics that adaptively learns the topics a document contains: the number of topics is not limited in advance but is learned from the content of the document itself.
See Fig. 1, which shows the flow of a document topic determination method, comprising steps S101 to S105.
S101: obtain a target document and a pre-trained neural network model.
The target document, i.e. the document to be analyzed, can be any document containing multiple words. The neural network model (which may simply be called the neural network) is trained in advance and contains model parameters (which may simply be called parameters); the values of the model parameters are obtained by training on document samples.
The neural network in the present application can be a multi-layer fully connected neural network, or a neural network of another form, such as a multi-layer neural network, a single-layer neural network or a convolutional neural network. The number of neurons in each layer of the neural network may be the same or different. Taking a multi-layer fully connected neural network as an example, one structure is shown in Fig. 2: each circle in the network represents a neuron, and each arrow connecting two circles represents the connection weight between the neurons.
S102: input the target document into the neural network model to obtain the first intermediate parameters.
The neural network model also contains a parameter K, which denotes the maximum number of topics the model can output. It should be noted that this maximum is a relatively infinite value, i.e. large compared with the number of topics a document is believed to actually contain. For example, if a document is believed to contain no more than 10 topics, then 100 times that can be taken as the relatively infinite value, so the topic number can be set to 1000. Of course, the maximum topic number can also be set to other values; it is not limited here.
The first intermediate parameters are the outputs of the neural network model, and the number of outputs of each kind of first intermediate parameter is K − 1. For example, the neural network model can be denoted g(w_{1:N}; ψ), where ψ denotes the parameters of the neural network model; inputting a target document w_{1:N} containing N words into the model yields K − 1 first intermediate parameters.
It should be noted that there may be several kinds of first intermediate parameters, each kind with K − 1 outputs. The kinds of first intermediate parameters depend on the structure of the neural network model; different neural network models may output different kinds of first intermediate parameters.
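For concreteness, a minimal sketch of such a network follows, assuming a bag-of-words input, one hidden layer, and a softplus output so that the two groups of K − 1 first intermediate parameters (named a and b here, anticipating formula (3) in the training section below) are strictly positive. The layer sizes, the activation and all names are illustrative assumptions, not prescribed by the application.

```python
import numpy as np

def softplus(x):
    # smooth positive transform; keeps the outputs strictly positive
    return np.log1p(np.exp(x))

class EncoderMLP:
    """Sketch of the inference network g(w_{1:N}; psi): bag-of-words in,
    two groups of K-1 positive first intermediate parameters out."""

    def __init__(self, vocab_size, hidden_size, K, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.05, (vocab_size, hidden_size))
        self.b1 = np.zeros(hidden_size)
        self.W2 = rng.normal(0.0, 0.05, (hidden_size, 2 * (K - 1)))
        self.b2 = np.zeros(2 * (K - 1))
        self.K = K

    def forward(self, bow):
        h = np.tanh(bow @ self.W1 + self.b1)
        out = softplus(h @ self.W2 + self.b2)
        a, b = out[: self.K - 1], out[self.K - 1:]  # first intermediate parameters
        return a, b
```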
S103: input the first intermediate parameters into a probability distribution model to obtain a probability density function of second intermediate parameters.
The probability distribution model is a probability distribution. One specific example is the Kumaraswamy distribution, a two-parameter continuous distribution taking values in the range [0, 1]; under different parameter values, the shape of its density function differs. It should be noted that this distribution has two parameters, and both can be output by the neural network model. That is to say, the neural network model can output two kinds of first intermediate parameters, which serve as the two parameters of the Kumaraswamy distribution; with them, the Kumaraswamy distribution gives a probability density function. This probability density function is the probability density function of another intermediate parameter which, to distinguish it from the first intermediate parameters above, is called the second intermediate parameter.
S104: sample target second intermediate parameters from the probability density function of the second intermediate parameters.
The probability density function describes a probability distribution, from which samples can be drawn; the sampling mode is random sampling. The sampled values are the target second intermediate parameters. It should be noted that the number of samples is the same as the number of first intermediate parameters of each kind, i.e. K − 1. The sampled target second intermediate parameters form a vector of K − 1 dimensions.
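Since the application names the Kumaraswamy distribution as an example of the probability distribution model, one way to draw these samples is inverse-CDF sampling, using the closed-form Kumaraswamy CDF F(v) = 1 − (1 − v^a)^b. The sketch below assumes vector-valued parameters a and b; the function name is illustrative.

```python
import numpy as np

def sample_kumaraswamy(a, b, rng=None):
    """Draw v_k ~ Kumaraswamy(a_k, b_k) by inverting the CDF.

    Inverting F(v) = 1 - (1 - v^a)^b at a uniform draw u gives
    v = (1 - (1 - u)^(1/b))^(1/a), a value in (0, 1).
    """
    rng = rng or np.random.default_rng()
    u = rng.uniform(0.0, 1.0, size=np.shape(a))
    return (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)
```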
S105: input the target second intermediate parameters into the stick-breaking algorithm to obtain the topic weights of the target document.
Since the present application sets the maximum topic number K, the weights of K topics need to be obtained. However, the target second intermediate parameters contain only K − 1 elements, so one element with probability value 1 can be appended to the vector of target second intermediate parameters as its last element. After this processing, a K-dimensional vector is obtained; inputting this K-dimensional vector into the stick-breaking algorithm yields K values, and these K values are the weights of the K topics of the target document.
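A minimal sketch of this step follows, assuming the v_k are already in (0, 1): each v_k takes a fraction of the stick left over after the previous breaks, which is exactly the computation formalized as formulas (6) and (7) in the training section below.

```python
import numpy as np

def stick_breaking(v):
    """Turn K-1 stick fractions (plus an appended v_K = 1) into K topic weights.

    pi_k = v_k * prod_{j<k} (1 - v_j); appending v_K = 1 assigns the
    remaining stick to the last topic so the K weights sum to 1.
    """
    v = np.append(v, 1.0)  # the appended element with probability value 1
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining
```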
It should be noted that, although the present application obtains the weights of K topics, where K is a relatively infinite value compared with the actual number of topics of a document, the neural network obtained by the present application is a fully trained one with good explanatory power for documents: the intermediate parameters it outputs make the weight values of some topics comparatively small, close to 0. The present application can therefore take the weight values satisfying a certain condition as the final topic weights of the target document. Since different target documents input into the neural network produce different first intermediate parameters, the resulting topic weights also differ, and the number and values of the selected final topic weights are likewise not the same. It can be seen that the present application can adaptively learn the topic number and topic weights of a target document according to the document itself.
It can be seen from the above technical solutions that the present application provides a method for determining the topics of a document. The method inputs a target document into a pre-trained neural network model to obtain first intermediate parameters; the first intermediate parameters yield second intermediate parameters through a probability distribution model; and the second intermediate parameters then yield the topic weights of the target document through a stick-breaking algorithm. After the topic weights are determined, the topic corresponding to the largest topic weight can be determined as the target topic expressed by the target document, and documents can then be pushed, according to the target topic, to users interested in that topic.
The above document topic determination method can be considered to be realized by a topic model; that is to say, the topic model is used to determine how many topics a document contains and the weight of each topic. The topic model contains unknown parameters, which need to be trained on document samples to determine their values. This process is called the training process of the topic model and is described in detail below.
The topic model assumes that the generating process of a document is as follows, i.e. it assumes a document is generated in the following way.
K topics are preset, each topic contains multiple words, and each topic has a corresponding weight value; the weight value indicates the proportion of the topic in a document. Topics are randomly sampled according to the weight values. Suppose the first randomly sampled topic is topic K1; some words are then sampled from the words corresponding to topic K1. A second topic K2 is then randomly sampled, and some words are sampled from the words corresponding to topic K2. By repeating this, multiple words are obtained, and combining these words yields a document. Supposing a document is generated in the above manner, with the war topic accounting for 90% of the generated document and the economy topic accounting for 10%, the number of sampled words corresponding to war will naturally exceed the number of words corresponding to economy.
The above generating process can be expressed by a document generation model, which can be represented by the following formulas (1) and (2):

φ̄ = σ(Σ_k π_k φ_k)   (1)
w_i ~ Multinomial(φ̄)   (2)

Here φ̄ denotes the probability distribution of the words of the document to be generated; it is a vector whose dimension is the number of words, each word having its own probability. φ denotes a topic; π denotes the topic weights; σ is the softmax function, used to normalize the weighted values to between 0 and 1 so that they sum to 1.
Regarding the parameter φ, it should be noted that a vocabulary (also called a dictionary) can be preset. Each topic corresponds to one φ, and φ is a multi-dimensional vector whose dimension is the number of words in the vocabulary; the value of each element of the vector indicates the probability of the corresponding vocabulary word appearing in a document.
Multinomial denotes the multinomial distribution; w_i denotes each word sampled according to the word probability distribution φ̄ of the document to be generated, and these words form the document. For example, if the vocabulary contains 20,000 words, the resulting φ̄ is a 20,000-dimensional vector; each element of the vector is a probability value greater than 0 and less than 1, and the elements sum to 1, e.g. φ̄ = (0.01, 0.2, 0.025, 0.071, …). Each probability value corresponds to a word in the vocabulary, and the multinomial distribution samples words according to these probability values; the sampled words can form a document.
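The following sketch plays this generative story forward under stated assumptions: Φ is a K × V matrix holding one topic vector φ_k per row, and the word distribution of formula (1) is the softmax of the topic-weighted combination. Names and shapes are illustrative.

```python
import numpy as np

def generate_document(pi, Phi, num_words, rng=None):
    """Sample word indices per formulas (1) and (2).

    pi  : (K,) topic weights, e.g. from stick_breaking above
    Phi : (K, V) topic vectors over a V-word vocabulary
    """
    rng = rng or np.random.default_rng()
    scores = pi @ Phi                      # topic-weighted combination
    phi_bar = np.exp(scores - scores.max())
    phi_bar /= phi_bar.sum()               # softmax: formula (1)
    return rng.choice(len(phi_bar), size=num_words, p=phi_bar)  # formula (2)
```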
Formula (1) of the document generation model contains the topics φ and the topic weights π. One way of setting the topic weights π is to randomly sample the weight value of each topic from a GEM distribution. The reason for sampling from a GEM distribution is that, in a document, the weight values of the topics must each be greater than 0 and sum to 1, and the probability values sampled from a GEM distribution satisfy this requirement.
It should be noted that the present application models the generating process of documents using the GEM distribution. A document generation model obtained in this manner places no restriction on the number of topics, so the model can obtain the topic weights of any number of topics, and a suitable topic number can subsequently be selected according to the characteristics of the training data set. The document generation model constructed by the present application describes the generating process of documents and contains unknown parameters; after the unknown parameters are determined, inputting a document into the document generation model yields a high probability for that document while also satisfying the other constraints.
It should be noted that, although topic weights can be randomly sampled from the above distribution, weights obtained by random sampling are generated directly from the prior, and there is no guarantee that they are accurate. To solve this problem, a neural network can be used to operate on document samples to determine the topic weights π.
The values of the topics φ need to be trained, the topic weights π need to be obtained from the output values of the neural network, and the parameters of the neural network likewise need to be trained. The following mainly introduces how the parameters of the neural network and the parameters of the document generation model are trained; this process may be called the training process of the topic model. As shown in Fig. 3, the training process of the topic model can include the following steps S301 to S307.
S301: input a document sample into the neural network to obtain first intermediate parameters.
Specifically, the input document is a document sample; inputting the document sample into the neural network produces the first intermediate parameters. This process can be expressed by the following formula (3):
[a_1, …, a_{K−1}; b_1, …, b_{K−1}] = g(w_{1:N}; ψ)   (3)
Here g(w_{1:N}; ψ) denotes the neural network; ψ denotes the model parameters of the neural network; w_{1:N} denotes the bag-of-words model of the document sample, where N denotes the number of words in the document sample; a and b are the first intermediate parameters; K denotes the maximum topic number in the topic model, whose value is preset. It can be seen that the neural network model obtains a preset number of first intermediate parameters (K − 1 of each kind) under the constraint of its model parameters.
It should be noted that the model parameters of this neural network are preset values at this point, so the network may be called the initial neural network.
S302: after the first intermediate parameters are obtained, obtain the second intermediate parameters using the first intermediate parameters and the probability distribution model.
After the first intermediate parameters a and b are obtained, the probability distribution model is first used to obtain the probability density function of the second intermediate parameters. The probability distribution model can be, but is not limited to, the Kumaraswamy distribution.
Taking the Kumaraswamy distribution as an example, the probability density of the second intermediate parameters v can be obtained using the following formula (4):

q_ψ(v_k | w_{1:N}) = a_k b_k v_k^{a_k − 1} (1 − v_k^{a_k})^{b_k − 1}   (4)
Here v is a vector; q_ψ(v | w_{1:N}) denotes the probability density function of the vector v; w_{1:N} denotes the bag-of-words model of the document sample, where N denotes the number of words in the document sample; and a and b are the first intermediate parameters.
After the probability density function of the second intermediate parameters is obtained, the second intermediate parameters are sampled from the probability density function. Specifically, it can be understood that the probability density function describes the probability of the second intermediate parameters, so the second intermediate parameters can be obtained from it by random sampling. This process can be expressed as the following formula (5):
v ~ q_ψ(v | w_{1:N})   (5)
Multiple second intermediate parameters are sampled; their number is the maximum topic number of the topic model minus 1, i.e. K − 1. These K − 1 second intermediate parameters can form a (K − 1)-dimensional vector v = [v_1, v_2, …, v_{K−1}].
S303: input the second intermediate parameters into the stick-breaking algorithm to obtain the topic weight values of the document sample.
The (K − 1)-dimensional second intermediate parameters are extended with a second intermediate parameter v_K (with v_K = 1) to obtain a K-dimensional vector of second intermediate parameters v = [v_1, v_2, …, v_K]; this K-dimensional vector is then input into the formula of the stick-breaking algorithm to obtain the topic weight values π of the document sample. The above process can be expressed, as a mapping from v to π, by the following formula (6):

π = StickBreaking(v)   (6)

The stick-breaking algorithm can be described more finely as the following formula (7):

π_k = v_k ∏_{j=1}^{k−1} (1 − v_j),  k = 1, …, K   (7)

The π obtained by formulas (6) and (7) is the vector of topic weight values of the input document.
S304: input the topic weight values of the document sample into the document generation model to obtain the occurrence probability of each word in the document sample, and compute the generating probability of the document sample according to the occurrence probabilities.
Specifically, the document generation model is the model represented by formulas (1) and (2) above, with the parameter φ therein given a value and the topic number therein replaced by K instead of infinity. The parameter φ is a multi-dimensional vector whose dimension equals the number of words in the preset vocabulary.
The value of the parameter φ is initialized at the first round of training, and initialization can take various forms. One way is to initialize every element of φ to 0; another is to sample from a Gaussian distribution (for example with mean 0 and variance 0.05) and use the sampled values as the values of the elements of φ. Of course, initialization can also be done in other ways, or by sampling from other distributions such as a uniform distribution. It should be noted that the initialized value of φ is only used in the early rounds of training; afterwards, the value of φ is adjusted according to the training results.
Once the value of φ is set, the topic parameter φ̄ can be obtained by inputting the value of π into formula (1); this is then input into formula (2) to obtain the probability of each word appearing in the document sample. After the probability of each word is obtained, multiplying the probabilities of the words gives the generating probability of the document sample.
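Because a document multiplies many small word probabilities, this quantity is usually computed in log space. A sketch, with illustrative names, reusing the softmax word distribution above:

```python
import numpy as np

def document_log_likelihood(word_ids, pi, Phi):
    """Log generating probability of a document sample: the sum of the log
    occurrence probabilities of its words under formulas (1) and (2)."""
    scores = pi @ Phi
    log_phi_bar = scores - scores.max()
    log_phi_bar -= np.log(np.exp(log_phi_bar).sum())  # log-softmax
    return log_phi_bar[np.asarray(word_ids)].sum()
```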
It should be noted that after the stick-breaking method obtains the topic weight value of each topic of the document sample, the distribution of topic weight values is input into the document generation model. The meaning of stick breaking is that a stick can be broken indefinitely, so the weight values of innumerable topics are available, and a suitable topic number can subsequently be selected among these numerous topics.
It should be noted that the generating probability obtained above is based on the above document generation model and neural network model. With different parameters of the document generation model and the neural network, the resulting generating probability differs; the higher the generating probability of a document, the stronger the explanatory power of the document generation model and the neural network for the document, and the more accurate the determination of the document topics. Therefore, in order to accurately determine the topic weights of documents in subsequent document analysis, the exact values of the parameters of the neural network need to be determined.
In order to judge the accuracy of the parameters in the neural network and the parameters in the document generation model, an objective function can be constructed. Step S304 obtains the generating probability of the document sample; it can be understood that the larger the generating probability, the more accurate the parameters of the document generation model and the neural network. Therefore, after obtaining a lower bound of the generating probability, the parameters can be adjusted continually to make the generating probability larger; when the value of the generating probability satisfies a certain condition, the corresponding parameters are taken as the trained parameters.
Accordingly, an objective function can be set; the objective function expresses the relationship between the topic model parameters (the parameters of the neural network and the parameters of the document generation model) and the document generating probability.
S305: judge whether the generating probability of the document sample satisfies the optimization target corresponding to the objective function; if not, execute step S306; if so, execute step S307.
Whether the parameters of the document generation model and the neural network are accurate can be judged through the optimization target corresponding to the objective function. Specifically, after the generating probability of a document sample is obtained, it can be input into the objective function represented by the following formula (8), and verifying whether the objective function satisfies the corresponding optimization target verifies whether the parameters in the document generation model and the neural network are accurate:

L(w_{1:N} | Φ, ψ) = E_{q_ψ(v|w_{1:N})}[log p(w_{1:N} | π, Φ)] − KL(q_ψ(v | w_{1:N}) || p(v | α))   (8)

Here L(w_{1:N} | Φ, ψ) denotes the objective function; w_{1:N} denotes the bag-of-words model of the document sample, where N denotes the number of words in the document sample; Φ denotes the parameters of the document generation model; ψ denotes the parameters of the neural network; E denotes expectation; q_ψ(v | w_{1:N}) denotes the probability density function of the vector v; p(w_{1:N} | π, Φ) denotes the generating probability of the document sample obtained from the document generation model and the neural network; KL is the relative entropy, expressing the distance between two probability distributions (a probability distribution may also be called a probability density function), i.e. the distance between the probability density function q_ψ(v | w_{1:N}) and the probability density function p(v | α), where α is the concentration parameter of the GEM distribution.
It should be noted that −E_{q_ψ(v|w_{1:N})}[log p(w_{1:N} | π, Φ)] is the negative log-likelihood and specifically expresses the reconstruction error. The probability density function p(v | α) may be called the prior probability, and the probability density function q_ψ(v | w_{1:N}) may be called the posterior probability; the KL divergence (Kullback-Leibler divergence) between the posterior q_ψ(v | w_{1:N}) and the prior p(v | α) may also be called the regularization term. Regarding the prior p(v | α): it is related to the GEM distribution, whose parameter is α, and is the product of K probabilities, each drawn from a Beta(1, α) distribution. It should be noted that the objective function need not include the regularization term; the regularization term further improves the accuracy with which the objective function estimates the topic model parameters.
The right side of the formula of the above objective function contains two terms: the first term expresses the reconstruction error, and the second expresses the distance between the probability distribution output by the neural network and the prior probability distribution. From the formula of the objective function, i.e. formula (8), it can be seen that the topic model parameters Φ and ψ on the left side affect the values of the two terms on the right side. The larger the value obtained by subtracting the two terms on the right, the more accurate the topic model parameters. The objective function has a corresponding optimization target, namely that the difference of the two terms on the right can no longer grow. If this optimization target is not satisfied, i.e. the difference can still grow, the neural network still needs to be trained: step S306 is executed to adjust the topic model parameters, and the process returns to step S301 to train the neural network again with a new document sample; if the optimization target is satisfied, step S307 is executed. Alternatively, the optimization target can be a preset number of training iterations: if the number of training iterations reaches the preset number, step S307 is executed; if not, step S306 is executed to adjust the topic model parameters, and the process returns to step S301 to train the neural network again with a new document sample.
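To make the two terms of formula (8) concrete, the sketch below estimates the objective by Monte Carlo: the reconstruction term is averaged over samples of v, and the KL term is estimated from the same samples using the Kumaraswamy density of formula (4) and the Beta(1, α) factors of the GEM prior. Closed-form KL approximations are typical in practice; the sampled version keeps the sketch short. All function names are illustrative, and sample_kumaraswamy, stick_breaking and document_log_likelihood refer to the sketches defined earlier.

```python
import numpy as np

def log_kumaraswamy_pdf(v, a, b):
    # formula (4): q(v) = a * b * v^(a-1) * (1 - v^a)^(b-1)
    return np.log(a) + np.log(b) + (a - 1) * np.log(v) + (b - 1) * np.log1p(-v ** a)

def log_beta1_alpha_pdf(v, alpha):
    # Beta(1, alpha) density alpha * (1 - v)^(alpha - 1); the GEM prior
    # is the product of these factors over the stick fractions
    return np.log(alpha) + (alpha - 1) * np.log1p(-v)

def elbo_estimate(word_ids, a, b, Phi, alpha=1.0, num_samples=8, rng=None):
    """Monte Carlo estimate of objective (8): reconstruction minus KL."""
    rng = rng or np.random.default_rng()
    total = 0.0
    for _ in range(num_samples):
        v = sample_kumaraswamy(a, b, rng)
        pi = stick_breaking(v)
        rec = document_log_likelihood(word_ids, pi, Phi)
        kl = (log_kumaraswamy_pdf(v, a, b) - log_beta1_alpha_pdf(v, alpha)).sum()
        total += rec - kl
    return total / num_samples
```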
S306: adjust the parameters of the neural network and the parameters of the document generation model, and return to step S301.
The topic model parameters can be adjusted by the backward gradient propagation algorithm, which computes how the topic model parameters should change so that the difference of the two terms on the right of the formula grows; the topic model parameters are then adjusted in the direction that makes this difference larger. That is to say, the topic model parameters are adjusted so that the difference of the two terms on the right of the formula grows. It should be noted that the topic model includes both the neural network and the document generation model, so adjusting the topic model parameters means adjusting the parameters of the neural network and the parameters of the document generation model.
S307: determine the neural network model using the neural network parameters that satisfy the objective function, and determine the document generation model using the document generation model parameters that satisfy the objective function.
If the value of the right side of the formula in the objective function no longer grows after a certain adjustment of the topic model parameters, training is stopped and the neural network parameters and document generation model parameters at that moment are recorded. The neural network parameters are used to construct the neural network model, and the document generation model parameters are used to construct the document generation model. The trained neural network model can then be used in the document topic determination process, to determine the weights of document topics according to the flow shown in Fig. 1 above.
The above training process can be illustrated with Fig. 4. As shown in Fig. 4, a document sample is input into the neural network model to obtain the first intermediate parameters; via the probability distribution model, these yield the second intermediate parameters v, from which the corresponding topic weights π are obtained; the K topic weights π serve as the input of the document generation model, which yields the probability distribution of each word in the vocabulary; the generating probability of the document sample can be computed from this probability distribution; finally, the objective function is used to adjust the parameters of the neural network and the parameters of the document generation model so that the generating probability of the document sample is maximized, and the neural network parameters and document generation model parameters corresponding to the maximal generating probability are recorded. The neural network parameters are used to determine the neural network model, and the document generation model parameters are used to determine the document generation model.
The training process of the present application gives the topic model the ability to automatically select the topic number according to the input document; such a topic model can determine the topic number adaptively and more accurately, and the performance of the topic model is better. In addition, by introducing a neural network, the present application makes the optimization process of the topic model more flexible.
To facilitate understanding of the technical solutions provided by the present application, an illustration by way of a specific example follows.
Suppose the words contained in a training document are {A, A, B, B, B, C, D, E, E} and the preset dictionary contains six words in total: A, B, C, D, E, F. According to the number of times each dictionary word occurs in the document, the document can be expressed as a vector [2, 3, 1, 1, 2, 0], where the value at the i-th position of the vector indicates how many times the word at the i-th position of the dictionary occurs in the document. For example, the value at the first position of the vector is 2, indicating that the word A at the first position of the dictionary occurs twice.
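This vectorization is straightforward to reproduce; a sketch of the worked example:

```python
from collections import Counter

def bag_of_words(doc_words, dictionary):
    """Count how often each dictionary word occurs, in dictionary order."""
    counts = Counter(doc_words)
    return [counts[word] for word in dictionary]

dictionary = ["A", "B", "C", "D", "E", "F"]
doc = ["A", "A", "B", "B", "B", "C", "D", "E", "E"]
print(bag_of_words(doc, dictionary))  # -> [2, 3, 1, 1, 2, 0]
```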
The vector is then input into the neural network model, whose initial parameter values are set randomly. After processing by this initially configured neural network model, a series of first intermediate parameters a_1, …, a_{K−1}, b_1, …, b_{K−1} is obtained. These first intermediate parameters are input into the probability distribution model to obtain the second intermediate parameters v_i; the second intermediate parameters v_i are input into the stick-breaking algorithm to obtain the topic weight values π_i of the document sample; and the topic weight values π_i are input into the document generation model containing the topic parameters φ_i, yielding the probability distribution of the document under the current neural network model parameters and topic parameters.
The generating probability of the document is obtained from this probability distribution, the objective function as in formula (8) and its corresponding optimization target are obtained, and whether the current parameters of the neural network model and the topic parameters are accurate is judged according to whether the generating probability of the document satisfies the optimization target corresponding to the objective function. If they are not accurate, the parameters of the neural network model and the topic parameters need to be optimized by gradient descent. The optimization target can be a condition on the number of training iterations, or that the value of the objective function no longer decreases appreciably. If the optimization target is satisfied, training stops, the parameter values of the neural network model and the topic parameters are obtained, and the neural network model and the document generation model can be determined with the determined parameter values.
After the neural network model and the document generation model are obtained, the topic weights of an unknown document can be determined. For example, suppose the vector of an unknown document is expressed as [1, 1, 2, 3, 2, 0]. This vector is input into the trained neural network model to obtain the first intermediate parameters; these first intermediate parameters are input into the probability distribution model to obtain the second intermediate parameters; finally, the second intermediate parameters v_i are input into the stick-breaking algorithm to obtain the topic weight values of the document.
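Putting the inference-time pieces together, the following is a sketch of the Fig. 1 flow for an unknown document, reusing the earlier sketches. The weight threshold used here to keep only non-negligible topics is an illustrative choice, since the application only requires that the final weights satisfy a preset condition.

```python
import numpy as np

def topic_weights(bow, encoder, weight_floor=1e-3, rng=None):
    """Encoder -> Kumaraswamy sampling -> stick breaking -> kept topics."""
    a, b = encoder.forward(bow)        # first intermediate parameters
    v = sample_kumaraswamy(a, b, rng)  # target second intermediate parameters
    pi = stick_breaking(v)             # weights of all K candidate topics
    return {k: w for k, w in enumerate(pi) if w >= weight_floor}

# usage with the sketches above (sizes hypothetical):
# enc = EncoderMLP(vocab_size=6, hidden_size=32, K=50)
# print(topic_weights(np.array([1, 1, 2, 3, 2, 0]), enc))
```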
The present application also provides an apparatus for determining the topics of a document. As shown in Fig. 5, the apparatus specifically comprises: a document and model obtaining unit 501, a first intermediate parameter obtaining unit 502, a probability density function obtaining unit 503, a second intermediate parameter obtaining unit 504 and a topic weight determination unit 505.
The document and model obtaining unit 501 is configured to obtain a target document and a neural network model, the neural network model being configured to obtain a preset number of first intermediate parameters under the constraint of its model parameters.
The first intermediate parameter obtaining unit 502 is configured to input the target document into the neural network model to obtain first intermediate parameters.
The probability density function obtaining unit 503 is configured to input the first intermediate parameters into a probability distribution model to obtain a probability density function of second intermediate parameters.
The second intermediate parameter obtaining unit 504 is configured to sample target second intermediate parameters from the probability density function of the second intermediate parameters.
The topic weight determination unit 505 is configured to input the target second intermediate parameters into a stick-breaking algorithm to obtain the topic weights of the target document.
See Fig. 6; the present application also provides another apparatus for determining document topics. As shown in Fig. 6, on the basis of the apparatus shown in Fig. 5, this document topic determination apparatus can further comprise a neural network model training unit 506.
The neural network model training unit 506 is configured to train the neural network model.
The neural network model training unit 506 is specifically configured to:
input a document sample into a neural network model to obtain first intermediate parameters of the neural network model;
input the first intermediate parameters of the neural network model into a probability distribution model to obtain second intermediate parameters of the neural network model;
input the second intermediate parameters of the neural network model into a stick-breaking algorithm to obtain topic weight values of the document sample;
input the topic weight values of the document sample into a document generation model to obtain the occurrence probability of each word in the document sample, and compute the generating probability of the document sample according to the occurrence probabilities;
judge whether the generating probability of the document sample satisfies the optimization target corresponding to an objective function;
if the generating probability of the document sample does not satisfy the optimization target, adjust the model parameters of the neural network model and the parameters of the document generation model, and return to inputting a new document sample into the neural network model with adjusted model parameters to obtain the first intermediate parameters of the adjusted neural network model;
if the generating probability of the document sample satisfies the optimization target, take the neural network model with adjusted model parameters as the trained neural network model.
In one example, the objective function corresponds to the formula:

L(w_{1:N} | Φ, ψ) = E_{q_ψ(v|w_{1:N})}[log p(w_{1:N} | π, Φ)] − KL(q_ψ(v | w_{1:N}) || p(v | α))

where L(w_{1:N} | Φ, ψ) denotes the objective function; w_{1:N} denotes the bag-of-words model of the document sample, where N denotes the number of words in the document sample; Φ denotes the parameters of the document generation model; ψ denotes the parameters of the neural network; E denotes expectation; q_ψ(v | w_{1:N}) denotes the probability density function of the vector v; p(w_{1:N} | π, Φ) denotes the generating probability of the document sample obtained from the document generation model and the neural network; KL is the relative entropy, expressing the distance between the probability density function q_ψ(v | w_{1:N}) and the probability density function p(v | α); and α is the concentration parameter of the GEM distribution.
Correspondingly, a target satisfaction judgment subunit is specifically configured to judge whether the value obtained after inputting the generating probability of the document sample into the formula corresponding to the objective function satisfies a preset training termination condition.
In one example, the neural network model training unit being configured to adjust the model parameters of the neural network model and the parameters of the document generation model comprises:
the neural network model training unit being specifically configured to adjust the model parameters of the neural network model and the parameters of the document generation model using the backward gradient propagation algorithm.
In one example, the topic weight determination unit being configured to input the target second intermediate parameters into the stick-breaking algorithm to obtain the topic weights of the target document comprises:
the topic weight determination unit being specifically configured to input the target second intermediate parameters into the stick-breaking algorithm to obtain the weights of each candidate topic, and to take the topic weights satisfying a preset condition as the topic weights of the target document.
See Fig. 7, a hardware structure diagram of a document topic determination device provided by the present application. As shown in Fig. 7, the device may comprise a processor 701, a memory 702 and a communication bus 703.
The processor 701 and the memory 702 communicate with each other through the communication bus 703.
The processor 701 is configured to execute a program; the program may include program code, and the program code includes operation instructions for the processor. The program can be specifically used to:
obtain a target document and a neural network model, the neural network model being configured to obtain a preset number of first intermediate parameters under the constraint of its model parameters;
input the target document into the neural network model to obtain the first intermediate parameters;
input the first intermediate parameters into a probability distribution model to obtain a probability density function of second intermediate parameters;
sample target second intermediate parameters from the probability density function of the second intermediate parameters; and
input the target second intermediate parameters into a stick-breaking algorithm to obtain the topic weights of the target document.
The processor 701 may be a central processing unit (CPU) or an application-specific integrated circuit (ASIC), or be configured as one or more integrated circuits implementing the embodiments of the present application.
The memory 702 is configured to store the program; the memory 702 may include high-speed RAM memory and may also include non-volatile memory, for example at least one magnetic disk memory.
The present application also provides a readable storage medium storing a plurality of instructions; the instructions are adapted to be loaded by a processor to execute the steps of the document topic determination method above.
It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments can be referred to one another.
It should also be noted that, herein, relational terms such as first and second are used merely to distinguish one entity or operation from another entity or operation, without necessarily requiring or implying any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not explicitly listed, or further includes elements intrinsic to such a process, method, article or device. In the absence of further limitation, an element defined by the phrase "including a …" does not exclude the existence of other identical elements in the process, method, article or device including that element.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein can be realized in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A method for determining the topics of a document, characterized by comprising:
obtaining a target document and a neural network model, the neural network model being configured to obtain a preset number of first intermediate parameters under the constraint of its model parameters;
inputting the target document into the neural network model to obtain the first intermediate parameters;
inputting the first intermediate parameters into a probability distribution model to obtain a probability density function of second intermediate parameters;
sampling target second intermediate parameters from the probability density function of the second intermediate parameters; and
inputting the target second intermediate parameters into a stick-breaking algorithm to obtain the topic weights of the target document.
2. The method for determining the topics of a document according to claim 1, characterized in that the training process of the neural network model comprises:
inputting a document sample into a neural network model to obtain first intermediate parameters of the neural network model;
inputting the first intermediate parameters of the neural network model into a probability distribution model to obtain second intermediate parameters of the neural network model;
inputting the second intermediate parameters of the neural network model into a stick-breaking algorithm to obtain topic weight values of the document sample;
inputting the topic weight values of the document sample into a document generation model to obtain the occurrence probability of each word in the document sample, and computing the generating probability of the document sample according to the occurrence probabilities;
judging whether the generating probability of the document sample satisfies an optimization target corresponding to an objective function;
if the generating probability of the document sample does not satisfy the optimization target, adjusting the model parameters of the neural network model and the parameters of the document generation model, and returning to inputting a new document sample into the neural network model with adjusted model parameters to obtain the first intermediate parameters of the adjusted neural network model; and
if the generating probability of the document sample satisfies the optimization target, taking the neural network model with adjusted model parameters as the trained neural network model.
3. The method for determining the topics of a document according to claim 2, characterized in that the objective function corresponds to the formula:

L(w_{1:N} | Φ, ψ) = E_{q_ψ(v|w_{1:N})}[log p(w_{1:N} | π, Φ)] − KL(q_ψ(v | w_{1:N}) || p(v | α))

wherein L(w_{1:N} | Φ, ψ) denotes the objective function; w_{1:N} denotes the bag-of-words model of the document sample, where N denotes the number of words in the document sample; Φ denotes the parameters of the document generation model; ψ denotes the parameters of the neural network; E denotes expectation; q_ψ(v | w_{1:N}) denotes the probability density function of the vector v; p(w_{1:N} | π, Φ) denotes the generating probability of the document sample obtained from the document generation model and the neural network; KL is the relative entropy, expressing the distance between the probability density function q_ψ(v | w_{1:N}) and the probability density function p(v | α); and α is the concentration parameter of the GEM distribution;
correspondingly, judging whether the generating probability of the document sample satisfies the optimization target comprises:
judging whether the value obtained after inputting the generating probability of the document sample into the formula corresponding to the objective function satisfies a preset training termination condition.
4. The method for determining the topics of a document according to claim 3, characterized in that adjusting the model parameters of the neural network model and the parameters of the document generation model comprises:
adjusting the model parameters of the neural network model and the parameters of the document generation model using the backward gradient propagation algorithm.
5. The method for determining the topics of a document according to claim 1, characterized in that inputting the target second intermediate parameters into the stick-breaking algorithm to obtain the topic weights of the target document comprises:
inputting the target second intermediate parameters into the stick-breaking algorithm to obtain the weights of each candidate topic; and
taking the topic weights satisfying a preset condition as the topic weights of the target document.
6. An apparatus for determining the topics of a document, characterized by comprising:
a document and model obtaining unit, configured to obtain a target document and a neural network model, the neural network model being configured to obtain a preset number of first intermediate parameters under the constraint of its model parameters;
a first intermediate parameter obtaining unit, configured to input the target document into the neural network model to obtain first intermediate parameters;
a probability density function obtaining unit, configured to input the first intermediate parameters into a probability distribution model to obtain a probability density function of second intermediate parameters;
a second intermediate parameter obtaining unit, configured to sample target second intermediate parameters from the probability density function of the second intermediate parameters; and
a topic weight determination unit, configured to input the target second intermediate parameters into a stick-breaking algorithm to obtain the topic weights of the target document.
7. The apparatus for determining the topics of a document according to claim 6, characterized by further comprising:
a neural network model training unit, configured to train the neural network model;
wherein the neural network model training unit is specifically configured to:
input a document sample into a neural network model to obtain first intermediate parameters of the neural network model;
input the first intermediate parameters of the neural network model into a probability distribution model to obtain second intermediate parameters of the neural network model;
input the second intermediate parameters of the neural network model into a stick-breaking algorithm to obtain topic weight values of the document sample;
input the topic weight values of the document sample into a document generation model to obtain the occurrence probability of each word in the document sample, and compute the generating probability of the document sample according to the occurrence probabilities;
judge whether the generating probability of the document sample satisfies an optimization target corresponding to an objective function;
if the generating probability of the document sample does not satisfy the optimization target, adjust the model parameters of the neural network model and the parameters of the document generation model, and return to inputting a new document sample into the neural network model with adjusted model parameters to obtain the first intermediate parameters of the adjusted neural network model; and
if the generating probability of the document sample satisfies the optimization target, take the neural network model with adjusted model parameters as the trained neural network model.
8. The device for determining a document topic according to claim 7, wherein the objective function corresponds to the formula:

$$\mathcal{L}(W_{1:N} \mid \Phi, \psi) = \mathbb{E}_{q_\psi(v \mid w_{1:N})}\!\left[\log p(W_{1:N} \mid \pi, \Phi)\right] - \mathrm{KL}\!\left(q_\psi(v \mid w_{1:N}) \,\|\, p(v \mid \alpha)\right)$$

wherein $\mathcal{L}(W_{1:N} \mid \Phi, \psi)$ denotes the objective function; $W_{1:N}$ denotes the bag of words of the document sample, $N$ being the number of words in the document sample; $\Phi$ denotes the parameters of the document generation model; $\psi$ denotes the parameters of the neural network; $\mathbb{E}$ denotes expectation, and $q_\psi(v \mid w_{1:N})$ the probability density function of the vector $v$; $p(W_{1:N} \mid \pi, \Phi)$ denotes the generating probability of the document sample obtained from the document generation model and the neural network; $\mathrm{KL}$ denotes the relative entropy, that is, the distance between the probability density functions $q_\psi(v \mid w_{1:N})$ and $p(v \mid \alpha)$; and $\alpha$ is the concentration parameter of the GEM distribution;
correspondingly, the objective satisfaction judgment subunit is specifically configured to judge whether the value obtained by substituting the generating probability of the document sample into the formula corresponding to the objective function satisfies a preset training termination condition.
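For concreteness, the objective above can be estimated per document with a single Monte Carlo sample. The sketch below additionally assumes a logit-normal posterior q over the break fractions v and independent Beta(1, α) factors for the GEM prior p(v | α); these modelling choices, and all names, are assumptions rather than details given by the patent:

```python
import math
import torch

def elbo_estimate(doc_bow, mu, log_var, Phi, alpha=1.0):
    """One-sample estimate of L = E_q[log p(W | pi, Phi)] - KL(q || p(v | alpha))."""
    std = torch.exp(0.5 * log_var)
    v_raw = mu + std * torch.randn_like(mu)
    v = torch.sigmoid(v_raw).clamp(1e-6, 1 - 1e-6)

    # Reconstruction term: log generating probability of the bag of words.
    remaining = torch.cat([torch.ones(1), torch.cumprod(1 - v[:-1], dim=0)])
    pi = v * remaining
    log_p_doc = (doc_bow * torch.log(pi @ torch.softmax(Phi, -1) + 1e-10)).sum()

    # Monte Carlo KL term: E_q[log q(v) - log p(v | alpha)].
    log_q = (-0.5 * ((v_raw - mu) / std) ** 2
             - torch.log(std) - 0.5 * math.log(2 * math.pi)  # Gaussian on v_raw
             - torch.log(v * (1 - v))).sum()                 # change of variables
    log_prior = (math.log(alpha) + (alpha - 1) * torch.log(1 - v)).sum()  # Beta(1, a)
    return log_p_doc - (log_q - log_prior)
```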
9. The device for determining a document topic according to claim 8, wherein the neural network model training unit being configured to adjust the model parameters of the neural network model and the parameters of the document generation model comprises:
the neural network model training unit being specifically configured to adjust the model parameters of the neural network model and the parameters of the document generation model by means of a gradient back-propagation algorithm.
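Continuing the hypothetical PyTorch sketch given after claim 7, adjusting both parameter sets by back-propagation amounts to registering the network parameters ψ and the generation-model parameters Φ with a single optimizer, so one backward pass updates both:

```python
import torch

vocab, k = 1000, 8
encoder = torch.nn.Linear(vocab, 2 * k)            # psi: neural network parameters
Phi = torch.randn(k, vocab, requires_grad=True)    # Phi: document generation model
opt = torch.optim.Adam(list(encoder.parameters()) + [Phi], lr=1e-3)

doc_bow = torch.randint(0, 3, (vocab,)).float()    # toy document sample
for _ in range(200):
    # train_step is the hypothetical helper sketched after claim 7;
    # the stopping threshold is illustrative, not from the patent.
    if train_step(doc_bow, encoder, Phi, opt) >= -2500.0:
        break
```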
10. The device for determining a document topic according to claim 6, wherein the topic weight determination unit being configured to input the target second intermediate parameters into the stick-breaking algorithm to obtain the topic weights of the target document comprises:
the topic weight determination unit being specifically configured to input the target second intermediate parameters into the stick-breaking algorithm to obtain the weight of each candidate topic, and to take the topic weights satisfying a preset condition as the topic weights of the target document.
11. A device for determining a document topic, comprising a memory and a processor, wherein the processor, by running a software program stored in the memory and calling data stored in the memory, performs at least the following steps:
obtaining a target document and a neural network model, the neural network model being configured to obtain a preset number of first intermediate parameters under the constraint of its model parameters;
inputting the target document into the neural network model to obtain the first intermediate parameters;
inputting the first intermediate parameters into a probability distribution model to obtain the probability density function of the second intermediate parameters;
sampling the target second intermediate parameters from the probability density function of the second intermediate parameters;
inputting the target second intermediate parameters into the stick-breaking algorithm to obtain the topic weights of the target document.
12. A readable storage medium on which a computer program is stored, wherein, when the computer program is executed by a processor, the method for determining a document topic according to any one of claims 1 to 5 is performed.
CN201810350016.3A 2018-04-18 2018-04-18 Method and related device for determining a document topic Pending CN110390092A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810350016.3A 2018-04-18 2018-04-18 Method and related device for determining a document topic

Publications (1)

Publication Number Publication Date
CN110390092A true CN110390092A (en) 2019-10-29

Family

ID=68283322

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810350016.3A Pending CN110390092A (en) 2018-04-18 2018-04-18 Method and related device for determining a document topic

Country Status (1)

Country Link
CN (1) CN110390092A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103942340A * 2014-05-09 2014-07-23 University of Electronic Science and Technology of China Microblog user interest recognition method based on text mining
CN105868178A * 2016-03-28 2016-08-17 Zhejiang University Multi-document automatic summarization method based on phrase topic modeling
CN107193892A * 2017-05-02 2017-09-22 Neusoft Corporation Document topic determination method and device
CN107423439A * 2017-08-04 2017-12-01 Yitu (Beijing) Technology Co., Ltd. Chinese question mapping method based on LDA

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Xuefei Ning et al., "A Bayesian Nonparametric Topic Model with Variational Auto-Encoders", https://openreview.net/forum?id=SKXQZNGC- *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20221202

Address after: 1402, Floor 14, Block A, Haina Baichuan Headquarters Building, No. 6, Baoxing Road, Haibin Community, Xin'an Street, Bao'an District, Shenzhen, Guangdong 518000

Applicant after: Shenzhen Yayue Technology Co.,Ltd.

Address before: 35th Floor, Tencent Building, No. 1 High-tech Zone, Nanshan District, Shenzhen, Guangdong Province 518000

Applicant before: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.