CN115879515B - Document network theme modeling method, variation neighborhood encoder, terminal and medium - Google Patents
Document network theme modeling method, variation neighborhood encoder, terminal and medium Download PDFInfo
- Publication number
- CN115879515B CN115879515B CN202310135750.9A CN202310135750A CN115879515B CN 115879515 B CN115879515 B CN 115879515B CN 202310135750 A CN202310135750 A CN 202310135750A CN 115879515 B CN115879515 B CN 115879515B
- Authority
- CN
- China
- Prior art keywords
- document
- representation
- neighborhood
- sample
- topic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 42
- 238000009826 distribution Methods 0.000 claims abstract description 101
- 238000004590 computer program Methods 0.000 claims description 20
- 230000007246 mechanism Effects 0.000 claims description 12
- 230000006870 function Effects 0.000 claims description 10
- 230000004913 activation Effects 0.000 claims description 4
- 238000010606 normalization Methods 0.000 claims description 3
- 230000004931 aggregating effect Effects 0.000 claims description 2
- 230000000694 effects Effects 0.000 abstract description 7
- 239000011159 matrix material Substances 0.000 description 7
- 238000010586 diagram Methods 0.000 description 4
- 230000008569 process Effects 0.000 description 3
- PXFBZOLANLWPMH-UHFFFAOYSA-N 16-Epiaffinine Natural products C1C(C2=CC=CC=C2N2)=C2C(=O)CC2C(=CC)CN(C)C1C2CO PXFBZOLANLWPMH-UHFFFAOYSA-N 0.000 description 1
- 238000003491 array Methods 0.000 description 1
- 238000013528 artificial neural network Methods 0.000 description 1
- 238000006243 chemical reaction Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 238000005070 sampling Methods 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
Images
Landscapes
- Document Processing Apparatus (AREA)
- Machine Translation (AREA)
Abstract
The invention provides a document network theme modeling method, a variation neighborhood encoder, a terminal and a medium, wherein the method comprises the following steps: acquiring a document network set, and respectively determining document input representations of all documents in the document network set; inputting document input representation of each document into a pre-trained variation neighborhood encoder for encoding processing to obtain hidden layer representation of each document, and determining the representation of a central document according to the hidden layer representation; a document-topic distribution is determined from the representation of the central document and a topic-word distribution is determined from the document-topic distribution. The invention can effectively determine the representation of the center document based on the hidden layer representation of each document, can effectively determine the document-topic distribution based on the representation of the center document, and can effectively determine the topic-word distribution based on the document-topic distribution so as to achieve the effect of modeling the topics of the document network.
Description
Technical Field
The invention relates to the technical field of topic modeling, in particular to a document network topic modeling method, a variation neighborhood encoder, a terminal and a medium.
Background
A document network is a network composed of documents and their relationships, e.g., a network composed of academic papers that are referenced to each other, a network composed of web page text that are linked to each other, etc. The document network is used as an important component of text data, and people can better understand the content distribution of the document by acquiring the theme of the document in the document network, so that how to effectively model the theme of the document in the document network is a problem to be solved at present.
Disclosure of Invention
The embodiment of the invention aims to provide a document network topic modeling method, a variation neighborhood encoder, a terminal and a medium, which aim to solve the problem of how to effectively model topics of documents in a document network in the prior art.
The embodiment of the invention is realized in such a way that a document network theme modeling method comprises the following steps:
acquiring a document network set, and respectively determining document input representations of all documents in the document network set;
inputting document input representation of each document into a pre-trained variation neighborhood encoder to perform coding processing to obtain hidden layer representation, and determining the representation of a central document according to the hidden layer representation;
a document-topic distribution is determined from the representation of the central document and a topic-word distribution is determined from the document-topic distribution.
Further, the determining the formula adopted by the document input representation of each document in the document network set comprises:
wherein, Vrepresenting a dictionary of words in the document collection,representation document->And document->Length of shortest path between, ∈>For documents->Chinese word->The number of occurrences>Is a text vector, ++>Is a 0-1 neighborhood vector,>is a high order neighborhood vector, ">Representative word->In document->Weights of->Representing a central document.
Further, before the encoding process is performed by the variance neighborhood encoder after the pre-training is input to the document input representation of each document, the method further comprises:
obtaining sample input representations of all sample documents, and inputting the sample input representations of all sample documents into the variation neighborhood encoder for coding processing to obtain sample inferred distribution parameters;
determining sample theme representations according to the sample inferred distribution parameters, and reconstructing each sample document according to the sample theme representations to obtain a reconstructed document;
determining prior loss according to sample inferred distribution parameters and prior normal distribution parameters of each sample document, and determining reconstruction loss according to each sample document and the reconstruction document;
and updating parameters of the variation neighborhood encoder according to the prior loss and the reconstruction loss until the variation neighborhood encoder converges, so as to obtain the pre-trained variation neighborhood encoder.
Further, the determining the prior loss according to the sample inferred distribution parameters and the prior normal distribution parameters of each sample document, and the formula adopted for determining the reconstruction loss according to each sample document and the reconstruction document comprises:
wherein, for the neighborhood document of each sample document, +.>For neighborhood documents regenerated from hidden topics, < +.>Indicating total loss->Representing reconstruction loss, ++>Representing a priori loss,/->Representing weight parameters->For words in the sample document, +.>To reconstruct words in a sample, KL (-) represents the KL divergence of the sample inferred distribution parameters and a priori normal distribution parameters,μandσmean and variance of inferred distribution inferred by inference network in the variance neighborhood encoder, respectively, +.>And->For the mean and variance of the a priori normal distribution parameters,/->Is normally distributed.
Further, the determining a representation of the central document from the hidden layer representation includes:
performing re-parameterization and attention mechanism processing on the hidden layer representation to obtain a theme representation of the central document;
and aggregating the neighborhood documents of each document and the topic representation of the center document by using a dot product attention mechanism to obtain the representation of the center document.
Further, the encoding processing is performed on the document input representation of each document by the variance neighborhood encoder after the pre-training, and the obtaining of the formula adopted by the hidden layer representation includes:
wherein, representing an activation function->All represent the training parameters of the encoder corresponding to the full connection layer in the variation neighborhood encoder,/->,,And->,Representing the total real space of the system,tfor the number of subjects to be the number,mfor dictionary size, ++>Representing logarithmic variance>Hidden layer representation representing a central document, +.>Representing slave scale andμandσthe resulting sample representation is randomly generated in the same multivariate normal distribution.
Further, the determining a formula employed by the document-topic distribution from the representation of the central document includes:
wherein, N(d) Representing and centering documentsdThere is a set of neighborhood documents for the path between,representing a center documentdNeighborhood document of->The standard lognormal distribution is represented by a graph,weightrepresenting the extent of influence of the neighborhood document on the center document, < +.>Is a central documentdAnd neighborhood document->The shortest path length between->For the degree of association of the center document with the neighborhood document, < >>Transpose of hidden layer representation for center document, +.>Hidden layer representation for neighborhood document, +.>Attention coefficients for the center document hidden layer representation and the neighborhood document hidden layer representation,θfor document-topic distribution, < >>For an unnormalized central document theme representation, +.>Representing the normalization function.
It is another object of an embodiment of the present invention to provide a variational neighborhood encoder, where the variational neighborhood encoder is applied to any one of the document network topic modeling methods, and the variational neighborhood encoder includes:
an input layer for determining document input representations of respective documents in the document network set, respectively;
the coding layer is used for carrying out coding processing on the document input representation of each document to obtain a hidden layer representation of each document, and carrying out re-parameterization and attention mechanism processing on the hidden layer representation to obtain a theme representation of the central document;
an attention layer for gathering a topic representation of the neighborhood document and the center document of each document using dot product attention to obtain a representation of the center document;
and a decoder for determining a document-topic distribution from the representation of the central document and a topic-word distribution from the document-topic distribution.
It is a further object of an embodiment of the present invention to provide a terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, which processor implements the steps of the method as described above when executing the computer program.
It is a further object of embodiments of the present invention to provide a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above method.
According to the embodiment of the invention, the document input representation of each document in the document network set is respectively determined, the encoding effect can be effectively achieved on each document based on the document input representation, the hidden layer representation corresponding to each document can be effectively deduced through encoding processing of the variance neighborhood encoder after the document input representation of each document is input in the pre-training, the representation of the central document can be effectively determined based on the hidden layer representation, the document-topic distribution can be effectively determined based on the representation of the central document, and the topic-word distribution can be effectively determined based on the document-topic distribution, so that the topic modeling effect on the document is achieved.
Drawings
FIG. 1 is a flow chart of a document network topic modeling method provided in a first embodiment of the present invention;
FIG. 2 is a flow chart of a document network topic modeling method provided in a second embodiment of the present invention;
FIG. 3 is a schematic diagram of a variation neighborhood encoder according to a third embodiment of the present invention;
fig. 4 is a schematic structural diagram of a terminal device according to a fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
In order to illustrate the technical scheme of the invention, the following description is made by specific examples.
Example 1
Referring to fig. 1, a flowchart of a document network topic modeling method according to a first embodiment of the present invention is provided, where the document network topic modeling method can be applied to any terminal device or system, and the document network topic modeling method includes the steps of:
step S10, acquiring a document network set, and respectively determining document input representations of all documents in the document network set;
wherein a given one comprisesnDocument network set of individual documentsBy the following constitutionDDictionary of words in (a) is written asVWhich comprisesmIndividual words, document net setDCan be expressed as a document-word matrix +.>. Wherein (1)>Representative word->In document->For example, TF-IDF).
DThe relationship of the documents in the document is represented by a 0-1 neighborhood matrix, namelyWherein the element->Is 1 +.>,Presence edge, 0 representing absence edge, document networkGDenoted as->WhereinDRepresenting a set of network of documents,Arepresentation ofDA neighborhood matrix of the document in question,Xa document-word matrix is represented and,Vrepresentation ofDA dictionary of words in (a);
in this step, in order to model the high-order graph structure information, the neighborhood matrixANot only the direct connection relation of two documents, namely first order neighborhood information, but also the second order or higher order neighborhood information and a higher order neighborhood matrix are recordedThe definition of the element is shown in the formula (1). Wherein (1)>Representation document->And document->The length of the path between them;
For document-word matrixXThe initialization is performed using a logarithmic regularization scheme as shown in equation (2). Wherein, for documents->Chinese word->Number of occurrences:
Finally, the text vector is represented by formula (3)xNeighborhood vectoraOr higher order neighborhood vectorCombining to obtain a document input representation of each document:
Step S20, inputting document input representation of each document into a pre-trained variation neighborhood encoder to perform coding processing to obtain hidden layer representation of each document, and determining the representation of a central document according to the hidden layer representation;
optionally, in this step, the determining a representation of the central document according to the hidden layer representation includes:
performing re-parameterization and attention mechanism processing on the hidden layer representation to obtain a theme representation of the central document;
gathering the neighborhood document of each document and the topic representation of the center document by using a dot product attention mechanism to obtain the representation of the center document;
wherein the pre-trained variational neighborhood Encoder (Variational Adjacent-Encoder, VADJE) encodes the central document via a fully-concatenated layer, infers hidden layer representations of the documents, and then obtains hidden layer representations of the central document via a re-parameterization and attention mechanism。
In this embodiment, normal distribution is used as prior distribution, the variation neighborhood encoder uses an arctangent function as an activation function of the full-connection layer in the encoding stage, and the training parameters are initialized by using Xavier gloot. The full-connection layer and re-parameterization process of the encoding stage is shown in formula (4):
Wherein, representing an activation function->All represent the mentionedTraining parameters of the encoder corresponding to the full connection layer in the variation neighborhood encoder>,,And->,Representing the total real space of the system,tfor the number of subjects to be the number,mfor dictionary size, ++>Representing logarithmic variance>Hidden layer representation representing a central document, +.>Representing slave scale andμandσthe resulting sample representation is randomly generated in the same multivariate normal distribution. The idea of the re-parameterisation is by distributing +.>And (3) sampling a variable, and carrying out affine transformation on the variable to obtain a needed hidden variable for the problem of back propagation in the VAE-like model.
Step S30, determining a document-topic distribution according to the representation of the central document, and determining a topic-word distribution according to the document-topic distribution;
wherein, after the re-parameterization, the variational neighborhood encoder uses the dot product attention-gathering neighborhood documentTheme representation with center document->Obtaining an unnormalized central document theme representation +.>Subsequently the +.A.A. with the softmax function will be applied>Document-topic distribution for conversion to documentsθThe specific process is shown in the formula (5): />
Wherein, drepresenting a central document that is to be displayed,N(d) Representing and centering documentsdThere is a set of neighborhood documents for the path between,representing a center documentdNeighborhood document of->The standard lognormal distribution is represented by a graph,weightrepresenting the extent of influence of the neighborhood document on the center document, < +.>Is a central documentdAnd neighborhood document->The shortest path length between->For the degree of association of the center document with the neighborhood document, < >>Transpose of hidden layer representation for center document, +.>Hidden layer representation for neighborhood document, +.>Attention coefficients for the center document hidden layer representation and the neighborhood document hidden layer representation,θfor document-topic distribution, < >>For an unnormalized central document theme representation, +.>Representing the normalization function.
In the step, in the decoding stage, the variance neighborhood encoder generates a center document itself and a neighborhood document thereof based on the document relationship existing in the document network, not only by using the hidden subject of the center document, as shown in formula (6):
Wherein, to activate the function +.>And->Trainable parameters for the corresponding full-link layer in the decoder,>for neighborhood documents regenerated from hidden topics, the topic-word distribution can be obtained by softmax variation of weights and bias parameters in the decoderβ。
In this embodiment, the step of generating the document network based on the variance neighborhood encoder includes:
when generating a document, firstly obtaining corresponding distribution parameters through an inference network in a variation neighborhood encoderAnd->,And->Respectively representing a mean value inference network and a standard deviation inference network of the VADJE;
generating topic distribution of documents by re-parameterization. For a given text, each word is generated from the word distribution of the corresponding text, which may be defined by the topic distribution of the document +.>Word distribution with topicsObtained and distributed in a plurality of ways, namely:
wherein, representing a center documentdWord of->Representing a multiple-term distribution, modeling a document connection as a Bernoulli binary variable when it is generated, and calculating the probability of the existence of the connection from the topic distribution of the document, i.eWherein->Representing the fully connected layer of the neural network, +.>Representing the bernoulli distribution.
in this embodiment, by determining the document input representations of the documents in the document network set, the encoding effect can be effectively achieved on the basis of the document input representations, the hidden layer representations corresponding to the documents can be effectively deduced by encoding the document input representations of the documents by the variance neighborhood encoder after the pre-training, the representation to the central document can be effectively determined on the basis of the hidden layer representations, the document-topic distribution can be effectively determined on the basis of the representation of the central document, and the topic-word distribution can be effectively determined on the basis of the document-topic distribution, so as to achieve the topic modeling effect on the documents.
Example two
Referring to fig. 2, a flowchart of a document network topic modeling method according to a second embodiment of the present invention is provided, where the steps before step S20 in the first embodiment are further refined, and the method includes the steps of:
step S40, obtaining sample input representations of all sample documents, and inputting the sample input representations of all sample documents into the variation neighborhood encoder for encoding processing to obtain sample inferred distribution parameters;
based on formulas (1) to (3), respectively obtaining sample input representations of all sample documents, and inputting the sample input representations of all sample documents into a variation neighborhood encoder for encoding processing to obtain sample inferred distribution parameters;
step S50, determining sample topic representations according to the sample inferred distribution parameters, and reconstructing each sample document according to the sample topic representations to obtain a reconstructed document;
the method comprises the steps of carrying out reparameterization and attention mechanism processing on sample topic representations to obtain sample topic representations, after reparameterization, using a variational neighborhood encoder to gather neighborhood documents of all sample documents and the sample topic representations by dot product attention to obtain sample representations, then converting the sample representations into sample document-topic distributions by a softmax function, determining sample topic-word distributions based on the sample document-topic distributions, and reconstructing all sample documents based on the sample topic-word distributions to obtain reconstructed documents;
step S60, determining prior loss according to sample inferred distribution parameters and prior normal distribution parameters of each sample document, and determining reconstruction loss according to each sample document and the reconstruction document;
wherein, in the model training stage, for each document, the loss function of the variational neighborhood encoder is divided into two parts, namely reconstruction loss and priori loss: reconstruction loss is the KL divergence between the inferred distribution obtained by the inferred network and the prior normal distribution, as shown in formula (7):
Wherein, neighborhood document for each sample documentdAlso one of its own neighborhood documents)>For neighborhood documents regenerated from hidden topics, KL (·) represents the KL divergence of the sample inferred distribution parameters and a priori normal distribution parameters,μandσmeans and variances of inferred distributions inferred for an inference network in the variational neighborhood encoder, the inferred distributions being in the form of normal distributions, parameters includingμAndσ,and->For the mean and variance of the a priori normal parameters, +.>Is normally distributed.
Step S70, updating parameters of the variation neighborhood encoder according to the prior loss and the reconstruction loss until the variation neighborhood encoder converges, so as to obtain the pre-trained variation neighborhood encoder;
if the current iteration number of the variation neighborhood encoder is greater than or equal to a number threshold, the variation neighborhood encoder is determined to converge, and the number threshold can be set according to requirements.
In this embodiment, sample input representation of each sample document is input to a variation neighborhood encoder to perform encoding processing, so that sample inference distribution parameters corresponding to each sample document can be effectively obtained, sample topic representations can be effectively determined based on the sample inference distribution parameters, each sample document can be effectively reconstructed based on the sample topic representations to obtain a reconstructed document, priori losses to a variation adjacent encoder can be effectively determined based on the sample inference distribution parameters and priori normal distribution parameters of each sample document, reconstruction losses to the variation neighborhood encoder can be effectively determined based on each sample document and the reconstructed document, parameter updating can be effectively performed to the variation neighborhood encoder based on the priori losses and the reconstruction losses, accuracy of parameters in the variation neighborhood encoder is improved, and accuracy of modeling of a document network topic is improved.
Example III
Referring to fig. 3, a schematic structure diagram of a variable neighborhood encoder according to a third embodiment of the present invention includes:
an input layer for determining document input representations of respective documents in the document network set, respectively; wherein, for each document in the document network, the input layer aims to obtain the corresponding input representation of each document.
And the coding layer is used for carrying out coding processing on the document input representation of each document to obtain a hidden layer representation of each document, and carrying out re-parameterization and attention mechanism processing on the hidden layer representation to obtain a theme representation of the central document.
The coding layer comprises an encoder 10 and a re-parameterization layer 11, the encoder is used for encoding the central document through a full connection layer to infer a hidden layer representation, and the re-parameterization layer 11 is used for obtaining a theme representation of the central document through a re-parameterization and attention mechanism. A normal distribution is used as a priori distribution in the encoder 10.
An attention layer 12 for gathering a topic representation of the center document and a neighborhood document of each document using dot product attention, resulting in a representation of the center document; wherein, after the reparameterization, the variation neighborhood encoder uses the dot product attention to gather the topic representations of the neighborhood document and the center document to obtain a representation of the center document.
A decoder 13 for determining a document-topic distribution from the representation of the central document and a topic-word distribution from the document-topic distribution.
According to the embodiment, the document input representations of the documents in the document network set are respectively determined, the encoding effect can be effectively achieved on the documents based on the document input representations, the hidden layer representations corresponding to the documents can be effectively deduced through encoding processing of the variance neighborhood encoder after the document input representations of the documents are input and pre-trained, the representation of the central document can be effectively determined based on the hidden layer representations, the document-topic distribution can be effectively determined based on the representation of the central document, the topic-word distribution can be effectively determined based on the document-topic distribution, and the topic modeling effect on the documents can be achieved.
Example IV
Fig. 4 is a block diagram of a terminal device 2 according to a fourth embodiment of the present application. As shown in fig. 4, the terminal device 2 of this embodiment includes: a processor 20, a memory 21 and a computer program 22 stored in said memory 21 and executable on said processor 20, for example a program of a document network topic modeling method. The steps of the various embodiments of the document network topic modeling method described above are implemented by the processor 20 when executing the computer program 22.
Illustratively, the computer program 22 may be partitioned into one or more modules that are stored in the memory 21 and executed by the processor 20 to complete the present application. The one or more modules may be a series of computer program instruction segments capable of performing specific functions for describing the execution of the computer program 22 in the terminal device 2. The terminal device may include, but is not limited to, a processor 20, a memory 21.
The processor 20 may be a central processing unit (Central Processing Unit, CPU), other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), off-the-shelf programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 21 may be an internal storage unit of the terminal device 2, such as a hard disk or a memory of the terminal device 2. The memory 21 may be an external storage device of the terminal device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the terminal device 2. Further, the memory 21 may also include both an internal storage unit and an external storage device of the terminal device 2. The memory 21 is used for storing the computer program as well as other programs and data required by the terminal device. The memory 21 may also be used for temporarily storing data that has been output or is to be output.
In addition, each functional module in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated modules, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Wherein the computer readable storage medium may be nonvolatile or volatile. Based on such understanding, the present application may implement all or part of the flow of the method of the above embodiment, or may be implemented by a computer program to instruct related hardware, where the computer program may be stored in a computer readable storage medium, and when the computer program is executed by a processor, the computer program may implement the steps of each method embodiment described above. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, executable files or in some intermediate form, etc. The computer readable storage medium may include: any entity or device capable of carrying computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth.
The above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.
Claims (7)
1. A document network topic modeling method, the method comprising the steps of:
acquiring a document network set, and respectively determining document input representations of all documents in the document network set;
inputting document input representation of each document into a pre-trained variation neighborhood encoder for encoding processing to obtain hidden layer representation of each document, and determining the representation of a central document according to the hidden layer representation;
determining a document-topic distribution from the representation of the central document and a topic-word distribution from the document-topic distribution;
the determining the formula adopted by the document input representation of each document in the document network set comprises the following steps:
wherein, Vrepresenting a dictionary of words in the document collection,representation document->And document->The length of the shortest path between them,for documents->Chinese word->The number of occurrences>Is a text vector, ++>Is a 0-1 neighborhood vector,>is a high-order neighborhood vector that,representative word->In document->Weights of->Representing a central document;
before the document input representation of each document is input to the pre-trained variance neighborhood encoder for encoding, the method further comprises the following steps:
obtaining sample input representations of all sample documents, and inputting the sample input representations of all sample documents into the variation neighborhood encoder for coding processing to obtain sample inferred distribution parameters;
determining sample theme representations according to the sample inferred distribution parameters, and reconstructing each sample document according to the sample theme representations to obtain a reconstructed document;
determining prior loss according to sample inferred distribution parameters and prior normal distribution parameters of each sample document, and determining reconstruction loss according to each sample document and the reconstruction document;
updating parameters of the variation neighborhood encoder according to the prior loss and the reconstruction loss until the variation neighborhood encoder converges, so as to obtain the variation neighborhood encoder after pre-training;
the prior loss is determined according to the sample inferred distribution parameters and the prior normal distribution parameters of each sample document, and the formula adopted for determining the reconstruction loss according to each sample document and the reconstruction document comprises the following steps:
wherein, for the neighborhood document of each sample document, +.>For neighborhood documents regenerated from hidden topics, < +.>Indicating total loss->Representing reconstruction loss, ++>Representing a priori loss,/->Representing weight parameters->For words in the sample document, +.>For reconstructing words in a sample, KL (·) represents the KL divergence of the sample inferred distribution parameters and a priori normal distribution parameters, +.>And->Mean and variance of inferred distribution inferred by inference network in the variance neighborhood encoder, respectively, +.>And->For the mean and variance of the a priori normal distribution parameters,/->Is normally distributed.
2. The document network topic modeling method of claim 1, wherein said determining a representation of a center document from said hidden layer representation includes:
performing re-parameterization and attention mechanism processing on the hidden layer representation to obtain a theme representation of the central document;
and aggregating the neighborhood documents of each document and the topic representation of the center document by using a dot product attention mechanism to obtain the representation of the center document.
3. The modeling method of document network topics according to claim 2, wherein the encoding the document input representation of each document by the variance neighborhood encoder after the pre-training to obtain the formula adopted by the hidden layer representation of each document comprises:
wherein, representing an activation function->All represent the training parameters of the encoder corresponding to the full connection layer in the variation neighborhood encoder,/->,,And->,Representing the total real space, +.>For the number of subjects->For dictionary size, ++>Representing logarithmic variance>Hidden layer representation representing a central document, +.>Representing slave scale andμandσthe resulting sample representation is randomly generated in the same multivariate normal distribution.
4. The document network topic modeling method of claim 3, wherein said determining a formula for document-topic distribution from the representation of the center document includes:
wherein, representation +.>Adjacent with paths betweenDomain document set,/->Representing a center documentNeighborhood document of->Representing a standard lognormal distribution,/->Representing the extent of influence of the neighborhood document on the center document, < +.>Is a center document->And neighborhood document->The shortest path length between->For the degree of association of the center document with the neighborhood document, < >>Transpose of hidden layer representation for center document, +.>Hidden layer representation for neighborhood document, +.>Attention coefficients for the central document hidden layer representation and the neighborhood document hidden layer representation, +.>For document-topic distribution, < >>For an unnormalized central document theme representation, +.>Representing the normalization function.
5. A variational neighborhood encoder, applied to the document network topic modeling method of any of claims 1-4, said variational neighborhood encoder comprising:
an input layer for determining document input representations of respective documents in the document network set, respectively;
the coding layer is used for carrying out coding processing on the document input representation of each document to obtain a hidden layer representation of each document, and carrying out re-parameterization and attention mechanism processing on the hidden layer representation to obtain a theme representation of the central document;
an attention layer for gathering a topic representation of the neighborhood document and the center document of each document using dot product attention to obtain a representation of the center document;
and a decoder for determining a document-topic distribution from the representation of the central document and a topic-word distribution from the document-topic distribution.
6. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 4 when the computer program is executed.
7. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310135750.9A CN115879515B (en) | 2023-02-20 | 2023-02-20 | Document network theme modeling method, variation neighborhood encoder, terminal and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310135750.9A CN115879515B (en) | 2023-02-20 | 2023-02-20 | Document network theme modeling method, variation neighborhood encoder, terminal and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115879515A CN115879515A (en) | 2023-03-31 |
CN115879515B true CN115879515B (en) | 2023-05-12 |
Family
ID=85761364
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310135750.9A Active CN115879515B (en) | 2023-02-20 | 2023-02-20 | Document network theme modeling method, variation neighborhood encoder, terminal and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115879515B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103970733A (en) * | 2014-04-10 | 2014-08-06 | 北京大学 | New Chinese word recognition method based on graph structure |
CN110866958A (en) * | 2019-10-28 | 2020-03-06 | 清华大学深圳国际研究生院 | Method for text to image |
CN112836017A (en) * | 2021-02-09 | 2021-05-25 | 天津大学 | Event detection method based on hierarchical theme-driven self-attention mechanism |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2386039B (en) * | 2002-03-01 | 2005-07-06 | Fujitsu Ltd | Data encoding and decoding apparatus and a data encoding and decoding method |
US10346524B1 (en) * | 2018-03-29 | 2019-07-09 | Sap Se | Position-dependent word salience estimation |
CN110457708B (en) * | 2019-08-16 | 2023-05-16 | 腾讯科技(深圳)有限公司 | Vocabulary mining method and device based on artificial intelligence, server and storage medium |
CN111949790A (en) * | 2020-07-20 | 2020-11-17 | 重庆邮电大学 | Emotion classification method based on LDA topic model and hierarchical neural network |
CN112199607A (en) * | 2020-10-30 | 2021-01-08 | 天津大学 | Microblog topic mining method based on fusion of parallel social contexts in variable neighborhood |
CN113434664B (en) * | 2021-06-30 | 2024-07-16 | 平安科技(深圳)有限公司 | Text abstract generation method, device, medium and electronic equipment |
CN114116974A (en) * | 2021-11-19 | 2022-03-01 | 深圳市东汇精密机电有限公司 | Emotional cause extraction method based on attention mechanism |
CN114281990A (en) * | 2021-12-17 | 2022-04-05 | 北京百度网讯科技有限公司 | Document classification method and device, electronic equipment and medium |
-
2023
- 2023-02-20 CN CN202310135750.9A patent/CN115879515B/en active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103970733A (en) * | 2014-04-10 | 2014-08-06 | 北京大学 | New Chinese word recognition method based on graph structure |
CN110866958A (en) * | 2019-10-28 | 2020-03-06 | 清华大学深圳国际研究生院 | Method for text to image |
CN112836017A (en) * | 2021-02-09 | 2021-05-25 | 天津大学 | Event detection method based on hierarchical theme-driven self-attention mechanism |
Also Published As
Publication number | Publication date |
---|---|
CN115879515A (en) | 2023-03-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Dhingra et al. | Embedding text in hyperbolic spaces | |
US11886955B2 (en) | Self-supervised data obfuscation in foundation models | |
CN111930895B (en) | MRC-based document data retrieval method, device, equipment and storage medium | |
Xu et al. | Microblog dimensionality reduction—a deep learning approach | |
CN113408706B (en) | Method and device for training user interest mining model and user interest mining | |
CN113590761A (en) | Training method of text processing model, text processing method and related equipment | |
Du et al. | Matrix factorization techniques in machine learning, signal processing, and statistics | |
CN110705279A (en) | Vocabulary selection method and device and computer readable storage medium | |
CN117312777A (en) | Industrial equipment time sequence generation method and device based on diffusion model | |
CN115169342A (en) | Text similarity calculation method and device, electronic equipment and storage medium | |
CN116127925B (en) | Text data enhancement method and device based on destruction processing of text | |
CN112307738B (en) | Method and device for processing text | |
Sheng et al. | LA-ESN: a novel method for time series classification | |
CN115879515B (en) | Document network theme modeling method, variation neighborhood encoder, terminal and medium | |
JP2019021218A (en) | Learning device, program parameter, learning method and model | |
CN116561298A (en) | Title generation method, device, equipment and storage medium based on artificial intelligence | |
Xu et al. | Treelstm with tag-aware hypernetwork for sentence representation | |
WO2024091291A1 (en) | Self-supervised data obfuscation in foundation models | |
CN110889293A (en) | Method, device, equipment and storage medium for constructing multi-level theme vector space | |
CN111723186A (en) | Knowledge graph generation method based on artificial intelligence for dialog system and electronic equipment | |
Zhang | Clustering high-dimensional time series based on parallelism | |
CN114491076B (en) | Data enhancement method, device, equipment and medium based on domain knowledge graph | |
Liu et al. | Machine learning and data analysis for word segmentation of classical Chinese poems: illustrations with Tang and Song examples | |
Ye et al. | Data Preparation and Engineering | |
Chao et al. | Deep cross-dimensional attention hashing for image retrieval |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |