CN115879515B - Document network theme modeling method, variation neighborhood encoder, terminal and medium - Google Patents

Document network theme modeling method, variation neighborhood encoder, terminal and medium Download PDF

Info

Publication number
CN115879515B
CN115879515B CN202310135750.9A CN202310135750A CN115879515B CN 115879515 B CN115879515 B CN 115879515B CN 202310135750 A CN202310135750 A CN 202310135750A CN 115879515 B CN115879515 B CN 115879515B
Authority
CN
China
Prior art keywords
document
representation
neighborhood
sample
topic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310135750.9A
Other languages
Chinese (zh)
Other versions
CN115879515A (en
Inventor
刘德喜
张子靖
刘嘉鸣
万齐智
邓辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangxi University of Finance and Economics
Original Assignee
Jiangxi University of Finance and Economics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangxi University of Finance and Economics filed Critical Jiangxi University of Finance and Economics
Priority to CN202310135750.9A priority Critical patent/CN115879515B/en
Publication of CN115879515A publication Critical patent/CN115879515A/en
Application granted granted Critical
Publication of CN115879515B publication Critical patent/CN115879515B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Landscapes

  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a document network theme modeling method, a variation neighborhood encoder, a terminal and a medium, wherein the method comprises the following steps: acquiring a document network set, and respectively determining document input representations of all documents in the document network set; inputting document input representation of each document into a pre-trained variation neighborhood encoder for encoding processing to obtain hidden layer representation of each document, and determining the representation of a central document according to the hidden layer representation; a document-topic distribution is determined from the representation of the central document and a topic-word distribution is determined from the document-topic distribution. The invention can effectively determine the representation of the center document based on the hidden layer representation of each document, can effectively determine the document-topic distribution based on the representation of the center document, and can effectively determine the topic-word distribution based on the document-topic distribution so as to achieve the effect of modeling the topics of the document network.

Description

Document network theme modeling method, variation neighborhood encoder, terminal and medium
Technical Field
The invention relates to the technical field of topic modeling, in particular to a document network topic modeling method, a variation neighborhood encoder, a terminal and a medium.
Background
A document network is a network composed of documents and their relationships, e.g., a network composed of academic papers that are referenced to each other, a network composed of web page text that are linked to each other, etc. The document network is used as an important component of text data, and people can better understand the content distribution of the document by acquiring the theme of the document in the document network, so that how to effectively model the theme of the document in the document network is a problem to be solved at present.
Disclosure of Invention
The embodiment of the invention aims to provide a document network topic modeling method, a variation neighborhood encoder, a terminal and a medium, which aim to solve the problem of how to effectively model topics of documents in a document network in the prior art.
The embodiment of the invention is realized in such a way that a document network theme modeling method comprises the following steps:
acquiring a document network set, and respectively determining document input representations of all documents in the document network set;
inputting document input representation of each document into a pre-trained variation neighborhood encoder to perform coding processing to obtain hidden layer representation, and determining the representation of a central document according to the hidden layer representation;
a document-topic distribution is determined from the representation of the central document and a topic-word distribution is determined from the document-topic distribution.
Further, the determining the formula adopted by the document input representation of each document in the document network set comprises:
Figure SMS_1
Figure SMS_2
Figure SMS_3
wherein, Vrepresenting a dictionary of words in the document collection,
Figure SMS_6
representation document->
Figure SMS_11
And document->
Figure SMS_14
Length of shortest path between, ∈>
Figure SMS_5
For documents->
Figure SMS_9
Chinese word->
Figure SMS_13
The number of occurrences>
Figure SMS_16
Is a text vector, ++>
Figure SMS_4
Is a 0-1 neighborhood vector,>
Figure SMS_8
is a high order neighborhood vector, ">
Figure SMS_12
Representative word->
Figure SMS_15
In document->
Figure SMS_7
Weights of->
Figure SMS_10
Representing a central document.
Further, before the encoding process is performed by the variance neighborhood encoder after the pre-training is input to the document input representation of each document, the method further comprises:
obtaining sample input representations of all sample documents, and inputting the sample input representations of all sample documents into the variation neighborhood encoder for coding processing to obtain sample inferred distribution parameters;
determining sample theme representations according to the sample inferred distribution parameters, and reconstructing each sample document according to the sample theme representations to obtain a reconstructed document;
determining prior loss according to sample inferred distribution parameters and prior normal distribution parameters of each sample document, and determining reconstruction loss according to each sample document and the reconstruction document;
and updating parameters of the variation neighborhood encoder according to the prior loss and the reconstruction loss until the variation neighborhood encoder converges, so as to obtain the pre-trained variation neighborhood encoder.
Further, the determining the prior loss according to the sample inferred distribution parameters and the prior normal distribution parameters of each sample document, and the formula adopted for determining the reconstruction loss according to each sample document and the reconstruction document comprises:
Figure SMS_17
wherein,
Figure SMS_19
for the neighborhood document of each sample document, +.>
Figure SMS_23
For neighborhood documents regenerated from hidden topics, < +.>
Figure SMS_26
Indicating total loss->
Figure SMS_20
Representing reconstruction loss, ++>
Figure SMS_22
Representing a priori loss,/->
Figure SMS_25
Representing weight parameters->
Figure SMS_28
For words in the sample document, +.>
Figure SMS_18
To reconstruct words in a sample, KL (-) represents the KL divergence of the sample inferred distribution parameters and a priori normal distribution parameters,μandσmean and variance of inferred distribution inferred by inference network in the variance neighborhood encoder, respectively, +.>
Figure SMS_21
And->
Figure SMS_24
For the mean and variance of the a priori normal distribution parameters,/->
Figure SMS_27
Is normally distributed.
Further, the determining a representation of the central document from the hidden layer representation includes:
performing re-parameterization and attention mechanism processing on the hidden layer representation to obtain a theme representation of the central document;
and aggregating the neighborhood documents of each document and the topic representation of the center document by using a dot product attention mechanism to obtain the representation of the center document.
Further, the encoding processing is performed on the document input representation of each document by the variance neighborhood encoder after the pre-training, and the obtaining of the formula adopted by the hidden layer representation includes:
Figure SMS_29
wherein,
Figure SMS_30
representing an activation function->
Figure SMS_33
All represent the training parameters of the encoder corresponding to the full connection layer in the variation neighborhood encoder,/->
Figure SMS_36
Figure SMS_32
Figure SMS_35
And->
Figure SMS_38
Figure SMS_39
Representing the total real space of the system,tfor the number of subjects to be the number,mfor dictionary size, ++>
Figure SMS_31
Representing logarithmic variance>
Figure SMS_34
Hidden layer representation representing a central document, +.>
Figure SMS_37
Representing slave scale andμandσthe resulting sample representation is randomly generated in the same multivariate normal distribution.
Further, the determining a formula employed by the document-topic distribution from the representation of the central document includes:
Figure SMS_40
wherein, N(d) Representing and centering documentsdThere is a set of neighborhood documents for the path between,
Figure SMS_43
representing a center documentdNeighborhood document of->
Figure SMS_46
The standard lognormal distribution is represented by a graph,weightrepresenting the extent of influence of the neighborhood document on the center document, < +.>
Figure SMS_49
Is a central documentdAnd neighborhood document->
Figure SMS_42
The shortest path length between->
Figure SMS_45
For the degree of association of the center document with the neighborhood document, < >>
Figure SMS_48
Transpose of hidden layer representation for center document, +.>
Figure SMS_50
Hidden layer representation for neighborhood document, +.>
Figure SMS_41
Attention coefficients for the center document hidden layer representation and the neighborhood document hidden layer representation,θfor document-topic distribution, < >>
Figure SMS_44
For an unnormalized central document theme representation, +.>
Figure SMS_47
Representing the normalization function.
It is another object of an embodiment of the present invention to provide a variational neighborhood encoder, where the variational neighborhood encoder is applied to any one of the document network topic modeling methods, and the variational neighborhood encoder includes:
an input layer for determining document input representations of respective documents in the document network set, respectively;
the coding layer is used for carrying out coding processing on the document input representation of each document to obtain a hidden layer representation of each document, and carrying out re-parameterization and attention mechanism processing on the hidden layer representation to obtain a theme representation of the central document;
an attention layer for gathering a topic representation of the neighborhood document and the center document of each document using dot product attention to obtain a representation of the center document;
and a decoder for determining a document-topic distribution from the representation of the central document and a topic-word distribution from the document-topic distribution.
It is a further object of an embodiment of the present invention to provide a terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, which processor implements the steps of the method as described above when executing the computer program.
It is a further object of embodiments of the present invention to provide a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above method.
According to the embodiment of the invention, the document input representation of each document in the document network set is respectively determined, the encoding effect can be effectively achieved on each document based on the document input representation, the hidden layer representation corresponding to each document can be effectively deduced through encoding processing of the variance neighborhood encoder after the document input representation of each document is input in the pre-training, the representation of the central document can be effectively determined based on the hidden layer representation, the document-topic distribution can be effectively determined based on the representation of the central document, and the topic-word distribution can be effectively determined based on the document-topic distribution, so that the topic modeling effect on the document is achieved.
Drawings
FIG. 1 is a flow chart of a document network topic modeling method provided in a first embodiment of the present invention;
FIG. 2 is a flow chart of a document network topic modeling method provided in a second embodiment of the present invention;
FIG. 3 is a schematic diagram of a variation neighborhood encoder according to a third embodiment of the present invention;
fig. 4 is a schematic structural diagram of a terminal device according to a fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
In order to illustrate the technical scheme of the invention, the following description is made by specific examples.
Example 1
Referring to fig. 1, a flowchart of a document network topic modeling method according to a first embodiment of the present invention is provided, where the document network topic modeling method can be applied to any terminal device or system, and the document network topic modeling method includes the steps of:
step S10, acquiring a document network set, and respectively determining document input representations of all documents in the document network set;
wherein a given one comprisesnDocument network set of individual documents
Figure SMS_51
By the following constitutionDDictionary of words in (a) is written asVWhich comprisesmIndividual words, document net setDCan be expressed as a document-word matrix +.>
Figure SMS_52
. Wherein (1)>
Figure SMS_53
Representative word->
Figure SMS_54
In document->
Figure SMS_55
For example, TF-IDF).
DThe relationship of the documents in the document is represented by a 0-1 neighborhood matrix, namely
Figure SMS_56
Wherein the element->
Figure SMS_57
Is 1 +.>
Figure SMS_58
Figure SMS_59
Presence edge, 0 representing absence edge, document networkGDenoted as->
Figure SMS_60
WhereinDRepresenting a set of network of documents,Arepresentation ofDA neighborhood matrix of the document in question,Xa document-word matrix is represented and,Vrepresentation ofDA dictionary of words in (a);
in this step, in order to model the high-order graph structure information, the neighborhood matrixANot only the direct connection relation of two documents, namely first order neighborhood information, but also the second order or higher order neighborhood information and a higher order neighborhood matrix are recorded
Figure SMS_61
The definition of the element is shown in the formula (1). Wherein (1)>
Figure SMS_62
Representation document->
Figure SMS_63
And document->
Figure SMS_64
The length of the path between them;
Figure SMS_65
the method comprises the steps of carrying out a first treatment on the surface of the Formula (1)
For document-word matrixXThe initialization is performed using a logarithmic regularization scheme as shown in equation (2). Wherein,
Figure SMS_66
for documents->
Figure SMS_67
Chinese word->
Figure SMS_68
Number of occurrences:
Figure SMS_69
the method comprises the steps of carrying out a first treatment on the surface of the Formula (2)>
Finally, the text vector is represented by formula (3)xNeighborhood vectoraOr higher order neighborhood vector
Figure SMS_70
Combining to obtain a document input representation of each document:
Figure SMS_71
the method comprises the steps of carrying out a first treatment on the surface of the Formula (3)
Step S20, inputting document input representation of each document into a pre-trained variation neighborhood encoder to perform coding processing to obtain hidden layer representation of each document, and determining the representation of a central document according to the hidden layer representation;
optionally, in this step, the determining a representation of the central document according to the hidden layer representation includes:
performing re-parameterization and attention mechanism processing on the hidden layer representation to obtain a theme representation of the central document;
gathering the neighborhood document of each document and the topic representation of the center document by using a dot product attention mechanism to obtain the representation of the center document;
wherein the pre-trained variational neighborhood Encoder (Variational Adjacent-Encoder, VADJE) encodes the central document via a fully-concatenated layer, infers hidden layer representations of the documents, and then obtains hidden layer representations of the central document via a re-parameterization and attention mechanism
Figure SMS_72
In this embodiment, normal distribution is used as prior distribution, the variation neighborhood encoder uses an arctangent function as an activation function of the full-connection layer in the encoding stage, and the training parameters are initialized by using Xavier gloot. The full-connection layer and re-parameterization process of the encoding stage is shown in formula (4):
Figure SMS_73
the method comprises the steps of carrying out a first treatment on the surface of the Formula (4)
Wherein,
Figure SMS_76
representing an activation function->
Figure SMS_79
All represent the mentionedTraining parameters of the encoder corresponding to the full connection layer in the variation neighborhood encoder>
Figure SMS_83
Figure SMS_75
Figure SMS_77
And->
Figure SMS_80
Figure SMS_82
Representing the total real space of the system,tfor the number of subjects to be the number,mfor dictionary size, ++>
Figure SMS_74
Representing logarithmic variance>
Figure SMS_78
Hidden layer representation representing a central document, +.>
Figure SMS_81
Representing slave scale andμandσthe resulting sample representation is randomly generated in the same multivariate normal distribution. The idea of the re-parameterisation is by distributing +.>
Figure SMS_84
And (3) sampling a variable, and carrying out affine transformation on the variable to obtain a needed hidden variable for the problem of back propagation in the VAE-like model.
Step S30, determining a document-topic distribution according to the representation of the central document, and determining a topic-word distribution according to the document-topic distribution;
wherein, after the re-parameterization, the variational neighborhood encoder uses the dot product attention-gathering neighborhood document
Figure SMS_85
Theme representation with center document->
Figure SMS_86
Obtaining an unnormalized central document theme representation +.>
Figure SMS_87
Subsequently the +.A.A. with the softmax function will be applied>
Figure SMS_88
Document-topic distribution for conversion to documentsθThe specific process is shown in the formula (5): />
Figure SMS_89
The method comprises the steps of carrying out a first treatment on the surface of the Formula (5)
Wherein, drepresenting a central document that is to be displayed,N(d) Representing and centering documentsdThere is a set of neighborhood documents for the path between,
Figure SMS_91
representing a center documentdNeighborhood document of->
Figure SMS_95
The standard lognormal distribution is represented by a graph,weightrepresenting the extent of influence of the neighborhood document on the center document, < +.>
Figure SMS_98
Is a central documentdAnd neighborhood document->
Figure SMS_92
The shortest path length between->
Figure SMS_94
For the degree of association of the center document with the neighborhood document, < >>
Figure SMS_97
Transpose of hidden layer representation for center document, +.>
Figure SMS_99
Hidden layer representation for neighborhood document, +.>
Figure SMS_90
Attention coefficients for the center document hidden layer representation and the neighborhood document hidden layer representation,θfor document-topic distribution, < >>
Figure SMS_93
For an unnormalized central document theme representation, +.>
Figure SMS_96
Representing the normalization function.
In the step, in the decoding stage, the variance neighborhood encoder generates a center document itself and a neighborhood document thereof based on the document relationship existing in the document network, not only by using the hidden subject of the center document, as shown in formula (6):
Figure SMS_100
the method comprises the steps of carrying out a first treatment on the surface of the Formula (6)
Wherein,
Figure SMS_101
to activate the function +.>
Figure SMS_102
And->
Figure SMS_103
Trainable parameters for the corresponding full-link layer in the decoder,>
Figure SMS_104
for neighborhood documents regenerated from hidden topics, the topic-word distribution can be obtained by softmax variation of weights and bias parameters in the decoderβ
In this embodiment, the step of generating the document network based on the variance neighborhood encoder includes:
when generating a document, firstly obtaining corresponding distribution parameters through an inference network in a variation neighborhood encoder
Figure SMS_105
And->
Figure SMS_106
Figure SMS_107
And->
Figure SMS_108
Respectively representing a mean value inference network and a standard deviation inference network of the VADJE;
generating topic distribution of documents by re-parameterization
Figure SMS_109
. For a given text, each word is generated from the word distribution of the corresponding text, which may be defined by the topic distribution of the document +.>
Figure SMS_110
Word distribution with topics
Figure SMS_111
Obtained and distributed in a plurality of ways, namely:
Figure SMS_112
wherein,
Figure SMS_113
representing a center documentdWord of->
Figure SMS_114
Representing a multiple-term distribution, modeling a document connection as a Bernoulli binary variable when it is generated, and calculating the probability of the existence of the connection from the topic distribution of the document, i.e
Figure SMS_115
Wherein->
Figure SMS_116
Representing the fully connected layer of the neural network, +.>
Figure SMS_117
Representing the bernoulli distribution.
Specifically, for each document
Figure SMS_118
;/>
Generating a mean vector
Figure SMS_119
Figure SMS_120
Generating logarithmic covariance
Figure SMS_121
Figure SMS_122
Generating samples of a normal distribution of multiple criteria
Figure SMS_123
Figure SMS_124
Generating text topic distributions
Figure SMS_125
Figure SMS_126
For each word
Figure SMS_127
Generating words
Figure SMS_128
For each pair of documentsdAnd
Figure SMS_129
generating a connection:
Figure SMS_130
in this embodiment, by determining the document input representations of the documents in the document network set, the encoding effect can be effectively achieved on the basis of the document input representations, the hidden layer representations corresponding to the documents can be effectively deduced by encoding the document input representations of the documents by the variance neighborhood encoder after the pre-training, the representation to the central document can be effectively determined on the basis of the hidden layer representations, the document-topic distribution can be effectively determined on the basis of the representation of the central document, and the topic-word distribution can be effectively determined on the basis of the document-topic distribution, so as to achieve the topic modeling effect on the documents.
Example two
Referring to fig. 2, a flowchart of a document network topic modeling method according to a second embodiment of the present invention is provided, where the steps before step S20 in the first embodiment are further refined, and the method includes the steps of:
step S40, obtaining sample input representations of all sample documents, and inputting the sample input representations of all sample documents into the variation neighborhood encoder for encoding processing to obtain sample inferred distribution parameters;
based on formulas (1) to (3), respectively obtaining sample input representations of all sample documents, and inputting the sample input representations of all sample documents into a variation neighborhood encoder for encoding processing to obtain sample inferred distribution parameters;
step S50, determining sample topic representations according to the sample inferred distribution parameters, and reconstructing each sample document according to the sample topic representations to obtain a reconstructed document;
the method comprises the steps of carrying out reparameterization and attention mechanism processing on sample topic representations to obtain sample topic representations, after reparameterization, using a variational neighborhood encoder to gather neighborhood documents of all sample documents and the sample topic representations by dot product attention to obtain sample representations, then converting the sample representations into sample document-topic distributions by a softmax function, determining sample topic-word distributions based on the sample document-topic distributions, and reconstructing all sample documents based on the sample topic-word distributions to obtain reconstructed documents;
step S60, determining prior loss according to sample inferred distribution parameters and prior normal distribution parameters of each sample document, and determining reconstruction loss according to each sample document and the reconstruction document;
wherein, in the model training stage, for each document, the loss function of the variational neighborhood encoder is divided into two parts, namely reconstruction loss and priori loss: reconstruction loss is the KL divergence between the inferred distribution obtained by the inferred network and the prior normal distribution, as shown in formula (7):
Figure SMS_131
the method comprises the steps of carrying out a first treatment on the surface of the Formula (7)
Wherein,
Figure SMS_132
neighborhood document for each sample documentdAlso one of its own neighborhood documents)>
Figure SMS_133
For neighborhood documents regenerated from hidden topics, KL (·) represents the KL divergence of the sample inferred distribution parameters and a priori normal distribution parameters,μandσmeans and variances of inferred distributions inferred for an inference network in the variational neighborhood encoder, the inferred distributions being in the form of normal distributions, parameters includingμAndσ
Figure SMS_134
and->
Figure SMS_135
For the mean and variance of the a priori normal parameters, +.>
Figure SMS_136
Is normally distributed.
Step S70, updating parameters of the variation neighborhood encoder according to the prior loss and the reconstruction loss until the variation neighborhood encoder converges, so as to obtain the pre-trained variation neighborhood encoder;
if the current iteration number of the variation neighborhood encoder is greater than or equal to a number threshold, the variation neighborhood encoder is determined to converge, and the number threshold can be set according to requirements.
In this embodiment, sample input representation of each sample document is input to a variation neighborhood encoder to perform encoding processing, so that sample inference distribution parameters corresponding to each sample document can be effectively obtained, sample topic representations can be effectively determined based on the sample inference distribution parameters, each sample document can be effectively reconstructed based on the sample topic representations to obtain a reconstructed document, priori losses to a variation adjacent encoder can be effectively determined based on the sample inference distribution parameters and priori normal distribution parameters of each sample document, reconstruction losses to the variation neighborhood encoder can be effectively determined based on each sample document and the reconstructed document, parameter updating can be effectively performed to the variation neighborhood encoder based on the priori losses and the reconstruction losses, accuracy of parameters in the variation neighborhood encoder is improved, and accuracy of modeling of a document network topic is improved.
Example III
Referring to fig. 3, a schematic structure diagram of a variable neighborhood encoder according to a third embodiment of the present invention includes:
an input layer for determining document input representations of respective documents in the document network set, respectively; wherein, for each document in the document network, the input layer aims to obtain the corresponding input representation of each document.
And the coding layer is used for carrying out coding processing on the document input representation of each document to obtain a hidden layer representation of each document, and carrying out re-parameterization and attention mechanism processing on the hidden layer representation to obtain a theme representation of the central document.
The coding layer comprises an encoder 10 and a re-parameterization layer 11, the encoder is used for encoding the central document through a full connection layer to infer a hidden layer representation, and the re-parameterization layer 11 is used for obtaining a theme representation of the central document through a re-parameterization and attention mechanism. A normal distribution is used as a priori distribution in the encoder 10.
An attention layer 12 for gathering a topic representation of the center document and a neighborhood document of each document using dot product attention, resulting in a representation of the center document; wherein, after the reparameterization, the variation neighborhood encoder uses the dot product attention to gather the topic representations of the neighborhood document and the center document to obtain a representation of the center document.
A decoder 13 for determining a document-topic distribution from the representation of the central document and a topic-word distribution from the document-topic distribution.
According to the embodiment, the document input representations of the documents in the document network set are respectively determined, the encoding effect can be effectively achieved on the documents based on the document input representations, the hidden layer representations corresponding to the documents can be effectively deduced through encoding processing of the variance neighborhood encoder after the document input representations of the documents are input and pre-trained, the representation of the central document can be effectively determined based on the hidden layer representations, the document-topic distribution can be effectively determined based on the representation of the central document, the topic-word distribution can be effectively determined based on the document-topic distribution, and the topic modeling effect on the documents can be achieved.
Example IV
Fig. 4 is a block diagram of a terminal device 2 according to a fourth embodiment of the present application. As shown in fig. 4, the terminal device 2 of this embodiment includes: a processor 20, a memory 21 and a computer program 22 stored in said memory 21 and executable on said processor 20, for example a program of a document network topic modeling method. The steps of the various embodiments of the document network topic modeling method described above are implemented by the processor 20 when executing the computer program 22.
Illustratively, the computer program 22 may be partitioned into one or more modules that are stored in the memory 21 and executed by the processor 20 to complete the present application. The one or more modules may be a series of computer program instruction segments capable of performing specific functions for describing the execution of the computer program 22 in the terminal device 2. The terminal device may include, but is not limited to, a processor 20, a memory 21.
The processor 20 may be a central processing unit (Central Processing Unit, CPU), other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), off-the-shelf programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 21 may be an internal storage unit of the terminal device 2, such as a hard disk or a memory of the terminal device 2. The memory 21 may be an external storage device of the terminal device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the terminal device 2. Further, the memory 21 may also include both an internal storage unit and an external storage device of the terminal device 2. The memory 21 is used for storing the computer program as well as other programs and data required by the terminal device. The memory 21 may also be used for temporarily storing data that has been output or is to be output.
In addition, each functional module in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated modules, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Wherein the computer readable storage medium may be nonvolatile or volatile. Based on such understanding, the present application may implement all or part of the flow of the method of the above embodiment, or may be implemented by a computer program to instruct related hardware, where the computer program may be stored in a computer readable storage medium, and when the computer program is executed by a processor, the computer program may implement the steps of each method embodiment described above. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, executable files or in some intermediate form, etc. The computer readable storage medium may include: any entity or device capable of carrying computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth.
The above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims (7)

1. A document network topic modeling method, the method comprising the steps of:
acquiring a document network set, and respectively determining document input representations of all documents in the document network set;
inputting document input representation of each document into a pre-trained variation neighborhood encoder for encoding processing to obtain hidden layer representation of each document, and determining the representation of a central document according to the hidden layer representation;
determining a document-topic distribution from the representation of the central document and a topic-word distribution from the document-topic distribution;
the determining the formula adopted by the document input representation of each document in the document network set comprises the following steps:
Figure QLYQS_1
Figure QLYQS_2
Figure QLYQS_3
wherein, Vrepresenting a dictionary of words in the document collection,
Figure QLYQS_6
representation document->
Figure QLYQS_9
And document->
Figure QLYQS_13
The length of the shortest path between them,
Figure QLYQS_5
for documents->
Figure QLYQS_11
Chinese word->
Figure QLYQS_14
The number of occurrences>
Figure QLYQS_16
Is a text vector, ++>
Figure QLYQS_4
Is a 0-1 neighborhood vector,>
Figure QLYQS_8
is a high-order neighborhood vector that,
Figure QLYQS_12
representative word->
Figure QLYQS_15
In document->
Figure QLYQS_7
Weights of->
Figure QLYQS_10
Representing a central document;
before the document input representation of each document is input to the pre-trained variance neighborhood encoder for encoding, the method further comprises the following steps:
obtaining sample input representations of all sample documents, and inputting the sample input representations of all sample documents into the variation neighborhood encoder for coding processing to obtain sample inferred distribution parameters;
determining sample theme representations according to the sample inferred distribution parameters, and reconstructing each sample document according to the sample theme representations to obtain a reconstructed document;
determining prior loss according to sample inferred distribution parameters and prior normal distribution parameters of each sample document, and determining reconstruction loss according to each sample document and the reconstruction document;
updating parameters of the variation neighborhood encoder according to the prior loss and the reconstruction loss until the variation neighborhood encoder converges, so as to obtain the variation neighborhood encoder after pre-training;
the prior loss is determined according to the sample inferred distribution parameters and the prior normal distribution parameters of each sample document, and the formula adopted for determining the reconstruction loss according to each sample document and the reconstruction document comprises the following steps:
Figure QLYQS_17
wherein,
Figure QLYQS_20
for the neighborhood document of each sample document, +.>
Figure QLYQS_24
For neighborhood documents regenerated from hidden topics, < +.>
Figure QLYQS_28
Indicating total loss->
Figure QLYQS_19
Representing reconstruction loss, ++>
Figure QLYQS_23
Representing a priori loss,/->
Figure QLYQS_27
Representing weight parameters->
Figure QLYQS_30
For words in the sample document, +.>
Figure QLYQS_18
For reconstructing words in a sample, KL (·) represents the KL divergence of the sample inferred distribution parameters and a priori normal distribution parameters, +.>
Figure QLYQS_22
And->
Figure QLYQS_26
Mean and variance of inferred distribution inferred by inference network in the variance neighborhood encoder, respectively, +.>
Figure QLYQS_29
And->
Figure QLYQS_21
For the mean and variance of the a priori normal distribution parameters,/->
Figure QLYQS_25
Is normally distributed.
2. The document network topic modeling method of claim 1, wherein said determining a representation of a center document from said hidden layer representation includes:
performing re-parameterization and attention mechanism processing on the hidden layer representation to obtain a theme representation of the central document;
and aggregating the neighborhood documents of each document and the topic representation of the center document by using a dot product attention mechanism to obtain the representation of the center document.
3. The modeling method of document network topics according to claim 2, wherein the encoding the document input representation of each document by the variance neighborhood encoder after the pre-training to obtain the formula adopted by the hidden layer representation of each document comprises:
Figure QLYQS_31
wherein,
Figure QLYQS_34
representing an activation function->
Figure QLYQS_38
All represent the training parameters of the encoder corresponding to the full connection layer in the variation neighborhood encoder,/->
Figure QLYQS_41
Figure QLYQS_35
Figure QLYQS_37
And->
Figure QLYQS_40
Figure QLYQS_43
Representing the total real space, +.>
Figure QLYQS_32
For the number of subjects->
Figure QLYQS_36
For dictionary size, ++>
Figure QLYQS_39
Representing logarithmic variance>
Figure QLYQS_42
Hidden layer representation representing a central document, +.>
Figure QLYQS_33
Representing slave scale andμandσthe resulting sample representation is randomly generated in the same multivariate normal distribution.
4. The document network topic modeling method of claim 3, wherein said determining a formula for document-topic distribution from the representation of the center document includes:
Figure QLYQS_44
wherein,
Figure QLYQS_45
representation +.>
Figure QLYQS_53
Adjacent with paths betweenDomain document set,/->
Figure QLYQS_58
Representing a center document
Figure QLYQS_49
Neighborhood document of->
Figure QLYQS_57
Representing a standard lognormal distribution,/->
Figure QLYQS_52
Representing the extent of influence of the neighborhood document on the center document, < +.>
Figure QLYQS_55
Is a center document->
Figure QLYQS_48
And neighborhood document->
Figure QLYQS_59
The shortest path length between->
Figure QLYQS_46
For the degree of association of the center document with the neighborhood document, < >>
Figure QLYQS_54
Transpose of hidden layer representation for center document, +.>
Figure QLYQS_47
Hidden layer representation for neighborhood document, +.>
Figure QLYQS_56
Attention coefficients for the central document hidden layer representation and the neighborhood document hidden layer representation, +.>
Figure QLYQS_50
For document-topic distribution, < >>
Figure QLYQS_60
For an unnormalized central document theme representation, +.>
Figure QLYQS_51
Representing the normalization function.
5. A variational neighborhood encoder, applied to the document network topic modeling method of any of claims 1-4, said variational neighborhood encoder comprising:
an input layer for determining document input representations of respective documents in the document network set, respectively;
the coding layer is used for carrying out coding processing on the document input representation of each document to obtain a hidden layer representation of each document, and carrying out re-parameterization and attention mechanism processing on the hidden layer representation to obtain a theme representation of the central document;
an attention layer for gathering a topic representation of the neighborhood document and the center document of each document using dot product attention to obtain a representation of the center document;
and a decoder for determining a document-topic distribution from the representation of the central document and a topic-word distribution from the document-topic distribution.
6. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 4 when the computer program is executed.
7. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 4.
CN202310135750.9A 2023-02-20 2023-02-20 Document network theme modeling method, variation neighborhood encoder, terminal and medium Active CN115879515B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310135750.9A CN115879515B (en) 2023-02-20 2023-02-20 Document network theme modeling method, variation neighborhood encoder, terminal and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310135750.9A CN115879515B (en) 2023-02-20 2023-02-20 Document network theme modeling method, variation neighborhood encoder, terminal and medium

Publications (2)

Publication Number Publication Date
CN115879515A CN115879515A (en) 2023-03-31
CN115879515B true CN115879515B (en) 2023-05-12

Family

ID=85761364

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310135750.9A Active CN115879515B (en) 2023-02-20 2023-02-20 Document network theme modeling method, variation neighborhood encoder, terminal and medium

Country Status (1)

Country Link
CN (1) CN115879515B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103970733A (en) * 2014-04-10 2014-08-06 北京大学 New Chinese word recognition method based on graph structure
CN110866958A (en) * 2019-10-28 2020-03-06 清华大学深圳国际研究生院 Method for text to image
CN112836017A (en) * 2021-02-09 2021-05-25 天津大学 Event detection method based on hierarchical theme-driven self-attention mechanism

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2386039B (en) * 2002-03-01 2005-07-06 Fujitsu Ltd Data encoding and decoding apparatus and a data encoding and decoding method
US10346524B1 (en) * 2018-03-29 2019-07-09 Sap Se Position-dependent word salience estimation
CN110457708B (en) * 2019-08-16 2023-05-16 腾讯科技(深圳)有限公司 Vocabulary mining method and device based on artificial intelligence, server and storage medium
CN111949790A (en) * 2020-07-20 2020-11-17 重庆邮电大学 Emotion classification method based on LDA topic model and hierarchical neural network
CN112199607A (en) * 2020-10-30 2021-01-08 天津大学 Microblog topic mining method based on fusion of parallel social contexts in variable neighborhood
CN113434664B (en) * 2021-06-30 2024-07-16 平安科技(深圳)有限公司 Text abstract generation method, device, medium and electronic equipment
CN114116974A (en) * 2021-11-19 2022-03-01 深圳市东汇精密机电有限公司 Emotional cause extraction method based on attention mechanism
CN114281990A (en) * 2021-12-17 2022-04-05 北京百度网讯科技有限公司 Document classification method and device, electronic equipment and medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103970733A (en) * 2014-04-10 2014-08-06 北京大学 New Chinese word recognition method based on graph structure
CN110866958A (en) * 2019-10-28 2020-03-06 清华大学深圳国际研究生院 Method for text to image
CN112836017A (en) * 2021-02-09 2021-05-25 天津大学 Event detection method based on hierarchical theme-driven self-attention mechanism

Also Published As

Publication number Publication date
CN115879515A (en) 2023-03-31

Similar Documents

Publication Publication Date Title
Dhingra et al. Embedding text in hyperbolic spaces
US11886955B2 (en) Self-supervised data obfuscation in foundation models
CN111930895B (en) MRC-based document data retrieval method, device, equipment and storage medium
Xu et al. Microblog dimensionality reduction—a deep learning approach
CN113408706B (en) Method and device for training user interest mining model and user interest mining
CN113590761A (en) Training method of text processing model, text processing method and related equipment
Du et al. Matrix factorization techniques in machine learning, signal processing, and statistics
CN110705279A (en) Vocabulary selection method and device and computer readable storage medium
CN117312777A (en) Industrial equipment time sequence generation method and device based on diffusion model
CN115169342A (en) Text similarity calculation method and device, electronic equipment and storage medium
CN116127925B (en) Text data enhancement method and device based on destruction processing of text
CN112307738B (en) Method and device for processing text
Sheng et al. LA-ESN: a novel method for time series classification
CN115879515B (en) Document network theme modeling method, variation neighborhood encoder, terminal and medium
JP2019021218A (en) Learning device, program parameter, learning method and model
CN116561298A (en) Title generation method, device, equipment and storage medium based on artificial intelligence
Xu et al. Treelstm with tag-aware hypernetwork for sentence representation
WO2024091291A1 (en) Self-supervised data obfuscation in foundation models
CN110889293A (en) Method, device, equipment and storage medium for constructing multi-level theme vector space
CN111723186A (en) Knowledge graph generation method based on artificial intelligence for dialog system and electronic equipment
Zhang Clustering high-dimensional time series based on parallelism
CN114491076B (en) Data enhancement method, device, equipment and medium based on domain knowledge graph
Liu et al. Machine learning and data analysis for word segmentation of classical Chinese poems: illustrations with Tang and Song examples
Ye et al. Data Preparation and Engineering
Chao et al. Deep cross-dimensional attention hashing for image retrieval

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant