CN115879515B - Document Network Topic Modeling Method, Variational Neighborhood Encoder, Terminal and Media - Google Patents


Info

Publication number
CN115879515B
Authority
CN
China
Prior art keywords
document
representation
neighborhood
topic
sample
Prior art date
Legal status
Active
Application number
CN202310135750.9A
Other languages
Chinese (zh)
Other versions
CN115879515A (en)
Inventor
刘德喜
张子靖
刘嘉鸣
万齐智
邓辉
Current Assignee
Jiangxi University of Finance and Economics
Original Assignee
Jiangxi University of Finance and Economics
Priority date
Filing date
Publication date
Application filed by Jiangxi University of Finance and Economics
Priority to CN202310135750.9A
Publication of CN115879515A
Application granted
Publication of CN115879515B
Legal status: Active
Anticipated expiration

Landscapes

  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides a document network topic modeling method, a variational neighborhood encoder, a terminal and a medium. The method includes: obtaining a document network set, and respectively determining the document input representation of each document in the document network set; inputting the document input representation of each document into a pre-trained variational neighborhood encoder for encoding processing to obtain the hidden-layer representation of each document, and determining the representation of the central document according to the hidden-layer representations; and determining the document-topic distribution according to the representation of the central document, and determining the topic-word distribution according to the document-topic distribution. The invention effectively determines the representation of the central document from the hidden-layer representations of the documents, the document-topic distribution from the representation of the central document, and the topic-word distribution from the document-topic distribution, thereby achieving topic modeling of the document network.

Description

Document network topic modeling method, variational neighborhood encoder, terminal and medium

Technical Field

The present invention relates to the technical field of topic modeling, and in particular to a document network topic modeling method, a variational neighborhood encoder, a terminal and a medium.

Background Art

A document network is a network composed of documents and the relationships among them, for example, a network of academic papers citing one another, or a network of web pages linking to one another. Document networks are an important form of text data, and obtaining the topics of the documents in a document network helps people better understand the content distribution of the documents. How to effectively perform topic modeling on the documents in a document network is therefore a problem in urgent need of a solution.

Summary of the Invention

The purpose of the embodiments of the present invention is to provide a document network topic modeling method, a variational neighborhood encoder, a terminal and a medium, aiming to solve the prior-art problem of how to effectively perform topic modeling on the documents in a document network.

An embodiment of the present invention is implemented as a document network topic modeling method, the method comprising:

obtaining a document network set, and respectively determining the document input representation of each document in the document network set;

inputting the document input representation of each document into a pre-trained variational neighborhood encoder for encoding processing to obtain a hidden-layer representation, and determining the representation of the central document according to the hidden-layer representation;

determining a document-topic distribution according to the representation of the central document, and determining a topic-word distribution according to the document-topic distribution.

Further, the formulas used to respectively determine the document input representation of each document in the document network set include:

[Formulas (1)-(3): equation images not reproduced; they define the high-order neighborhood vector from shortest-path lengths, the log-regularized word weights, and the combination of the text vector with the neighborhood vector; see the detailed description.]

where V denotes the dictionary formed by the words in the document set; sp(d_i, d_j) denotes the length of the shortest path between documents d_i and d_j; tf_{ij} is the number of occurrences of word w_j in document d_i; x is the text vector; a is the 0-1 neighborhood vector; a^{(h)} is the high-order neighborhood vector; x_{ij} represents the weight of word w_j in document d_i; and d denotes the central document.

Further, before inputting the document input representation of each document into the pre-trained variational neighborhood encoder for encoding processing, the method further includes:

obtaining a sample input representation of each sample document, and inputting the sample input representation of each sample document into the variational neighborhood encoder for encoding processing to obtain sample inferred distribution parameters;

determining a sample topic representation according to the sample inferred distribution parameters, and reconstructing each sample document according to the sample topic representation to obtain reconstructed documents;

determining a prior loss according to the sample inferred distribution parameters and prior normal distribution parameters of each sample document, and determining a reconstruction loss according to each sample document and the reconstructed documents;

updating parameters of the variational neighborhood encoder according to the prior loss and the reconstruction loss until the variational neighborhood encoder converges, to obtain the pre-trained variational neighborhood encoder.

Further, the formula used for determining the prior loss according to the sample inferred distribution parameters and the prior normal distribution parameters of each sample document, and determining the reconstruction loss according to each sample document and the reconstructed documents, includes:

[Equation image not reproduced; it defines the total loss L from the reconstruction loss L_rec and the prior loss L_prior, weighted by the parameter λ.]

where d_j is a neighborhood document of each sample document; x̂ is a neighborhood document regenerated from the hidden topics; L denotes the total loss; L_rec denotes the reconstruction loss; L_prior denotes the prior loss; λ denotes the weight parameter; w is a word in a sample document; ŵ is a word in a reconstructed sample; KL(·) denotes the KL divergence between the sample inferred distribution parameters and the prior normal distribution parameters; μ and σ are respectively the mean and variance of the inferred distribution produced by the inference network in the variational neighborhood encoder; μ₀ and σ₀ are the mean and variance of the prior normal distribution parameters; and N denotes the normal distribution.

Further, determining the representation of the central document according to the hidden-layer representation includes:

applying reparameterization and attention-mechanism processing to the hidden-layer representation to obtain the topic representation of the central document;

aggregating the neighborhood documents of each document and the topic representation of the central document using a dot-product attention mechanism to obtain the representation of the central document.

Further, the formula used for inputting the document input representation of each document into the pre-trained variational neighborhood encoder for encoding processing to obtain the hidden-layer representation includes:

[Equation image not reproduced; the encoding-stage fully connected layers produce the mean μ and the logarithmized variance log σ², and reparameterization yields the hidden representation h = μ + σ ⊙ ε.]

where f denotes the activation function; the weight and bias parameters of the corresponding fully connected layers are training parameters of the encoder in the variational neighborhood encoder, with the weights in R^{t×m} and the biases in R^t; R denotes the space of all real numbers; t is the number of topics; m is the dictionary size; log σ² denotes the logarithmized variance; h denotes the hidden-layer representation of the central document; and ε denotes a sample randomly generated from a multivariate standard normal distribution of the same size as μ and σ.

Further, the formula used to determine the document-topic distribution according to the representation of the central document includes:

[Equation image not reproduced; attention coefficients built from a shortest-path-based influence weight and the dot product of hidden-layer representations aggregate the neighborhood representations into the unnormalized topic representation z̃, and θ = softmax(z̃).]

where N(d) denotes the set of neighborhood documents between which and the central document d a path exists; d_j denotes a neighborhood document of the central document d; LN denotes the standard log-normal distribution; weight denotes the degree of influence of a neighborhood document on the central document; sp(d, d_j) is the length of the shortest path between the central document d and the neighborhood document d_j; the association score denotes the degree of association between the central document and a neighborhood document; h_d^T is the transpose of the hidden-layer representation of the central document; h_{d_j} is the hidden-layer representation of a neighborhood document; α is the attention coefficient between the hidden-layer representation of the central document and those of the neighborhood documents; θ is the document-topic distribution; z̃ is the unnormalized central-document topic representation; and softmax denotes the normalization function.

Another object of the embodiments of the present invention is to provide a variational neighborhood encoder, applied to any one of the document network topic modeling methods described above, the variational neighborhood encoder comprising:

an input layer for respectively determining the document input representation of each document in a document network set;

an encoding layer for encoding the document input representation of each document to obtain the hidden-layer representation of each document, and for applying reparameterization and attention-mechanism processing to the hidden-layer representations to obtain the topic representation of the central document;

an attention layer for aggregating the neighborhood documents of each document and the topic representation of the central document using dot-product attention to obtain the representation of the central document;

a decoder for determining the document-topic distribution according to the representation of the central document, and determining the topic-word distribution according to the document-topic distribution.

Another object of the embodiments of the present invention is to provide a terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above method when executing the computer program.

Another object of the embodiments of the present invention is to provide a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the above method.

In the embodiments of the present invention, by respectively determining the document input representation of each document in the document network set, each document can be encoded effectively on the basis of its input representation; by inputting each document's input representation into the pre-trained variational neighborhood encoder for encoding processing, the hidden-layer representation corresponding to each document can be inferred effectively; the representation of the central document is determined effectively from the hidden-layer representations, the document-topic distribution from the representation of the central document, and the topic-word distribution from the document-topic distribution, thereby achieving topic modeling of the documents.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flowchart of the document network topic modeling method provided by the first embodiment of the present invention;

FIG. 2 is a flowchart of the document network topic modeling method provided by the second embodiment of the present invention;

FIG. 3 is a schematic structural diagram of the variational neighborhood encoder provided by the third embodiment of the present invention;

FIG. 4 is a schematic structural diagram of the terminal device provided by the fourth embodiment of the present invention.

DETAILED DESCRIPTION

In order to make the purpose, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are intended only to explain the present invention and are not intended to limit it.

The technical solutions of the present invention are illustrated below through specific embodiments.

Embodiment 1

Referring to FIG. 1, a flowchart of the document network topic modeling method provided by the first embodiment of the present invention, the method can be applied to any terminal device or system and includes the following steps:

Step S10: obtain a document network set, and respectively determine the document input representation of each document in the document network set.

Given a document network set D = {d_1, …, d_n} containing n documents, the dictionary formed by the words in D is denoted V and contains m words. The document network set D can be represented as a document-word matrix X, in which the entry x_{ij} represents the weight of word w_j in document d_i (for example, its TF-IDF value).

The relationships among the documents in D are represented by a 0-1 neighborhood matrix A, in which an element a_{ij} equal to 1 indicates that an edge exists between documents d_i and d_j, and 0 indicates that no edge exists. The document network is denoted G = (D, A, X, V), where D denotes the document network set, A denotes the neighborhood matrix of the documents in D, X denotes the document-word matrix, and V denotes the dictionary formed by the words in D.

In this step, in order to model high-order graph structure information, the neighborhood matrix A should record not only the direct connection relationships between documents, i.e. the first-order neighborhood information, but also second-order and even higher-order neighborhood information. The elements of the high-order neighborhood matrix A^{(h)} are defined as shown in formula (1), where sp(d_i, d_j) denotes the length of the path between documents d_i and d_j:

[Formula (1): equation image not reproduced; it defines the elements of the high-order neighborhood matrix A^{(h)} in terms of the path length sp(d_i, d_j).]

The document-word matrix X is initialized with logarithmic regularization, as shown in formula (2), where tf_{ij} is the number of occurrences of word w_j in document d_i:

[Formula (2): equation image not reproduced; it gives the log-regularized initialization of the entries x_{ij} from the term counts tf_{ij}.]

Finally, as shown in formula (3), the text vector x is combined with the neighborhood vector a or the high-order neighborhood vector a^{(h)} to obtain the document input representation of each document:

[Formula (3): equation image not reproduced; it combines the text vector x with the neighborhood vector a or the high-order neighborhood vector a^{(h)}.]
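
Since the exact forms of formulas (1)-(3) survive only as images in this text, the following is a minimal sketch of one consistent reading: log(1 + tf) term weighting for formula (2), a binary high-order neighborhood within h hops for formula (1), and concatenation for the combination in formula (3). All function and variable names are illustrative assumptions, not the patent's own notation:

```python
import numpy as np
from collections import deque

def shortest_path_lengths(adj, source):
    """BFS shortest-path lengths (in hops) from `source` over a 0-1 adjacency matrix."""
    n = adj.shape[0]
    dist = np.full(n, np.inf)
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in np.nonzero(adj[u])[0]:
            if dist[v] == np.inf:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def document_input_representation(tf, adj, doc_index, h=2):
    """Build one document's input representation [x ; a^(h)] (sketch).

    tf  : (n_docs, vocab) raw term-count matrix
    adj : (n_docs, n_docs) 0-1 neighborhood matrix
    h   : maximum neighborhood order to include (assumed reading of formula (1))
    """
    x = np.log1p(tf[doc_index])                  # formula (2), assumed: log(1 + tf_ij)
    sp = shortest_path_lengths(adj, doc_index)
    a_h = ((sp > 0) & (sp <= h)).astype(float)   # formula (1), assumed: 1 within h hops
    return np.concatenate([x, a_h])              # formula (3), assumed: concatenation
```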

Step S20: input the document input representation of each document into the pre-trained variational neighborhood encoder for encoding processing to obtain the hidden-layer representation of each document, and determine the representation of the central document according to the hidden-layer representations.

Optionally, in this step, determining the representation of the central document according to the hidden-layer representations includes:

applying reparameterization and attention-mechanism processing to the hidden-layer representations to obtain the topic representation of the central document;

aggregating the neighborhood documents of each document and the topic representation of the central document using a dot-product attention mechanism to obtain the representation of the central document.

The pre-trained variational neighborhood encoder (Variational Adjacent-Encoder, VADJE) encodes the central document through a fully connected layer to infer the hidden-layer representation of each document, and then obtains the hidden-layer representation h of the central document through reparameterization and the attention mechanism.

This embodiment uses the normal distribution as the prior distribution. The variational neighborhood encoder uses the arctangent function as the activation function of the fully connected layers in the encoding stage, and initializes the training parameters with Xavier Glorot initialization. The fully connected layers of the encoding stage and the reparameterization process are shown in formula (4):

[Formula (4): equation image not reproduced; the encoding-stage fully connected layers produce the mean μ and the logarithmized variance log σ², and reparameterization yields the hidden representation h = μ + σ ⊙ ε.]

where f denotes the activation function; the weight and bias parameters of the corresponding fully connected layers are training parameters of the encoder in the variational neighborhood encoder, with the weights in R^{t×m} and the biases in R^t; R denotes the space of all real numbers; t is the number of topics; m is the dictionary size; log σ² denotes the logarithmized variance; h denotes the hidden-layer representation of the central document; and ε denotes a sample randomly generated from a multivariate standard normal distribution of the same size as μ and σ. The idea of reparameterization is to sample a variable from the standard normal distribution and then apply an affine transformation to that variable to obtain the required latent variable, which addresses the back-propagation problem in VAE-style models.
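
A minimal PyTorch sketch of the encoding stage described above, assuming the arctangent activation and Xavier Glorot initialization mentioned in the text; the module and parameter names are illustrative:

```python
import torch
import torch.nn as nn

class VADJEEncoder(nn.Module):
    """Inference networks of the variational neighborhood encoder (sketch)."""
    def __init__(self, input_dim, num_topics):
        super().__init__()
        self.fc_mu = nn.Linear(input_dim, num_topics)      # mean inference network
        self.fc_logvar = nn.Linear(input_dim, num_topics)  # std-dev inference network
        for fc in (self.fc_mu, self.fc_logvar):
            nn.init.xavier_uniform_(fc.weight)             # Xavier Glorot initialization

    def forward(self, x_tilde):
        mu = torch.atan(self.fc_mu(x_tilde))               # arctangent activation
        logvar = torch.atan(self.fc_logvar(x_tilde))       # logarithmized variance
        sigma = torch.exp(0.5 * logvar)
        eps = torch.randn_like(sigma)                      # ε sampled from N(0, I)
        h = mu + sigma * eps                               # reparameterization, h = μ + σ ⊙ ε
        return h, mu, logvar
```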

Step S30: determine the document-topic distribution according to the representation of the central document, and determine the topic-word distribution according to the document-topic distribution.

After reparameterization, the variational neighborhood encoder uses dot-product attention to aggregate the hidden representations of the neighborhood documents with the topic representation of the central document, obtaining the unnormalized central-document topic representation z̃, which the softmax function then converts into the document-topic distribution θ of the document. The specific process is shown in formula (5):

[Formula (5): equation image not reproduced; attention coefficients α built from a shortest-path-based influence weight and the dot product of the central and neighborhood hidden-layer representations aggregate the neighborhood representations into the unnormalized topic representation z̃, and θ = softmax(z̃).]

where d denotes the central document; N(d) denotes the set of neighborhood documents between which and the central document d a path exists; d_j denotes a neighborhood document of the central document d; LN denotes the standard log-normal distribution; weight denotes the degree of influence of a neighborhood document on the central document; sp(d, d_j) is the length of the shortest path between the central document d and the neighborhood document d_j; the association score denotes the degree of association between the central document and a neighborhood document; h_d^T is the transpose of the hidden-layer representation of the central document; h_{d_j} is the hidden-layer representation of a neighborhood document; α is the attention coefficient between the hidden-layer representation of the central document and those of the neighborhood documents; θ is the document-topic distribution; z̃ is the unnormalized central-document topic representation; and softmax denotes the normalization function.
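
A sketch of the aggregation step, assuming (consistently with the glossary above, though the exact weighting in formula (5) is in the unreproduced image) that each neighborhood document's attention score is its dot product with the central document's representation scaled by an inverse shortest-path influence weight; the 1/sp form is an illustrative assumption:

```python
import torch
import torch.nn.functional as F

def central_document_representation(h_d, h_neighbors, sp_lengths):
    """Aggregate neighborhood representations with dot-product attention (sketch).

    h_d         : (t,) hidden representation of the central document
    h_neighbors : (k, t) hidden representations of the k neighborhood documents
    sp_lengths  : (k,) shortest-path lengths from the central document
    """
    weight = 1.0 / sp_lengths                      # assumed influence weight
    rel = h_neighbors @ h_d                        # dot-product relatedness, shape (k,)
    alpha = F.softmax(weight * rel, dim=0)         # attention coefficients α
    z_tilde = (alpha.unsqueeze(1) * h_neighbors).sum(dim=0)  # unnormalized topic repr. z̃
    theta = F.softmax(z_tilde, dim=0)              # document-topic distribution θ
    return theta
```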

In this step, in the decoding stage, based on the document relationships present in the document network, the variational neighborhood encoder uses the hidden topics of the central document to generate not only the central document itself but also its neighborhood documents, as shown in formula (6):

[Formula (6): equation image not reproduced; a fully connected layer of the decoder regenerates each neighborhood document x̂ from the hidden topic representation.]

where f is the activation function, the weight and bias parameters of the corresponding fully connected layer in the decoder are trainable, and x̂ is a neighborhood document regenerated from the hidden topics. Applying a softmax transformation to the weight and bias parameters in the decoder yields the topic-word distribution β.
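
A matching decoder sketch. The sigmoid output is one plausible choice given that the reconstruction loss in Embodiment 2 is a binomial cross-entropy, and reading β off a softmax over the decoder's weight and bias parameters follows the sentence above; names are illustrative:

```python
import torch
import torch.nn as nn

class VADJEDecoder(nn.Module):
    """Decoder of the variational neighborhood encoder (sketch)."""
    def __init__(self, num_topics, vocab_size):
        super().__init__()
        self.fc = nn.Linear(num_topics, vocab_size)

    def forward(self, theta):
        # Regenerate the central document and, sharing θ, its neighborhood documents.
        return torch.sigmoid(self.fc(theta))   # x̂, suitable for a cross-entropy loss

    def topic_word_distribution(self):
        # Topic-word distribution β via softmax over the decoder's weight and bias parameters.
        return torch.softmax(self.fc.weight.t() + self.fc.bias, dim=1)  # (t, vocab)
```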

In this embodiment, the document network generation step based on the variational neighborhood encoder includes the following.

When generating a document, the corresponding distribution parameters μ and σ are first obtained through the inference networks in the variational neighborhood encoder, namely the mean inference network and the standard-deviation inference network of VADJE.

The topic distribution θ of the document is then generated by reparameterization. For a given text, each word is generated from the word distribution of that text, and the word distribution of the text is obtained from the document's topic distribution θ and the topics' word distribution β; it is a multinomial distribution, i.e. each word is drawn from Mult(θβ).

Here w_d denotes a word of the central document d and Mult denotes the multinomial distribution. When a document link is generated, it is modeled as a Bernoulli binary variable, and the probability that the link exists is computed from the topic distributions of the documents; that is, the link indicator is drawn from a Bernoulli distribution whose parameter is produced by a fully connected layer of the neural network applied to the documents' topic distributions.

Specifically, for each document d:

generate a mean vector μ through the mean inference network;

generate a logarithmized covariance log σ² through the standard-deviation inference network;

generate a sample ε from the multivariate standard normal distribution;

generate the text topic distribution θ by reparameterization from μ, σ and ε;

for each word of the document, generate the word w_d from the multinomial distribution Mult(θβ);

for each pair of documents d and d_j, generate a link from the Bernoulli distribution parameterized by the fully connected layer applied to their topic distributions.

In this embodiment, by respectively determining the document input representation of each document in the document network set, each document can be encoded effectively on the basis of its input representation; by inputting each document's input representation into the pre-trained variational neighborhood encoder for encoding processing, the hidden-layer representation corresponding to each document can be inferred effectively; the representation of the central document is determined effectively from the hidden-layer representations, the document-topic distribution from the representation of the central document, and the topic-word distribution from the document-topic distribution, thereby achieving topic modeling of the documents.

Embodiment 2

Referring to FIG. 2, a flowchart of the document network topic modeling method provided by the second embodiment of the present invention, this embodiment further refines the steps performed before step S20 of the first embodiment and includes the following steps:

Step S40: obtain a sample input representation of each sample document, and input the sample input representation of each sample document into the variational neighborhood encoder for encoding processing to obtain sample inferred distribution parameters.

Based on formulas (1) to (3), the sample input representation of each sample document is obtained, and the sample input representations are input into the variational neighborhood encoder for encoding processing to obtain the sample inferred distribution parameters.

Step S50: determine a sample topic representation according to the sample inferred distribution parameters, and reconstruct each sample document according to the sample topic representation to obtain reconstructed documents.

The sample topic representation is obtained by applying reparameterization and attention-mechanism processing. After reparameterization, the variational neighborhood encoder uses dot-product attention to aggregate the neighborhood documents of each sample document with the sample topic representation to obtain the sample representation; the softmax function then converts the sample representation into the sample document-topic distribution, the sample topic-word distribution is determined from the sample document-topic distribution, and each sample document is reconstructed on the basis of the sample topic-word distribution to obtain the reconstructed documents.

Step S60: determine a prior loss according to the sample inferred distribution parameters of each sample document and the prior normal distribution parameters, and determine a reconstruction loss according to each sample document and the reconstructed documents.

In the model training stage, for each document the loss function of the variational neighborhood encoder is divided into two parts, a reconstruction loss and a prior loss: the reconstruction loss is the binomial cross-entropy between the reconstructed document and the original document, and the prior loss is the KL divergence between the inferred distribution obtained by the inference network and the prior normal distribution, as shown in formula (7):

[Formula (7): equation image not reproduced; it combines the binomial cross-entropy reconstruction loss over the regenerated neighborhood documents with the KL-divergence prior loss into the total loss.]

where d_j is a neighborhood document of each sample document (d itself is also one of its own neighborhood documents); x̂ is a neighborhood document regenerated from the hidden topics; KL(·) denotes the KL divergence between the sample inferred distribution parameters and the prior normal distribution parameters; μ and σ are the mean and variance of the inferred distribution produced by the inference network in the variational neighborhood encoder, the inferred distribution taking the form of a normal distribution with parameters μ and σ; μ₀ and σ₀ are the mean and variance of the prior normal parameters; and N denotes the normal distribution.
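
A sketch of this training loss under the stated reading: binomial (binary) cross-entropy over the regenerated neighborhood documents plus the closed-form KL divergence between the inferred diagonal Gaussian N(μ, σ²) and the prior N(μ₀, σ₀²); the weighting parameter lam mirrors the weight parameter λ of claim 1 and is illustrative:

```python
import math
import torch
import torch.nn.functional as F

def vadje_loss(x_neighbors, x_hat, mu, logvar, mu0=0.0, sigma0=1.0, lam=1.0):
    """Reconstruction + prior loss of the variational neighborhood encoder (sketch).

    x_neighbors : (k, vocab) original neighborhood documents, assumed scaled to [0, 1]
    x_hat       : (k, vocab) documents regenerated from the hidden topics
    mu, logvar  : (t,) inferred distribution parameters
    """
    recon = F.binary_cross_entropy(x_hat, x_neighbors, reduction="sum")
    var, var0 = logvar.exp(), sigma0 ** 2
    # KL( N(mu, var) || N(mu0, var0) ) for diagonal Gaussians, summed over dimensions
    prior = 0.5 * torch.sum(var / var0 + (mu - mu0) ** 2 / var0 - 1.0
                            + math.log(var0) - logvar)
    return recon + lam * prior
```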

Step S70: update the parameters of the variational neighborhood encoder according to the prior loss and the reconstruction loss until the variational neighborhood encoder converges, obtaining the pre-trained variational neighborhood encoder.

If the current iteration count of the variational neighborhood encoder is greater than or equal to an iteration threshold, the variational neighborhood encoder is judged to have converged; the threshold can be set as required.
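
The pre-training loop implied by steps S40-S70 can then be sketched as follows, with convergence judged by the iteration threshold described above; the optimizer choice, learning rate, and the simplified use of softmax(h) in place of the full attention aggregation are illustrative assumptions:

```python
import torch

def pretrain(encoder, decoder, sample_batches, loss_fn, max_iters=1000, lr=1e-3):
    """Pre-train the variational neighborhood encoder (sketch of steps S40-S70)."""
    params = list(encoder.parameters()) + list(decoder.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    for iteration in range(max_iters):            # convergence = iteration threshold reached
        for x_tilde, x_neighbors in sample_batches:
            h, mu, logvar = encoder(x_tilde)      # S40: sample inferred distribution params
            x_hat = decoder(torch.softmax(h, dim=-1))       # S50: reconstruct documents
            loss = loss_fn(x_neighbors, x_hat, mu, logvar)  # S60: prior + reconstruction
            optimizer.zero_grad()
            loss.backward()                       # S70: parameter update
            optimizer.step()
    return encoder, decoder
```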

In this embodiment, by inputting the sample input representation of each sample document into the variational neighborhood encoder for encoding processing, the sample inferred distribution parameters corresponding to each sample document can be obtained effectively; the sample topic representation is determined effectively from the sample inferred distribution parameters, and each sample document is reconstructed effectively on the basis of the sample topic representation to obtain the reconstructed documents. From the sample inferred distribution parameters and the prior normal distribution parameters of each sample document, the prior loss of the variational neighborhood encoder is determined effectively, and from each sample document and the reconstructed documents, the reconstruction loss is determined effectively. Updating the parameters of the variational neighborhood encoder on the basis of the prior loss and the reconstruction loss improves the accuracy of the parameters in the variational neighborhood encoder and thus the accuracy of document network topic modeling.

Embodiment 3

Referring to FIG. 3, a schematic structural diagram of the variational neighborhood encoder provided by the third embodiment of the present invention, the encoder includes:

an input layer for respectively determining the document input representation of each document in the document network set; for each document in the document network, the purpose of the input layer is to obtain the input representation corresponding to that document;

an encoding layer for encoding the document input representation of each document to obtain the hidden-layer representation of each document, and for applying reparameterization and attention-mechanism processing to the hidden-layer representations to obtain the topic representation of the central document;

the encoding layer includes an encoder 10 and a reparameterization layer 11: the encoder 10 encodes the central document through a fully connected layer to infer the hidden-layer representation, the reparameterization layer 11 obtains the topic representation of the central document through reparameterization and the attention mechanism, and the encoder 10 uses the normal distribution as the prior distribution;

an attention layer 12 for aggregating the neighborhood documents of each document with the topic representation of the central document using dot-product attention to obtain the representation of the central document; after reparameterization, the variational neighborhood encoder uses dot-product attention to aggregate the topic representations of the neighborhood documents and the central document to obtain the representation of the central document;

a decoder 13 for determining the document-topic distribution according to the representation of the central document, and determining the topic-word distribution according to the document-topic distribution.
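
Putting the earlier sketches together, a minimal composition of the input representation, encoding layer, attention layer, and decoder might look as follows; it reuses the VADJEEncoder, VADJEDecoder, and central_document_representation sketches from Embodiment 1 and is likewise an assumption-laden sketch, not the patent's reference implementation:

```python
import torch
import torch.nn as nn

class VariationalNeighborhoodEncoder(nn.Module):
    """Input layer -> encoding layer (10, 11) -> attention layer (12) -> decoder (13)."""
    def __init__(self, input_dim, num_topics, vocab_size):
        super().__init__()
        self.encoder = VADJEEncoder(input_dim, num_topics)   # encoding layer
        self.decoder = VADJEDecoder(num_topics, vocab_size)  # decoder

    def forward(self, x_center, x_neighbors, sp_lengths):
        h_d, mu, logvar = self.encoder(x_center)             # hidden-layer representations
        h_n = torch.stack([self.encoder(x)[0] for x in x_neighbors])
        theta = central_document_representation(h_d, h_n, sp_lengths)  # attention layer
        x_hat = self.decoder(theta)                          # regenerated documents
        return x_hat, theta, mu, logvar
```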

In this embodiment, by respectively determining the document input representation of each document in the document network set, each document can be encoded effectively on the basis of its input representation; by inputting each document's input representation into the pre-trained variational neighborhood encoder for encoding processing, the hidden-layer representation corresponding to each document can be inferred effectively; the representation of the central document is determined effectively from the hidden-layer representations, the document-topic distribution from the representation of the central document, and the topic-word distribution from the document-topic distribution, thereby achieving topic modeling of the documents.

Embodiment 4

FIG. 4 is a structural block diagram of a terminal device 2 provided by the fourth embodiment of the present application. As shown in FIG. 4, the terminal device 2 of this embodiment includes a processor 20, a memory 21, and a computer program 22 stored in the memory 21 and executable on the processor 20, for example a program implementing the document network topic modeling method. When the processor 20 executes the computer program 22, the steps in the embodiments of the document network topic modeling method described above are implemented.

Exemplarily, the computer program 22 may be divided into one or more modules, which are stored in the memory 21 and executed by the processor 20 to complete the present application. The one or more modules may be a series of computer program instruction segments capable of completing specific functions, the instruction segments being used to describe the execution process of the computer program 22 in the terminal device 2. The terminal device may include, but is not limited to, the processor 20 and the memory 21.

The processor 20 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.

The memory 21 may be an internal storage unit of the terminal device 2, for example a hard disk or memory of the terminal device 2. The memory 21 may also be an external storage device of the terminal device 2, for example a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the terminal device 2. Further, the memory 21 may include both an internal storage unit and an external storage device of the terminal device 2. The memory 21 is used to store the computer program and other programs and data required by the terminal device, and may also be used to temporarily store data that has been output or is to be output.

In addition, the functional modules in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated module is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium, which may be non-volatile or volatile. Based on this understanding, the present application implements all or part of the processes of the above method embodiments, which may also be completed by instructing the relevant hardware through a computer program. The computer program may be stored in a computer-readable storage medium, and when executed by a processor, implements the steps of the above method embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable storage medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, etc.

The above embodiments are only used to illustrate the technical solutions of the present application and not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all be included within the protection scope of the present application.

Claims (7)

1. A document network topic modeling method, characterized in that the method comprises the following steps:

obtaining a document network set, and respectively determining a document input representation of each document in the document network set;

inputting the document input representation of each document into a pre-trained variational neighborhood encoder for encoding processing to obtain a hidden-layer representation of each document, and determining a representation of the central document according to the hidden-layer representations;

determining a document-topic distribution according to the representation of the central document, and determining a topic-word distribution according to the document-topic distribution;

wherein the formulas used to respectively determine the document input representation of each document in the document network set include:

[Formulas (1)-(3): equation images not reproduced; they define the high-order neighborhood vector from shortest-path lengths, the log-regularized word weights, and the combination of the text vector with the neighborhood vector.]

where V denotes the dictionary formed by the words in the document set; sp(d_i, d_j) denotes the length of the shortest path between documents d_i and d_j; tf_{ij} is the number of occurrences of word w_j in document d_i; x is the text vector; a is the 0-1 neighborhood vector; a^{(h)} is the high-order neighborhood vector; x_{ij} represents the weight of word w_j in document d_i; and d denotes the central document;

wherein before inputting the document input representation of each document into the pre-trained variational neighborhood encoder for encoding processing, the method further comprises:

obtaining a sample input representation of each sample document, and inputting the sample input representation of each sample document into the variational neighborhood encoder for encoding processing to obtain sample inferred distribution parameters;

determining a sample topic representation according to the sample inferred distribution parameters, and reconstructing each sample document according to the sample topic representation to obtain reconstructed documents;

determining a prior loss according to the sample inferred distribution parameters of each sample document and prior normal distribution parameters, and determining a reconstruction loss according to each sample document and the reconstructed documents;

updating parameters of the variational neighborhood encoder according to the prior loss and the reconstruction loss until the variational neighborhood encoder converges, to obtain the pre-trained variational neighborhood encoder;

wherein the formula used for determining the prior loss according to the sample inferred distribution parameters and the prior normal distribution parameters of each sample document, and determining the reconstruction loss according to each sample document and the reconstructed documents, includes:

[Equation image not reproduced; it defines the total loss L from the reconstruction loss L_rec and the prior loss L_prior, weighted by the parameter λ.]

where d_j is a neighborhood document of each sample document; x̂ is a neighborhood document regenerated from the hidden topics; L denotes the total loss; L_rec denotes the reconstruction loss; L_prior denotes the prior loss; λ denotes the weight parameter; w is a word in a sample document; ŵ is a word in a reconstructed sample; KL(·) denotes the KL divergence between the sample inferred distribution parameters and the prior normal distribution parameters; μ and σ are respectively the mean and variance of the inferred distribution produced by the inference network in the variational neighborhood encoder; μ₀ and σ₀ are the mean and variance of the prior normal distribution parameters; and N denotes the normal distribution.
2. The document network topic modeling method according to claim 1, characterized in that determining the representation of the central document according to the hidden-layer representations comprises:

applying reparameterization and attention-mechanism processing to the hidden-layer representations to obtain a topic representation of the central document;

aggregating the neighborhood documents of each document and the topic representation of the central document using a dot-product attention mechanism to obtain the representation of the central document.

3. The document network topic modeling method according to claim 2, characterized in that the formula used for inputting the document input representation of each document into the pre-trained variational neighborhood encoder for encoding processing to obtain the hidden-layer representation of each document includes:

[Equation image not reproduced; the encoding-stage fully connected layers produce the mean μ and the logarithmized variance log σ², and reparameterization yields the hidden representation h = μ + σ ⊙ ε.]

where f denotes the activation function; the weight and bias parameters of the corresponding fully connected layers are training parameters of the encoder in the variational neighborhood encoder, with the weights in R^{t×m} and the biases in R^t; R denotes the space of all real numbers; t is the number of topics; m is the dictionary size; log σ² denotes the logarithmized variance; h denotes the hidden-layer representation of the central document; and ε denotes a sample randomly generated from a multivariate standard normal distribution of the same size as μ and σ.
4. The document network topic modeling method according to claim 3, characterized in that the formula used to determine the document-topic distribution according to the representation of the central document includes:

[Equation image not reproduced; attention coefficients built from a shortest-path-based influence weight and the dot product of hidden-layer representations aggregate the neighborhood representations into the unnormalized topic representation z̃, and θ = softmax(z̃).]

where N(d) denotes the set of neighborhood documents between which and the central document d a path exists; d_j denotes a neighborhood document of the central document d; LN denotes the standard log-normal distribution; weight denotes the degree of influence of a neighborhood document on the central document; sp(d, d_j) is the length of the shortest path between the central document d and the neighborhood document d_j; the association score denotes the degree of association between the central document and a neighborhood document; h_d^T is the transpose of the hidden-layer representation of the central document; h_{d_j} is the hidden-layer representation of a neighborhood document; α is the attention coefficient between the hidden-layer representation of the central document and those of the neighborhood documents; θ is the document-topic distribution; z̃ is the unnormalized central-document topic representation; and softmax denotes the normalization function.
5. A variational neighborhood encoder, characterized in that it is applied to the document network topic modeling method according to any one of claims 1 to 4, the variational neighborhood encoder comprising:

an input layer for respectively determining the document input representation of each document in a document network set;

an encoding layer for encoding the document input representation of each document to obtain the hidden-layer representation of each document, and for applying reparameterization and attention-mechanism processing to the hidden-layer representations to obtain the topic representation of the central document;

an attention layer for aggregating the neighborhood documents of each document and the topic representation of the central document using dot-product attention to obtain the representation of the central document;

a decoder for determining the document-topic distribution according to the representation of the central document, and determining the topic-word distribution according to the document-topic distribution.

6. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 4 when executing the computer program.

7. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 4.
CN202310135750.9A 2023-02-20 2023-02-20 Document Network Topic Modeling Method, Variational Neighborhood Encoder, Terminal and Media Active CN115879515B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310135750.9A CN115879515B (en) 2023-02-20 2023-02-20 Document Network Topic Modeling Method, Variational Neighborhood Encoder, Terminal and Media

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310135750.9A CN115879515B (en) 2023-02-20 2023-02-20 Document Network Topic Modeling Method, Variational Neighborhood Encoder, Terminal and Media

Publications (2)

Publication Number Publication Date
CN115879515A CN115879515A (en) 2023-03-31
CN115879515B true CN115879515B (en) 2023-05-12

Family

ID=85761364

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310135750.9A Active CN115879515B (en) 2023-02-20 2023-02-20 Document Network Topic Modeling Method, Variational Neighborhood Encoder, Terminal and Media

Country Status (1)

Country Link
CN (1) CN115879515B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103970733A (en) * 2014-04-10 2014-08-06 北京大学 New Chinese word recognition method based on graph structure
CN110866958A (en) * 2019-10-28 2020-03-06 清华大学深圳国际研究生院 Method for text to image
CN112836017A (en) * 2021-02-09 2021-05-25 天津大学 An event detection method based on hierarchical topic-driven self-attention mechanism

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2386039B (en) * 2002-03-01 2005-07-06 Fujitsu Ltd Data encoding and decoding apparatus and a data encoding and decoding method
US10346524B1 (en) * 2018-03-29 2019-07-09 Sap Se Position-dependent word salience estimation
CN110457708B (en) * 2019-08-16 2023-05-16 腾讯科技(深圳)有限公司 Vocabulary mining method and device based on artificial intelligence, server and storage medium
CN111949790A (en) * 2020-07-20 2020-11-17 重庆邮电大学 Sentiment classification method based on LDA topic model and hierarchical neural network
CN112199607A (en) * 2020-10-30 2021-01-08 天津大学 Microblog topic mining method based on parallel social context fusion in variable neighborhood
CN113434664B (en) * 2021-06-30 2024-07-16 平安科技(深圳)有限公司 Text abstract generation method, device, medium and electronic equipment
CN114116974A (en) * 2021-11-19 2022-03-01 深圳市东汇精密机电有限公司 Emotional cause extraction method based on attention mechanism
CN114281990A (en) * 2021-12-17 2022-04-05 北京百度网讯科技有限公司 Document classification method and device, electronic equipment and medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103970733A (en) * 2014-04-10 2014-08-06 北京大学 New Chinese word recognition method based on graph structure
CN110866958A (en) * 2019-10-28 2020-03-06 清华大学深圳国际研究生院 Method for text to image
CN112836017A (en) * 2021-02-09 2021-05-25 天津大学 An event detection method based on hierarchical topic-driven self-attention mechanism

Also Published As

Publication number Publication date
CN115879515A (en) 2023-03-31

Similar Documents

Publication Publication Date Title
US11615255B2 (en) Multi-turn dialogue response generation with autoregressive transformer models
US11900056B2 (en) Stylistic text rewriting for a target author
US20240185080A1 (en) Self-supervised data obfuscation in foundation models
US20220269928A1 (en) Stochastic noise layers
CN107402859B (en) Software function verification system and verification method thereof
Capitanelli et al. Fractional equations via convergence of forms
CN114970513A (en) Image generation method, device, equipment and storage medium
CN115879515B (en) Document Network Topic Modeling Method, Variational Neighborhood Encoder, Terminal and Media
CN115169342A (en) Text similarity calculation method and device, electronic equipment and storage medium
CN112835798A (en) Cluster learning method, test step clustering method and related device
CN116127925B (en) Text data enhancement method and device based on destruction processing of text
JP2019021218A (en) Learning device, program parameter, learning method and model
CN115758211B (en) Text information classification method, apparatus, electronic device and storage medium
CN115115920B (en) Graph data self-supervision training method and device
JP7529048B2 (en) Information processing device, information processing method, and program
Yu et al. Sentence encoding with tree-constrained relation networks
JP2021197015A (en) Deduction device
CN119005177B (en) Sequence processing method, electronic device and storage medium
CN113312897B (en) A text summarization method, electronic device and storage medium
US20230259786A1 (en) Obfuscation of encoded data with limited supervision
Haeupler et al. Coding for interactive communication with small memory and applications to robust circuits
Zheng et al. A modified expectation‐maximization algorithm for latent Gaussian graphical model
Malhotra Transfer learning based entropy optimized semi-supervised decomposed vector-quantized variational autoencoder model for multiclass text classification and generation
CN118821849A (en) A model approach to characterizing smart contracts
Rongali et al. Investigating Applications of Run Length Encoding in Data Compression & Source Coding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant