CN115879515B

CN115879515B - Document Network Topic Modeling Method, Variational Neighborhood Encoder, Terminal and Media

Info

Publication number: CN115879515B
Application number: CN202310135750.9A
Authority: CN
Inventors: 刘德喜; 张子靖; 刘嘉鸣; 万齐智; 邓辉
Original assignee: Jiangxi University of Finance and Economics
Current assignee: Jiangxi University of Finance and Economics
Priority date: 2023-02-20
Filing date: 2023-02-20
Publication date: 2023-05-12
Anticipated expiration: 2043-02-20
Also published as: CN115879515A

Abstract

The present invention provides a document network topic modeling method, a variational neighborhood encoder, a terminal and a medium. The method includes: obtaining a document network set, and respectively determining the document input representation of each document in the document network set; The document input representation of the document is input into the pre-trained variational neighborhood encoder for encoding processing to obtain the hidden layer representation of each document, and determine the representation of the central document according to the representation of the hidden layer; determine the representation of the central document according to the representation of the central document -topic distribution, and determine topic-term distribution based on said document-topic distribution. The present invention can effectively determine the representation of the central document based on the hidden layer representation of each document, can effectively determine the document-topic distribution based on the representation of the central document, and can effectively determine the topic-word distribution based on the document-topic distribution, so that To achieve the effect of topic modeling on the document network.

Description

Document network topic modeling method, variational neighborhood encoder, terminal and medium

技术领域Technical Field

本发明涉及主题建模技术领域，尤其涉及一种文档网络主题建模方法、变分邻域编码器、终端及介质。The present invention relates to the technical field of topic modeling, and in particular to a document network topic modeling method, a variational neighborhood encoder, a terminal and a medium.

背景技术Background Art

文档网络是由文档及其关系组成的网络，例如，由学术论文互相引用组成的网络，由网页文本互相链接组成的网络等。文档网络作为文本数据的重要组成部分，获取文档网络中文档的主题能够让人们更好地理解文档的内容分布，因此，如何有效地对文档网络中的文档进行主题建模，是目前亟需解决的问题。A document network is a network composed of documents and their relationships, for example, a network composed of academic papers citing each other, a network composed of web page texts linking each other, etc. As an important part of text data, document networks can help people better understand the content distribution of documents by obtaining the topics of documents in document networks. Therefore, how to effectively perform topic modeling on documents in document networks is a problem that needs to be solved urgently.

发明内容Summary of the invention

本发明实施例的目的在于提供一种文档网络主题建模方法、变分邻域编码器、终端及介质，旨在解决现有技术中，如何能有效地对文档网络中的文档进行主题建模的问题。The purpose of the embodiments of the present invention is to provide a document network topic modeling method, a variational neighborhood encoder, a terminal and a medium, aiming to solve the problem in the prior art of how to effectively perform topic modeling on documents in a document network.

本发明实施例是这样实现的，一种文档网络主题建模方法，所述方法包括：The embodiment of the present invention is implemented as follows: a document network topic modeling method, the method comprising:

获取文档网络集，并分别确定所述文档网络集中各文档的文档输入表示；Obtaining a document network set, and determining a document input representation of each document in the document network set;

将各文档的文档输入表示输入预训练后的变分邻域编码器进行编码处理，得到隐藏层表示，并根据所述隐藏层表示确定中心文档的表示；Inputting the document input representation of each document into the pre-trained variational neighborhood encoder for encoding processing to obtain a hidden layer representation, and determining the representation of the central document according to the hidden layer representation;

根据所述中心文档的表示确定文档-主题分布，并根据所述文档-主题分布确定主题-词分布。A document-topic distribution is determined based on the representation of the central document, and a topic-word distribution is determined based on the document-topic distribution.

进一步地，所述分别确定所述文档网络集中各文档的文档输入表示采用的公式包括：Furthermore, the formula used to respectively determine the document input representation of each document in the document network set includes:

；

;

；

;

；

;

其中，V表示文档集中词构成的词典，

表示文档

与文档

之间最短路径的长度，

为文档

中单词

出现的次数，

是文本向量，

为0-1邻域向量，

是高阶邻域向量，

代表单词

在文档

中的权重，

表示中心文档。Among them, V represents the dictionary composed of words in the document set,

Representation Document

With Documentation

The length of the shortest path between

For Documentation

Chinese words

The number of occurrences,

is a text vector,

is a 0-1 neighborhood vector,

is a high-order neighborhood vector,

Representative words

In the documentation

The weight in

Represents the central document.

进一步地，其特征在于，所述将各文档的文档输入表示输入预训练后的变分邻域编码器进行编码处理之前，还包括：Furthermore, it is characterized in that before the document input representation of each document is input into the pre-trained variational neighborhood encoder for encoding processing, it also includes:

获取各样本文档的样本输入表示，并将各样本文档的样本输入表示输入所述变分邻域编码器进行编码处理，得到样本推断分布参数；Obtaining a sample input representation of each sample document, and inputting the sample input representation of each sample document into the variational neighborhood encoder for encoding processing to obtain a sample inference distribution parameter;

根据所述样本推断分布参数确定样本主题表示，并根据所述样本主题表示对各样本文档进行重构，得到重构文档；Determining a sample topic representation according to the sample inference distribution parameter, and reconstructing each sample document according to the sample topic representation to obtain a reconstructed document;

根据所述各样本文档的样本推断分布参数和先验正态分布参数确定先验损失，并根据各样本文档和所述重构文档确定重构损失；Determine a priori loss according to the sample inference distribution parameter and the prior normal distribution parameter of each sample document, and determine reconstruction loss according to each sample document and the reconstructed document;

根据所述先验损失和所述重构损失对所述变分邻域编码器进行参数更新，直至所述变分邻域编码器收敛，得到预训练后的所述变分邻域编码器。The variational neighborhood encoder is parameter updated according to the prior loss and the reconstruction loss until the variational neighborhood encoder converges, thereby obtaining the pre-trained variational neighborhood encoder.

进一步地，所述根据所述各样本文档的样本推断分布参数和先验正态分布参数确定先验损失，并根据各样本文档和所述重构文档确定重构损失采用的公式包括：Furthermore, the formula used for determining the prior loss according to the sample inference distribution parameters and the prior normal distribution parameters of each sample document and determining the reconstruction loss according to each sample document and the reconstructed document includes:

；

;

其中，

为各样本文档的邻域文档，

为由隐藏主题重新生成的邻域文档，

表示总损失，

表示重构损失，

表示先验损失，

表示权重参数，

为样本文档中的单词，

为重构样本中的单词，KL(·)表示所述样本推断分布参数和先验正态分布参数的KL散度，μ和σ分别为所述变分邻域编码器中推断网络推断出的推断分布的均值与方差，

和

为所述先验正态分布参数的均值与方差，

为正态分布。in,

is the neighborhood document of each sample document,

is the neighborhood document regenerated from the hidden topic,

represents the total loss,

represents the reconstruction loss,

represents the prior loss,

represents the weight parameter,

are the words in the sample documents,

To reconstruct the words in the sample, KL(·) represents the KL divergence between the sample inference distribution parameter and the prior normal distribution parameter, μ and σ are the mean and variance of the inference distribution inferred by the inference network in the variational neighborhood encoder, respectively.

and

are the mean and variance of the prior normal distribution parameters,

is a normal distribution.

进一步地，所述根据所述隐藏层表示确定中心文档的表示，包括：Further, determining the representation of the central document according to the hidden layer representation includes:

对所述隐藏层表示进行重参数化和注意力机制处理，得到所述中心文档的主题表示；Reparameterizing and processing the hidden layer representation with an attention mechanism to obtain a topic representation of the central document;

使用点积注意力机制聚集各文档的邻域文档和所述中心文档的主题表示，得到所述中心文档的表示。The dot product attention mechanism is used to aggregate the neighborhood documents of each document and the topic representation of the central document to obtain the representation of the central document.

进一步地，所述将各文档的文档输入表示输入预训练后的变分邻域编码器进行编码处理，得到隐藏层表示采用的公式包括：Furthermore, the document input representation of each document is input into the pre-trained variational neighborhood encoder for encoding, and the formula used to obtain the hidden layer representation includes:

；

;

其中，

表示激活函数，

均表示所述变分邻域编码器中编码器对应全连接层的训练参数，

，

，

和

，

表示全体实数空间，t为主题个数，m为词典大小，

表示对数化方差，

表示中心文档的隐藏层表示，

表示从规模与μ和σ相同的多元标准正态分布中随机生成得到的样本表示。in,

represents the activation function,

Both represent the training parameters of the encoder corresponding to the fully connected layer in the variational neighborhood encoder,

,

and

,

represents the entire real number space, t is the number of topics, m is the dictionary size,

represents the logarithmic variance,

represents the hidden layer representation of the central document,

represents a sample representation randomly generated from a multivariate standard normal distribution of the same size as μ and σ .

进一步地，所述根据所述中心文档的表示确定文档-主题分布采用的公式包括：Furthermore, the formula used to determine the document-topic distribution according to the representation of the central document includes:

；

;

其中，N(d)表示与中心文档d之间存在路径的邻域文档集合，

表示中心文档d的邻域文档，

表示标准对数正态分布，weight表示邻域文档对中心文档的影响程度，

为中心文档d与邻域文档

之间的最短路径长度，

为中心文档与邻域文档的关联程度，

为中心文档隐藏层表示的转置，

为邻域文档的隐藏层表示，

为中心文档隐藏层表示与邻域文档隐藏层表示的注意力系数，θ为文档-主题分布，

为未归一化的中心文档主题表示，

表示归一化函数。Where N ( d ) represents the set of neighborhood documents that have paths with the central document d .

represents the neighborhood documents of the central document d ,

represents the standard log-normal distribution, weight represents the influence of the neighboring document on the central document,

is the central document d and the neighboring documents

The shortest path length between

is the correlation degree between the central document and the neighboring documents,

is the transpose of the hidden layer representation of the central document,

is the hidden layer representation of the neighborhood documents,

is the attention coefficient between the hidden layer representation of the central document and the hidden layer representation of the neighborhood document, θ is the document-topic distribution,

is the unnormalized central document topic representation,

Represents the normalization function.

本发明实施例的另一目的在于提供一种变分邻域编码器，其中，应用于任意一项所述的文档网络主题建模方法，所述变分邻域编码器包括：Another object of an embodiment of the present invention is to provide a variational neighborhood encoder, wherein the variational neighborhood encoder is applied to any one of the document network topic modeling methods, and comprises:

输入层，用于分别确定文档网络集中各文档的文档输入表示；The input layer is used to determine the document input representation of each document in the document network set respectively;

编码层，用于对各文档的文档输入表示进行编码处理，得到各文档的隐藏层表示，并对所述隐藏层表示进行重参数化与注意力机制处理，得到中心文档的主题表示；The encoding layer is used to encode the document input representation of each document to obtain the hidden layer representation of each document, and to re-parameterize and process the hidden layer representation with an attention mechanism to obtain the topic representation of the central document;

注意力层，用于使用点积注意力聚集各文档的邻域文档与所述中心文档的主题表示，得到所述中心文档的表示；An attention layer, used to aggregate the neighborhood documents of each document and the topic representation of the central document using dot product attention to obtain the representation of the central document;

解码器，用于根据所述中心文档的表示确定文档-主题分布，并根据所述文档-主题分布确定主题-词分布。A decoder is configured to determine a document-topic distribution based on the representation of the central document, and determine a topic-word distribution based on the document-topic distribution.

本发明实施例的另一目的在于提供一种终端设备，包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机程序，所述处理器执行所述计算机程序时实现如上述方法的步骤。Another object of an embodiment of the present invention is to provide a terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above method when executing the computer program.

本发明实施例的另一目的在于提供一种计算机可读存储介质，所述计算机可读存储介质存储有计算机程序，所述计算机程序被处理器执行时实现上述方法的步骤。Another object of an embodiment of the present invention is to provide a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the steps of the above method are implemented.

本发明实施例，通过分别确定文档网络集中各文档的文档输入表示，基于文档输入表示能有效地对各文档起到编码效果，通过将各文档的文档输入表示输入预训练后的变分邻域编码器进行编码处理，能有效地推断出各文档对应的隐藏层表示，基于隐藏层表示能有效地确定到中心文档的表示，基于中心文档的表示能有效地确定到文档-主题分布，基于文档-主题分布能有效地确定到主题-词分布，以达到对文档的主题建模效果。The embodiments of the present invention can effectively encode each document by respectively determining the document input representation of each document in the document network set, and can effectively infer the hidden layer representation corresponding to each document by inputting the document input representation of each document into a pre-trained variational neighborhood encoder for encoding processing, and can effectively determine the representation of the central document based on the hidden layer representation, and can effectively determine the document-topic distribution based on the representation of the central document, and can effectively determine the topic-word distribution based on the document-topic distribution, so as to achieve the topic modeling effect of the document.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

图1是本发明第一实施例提供的文档网络主题建模方法的流程图；FIG1 is a flow chart of a document network topic modeling method provided by a first embodiment of the present invention;

图2是本发明第二实施例提供的文档网络主题建模方法的流程图；2 is a flow chart of a document network topic modeling method provided by a second embodiment of the present invention;

图3是本发明第三实施例提供的变分邻域编码器的结构示意图；3 is a schematic diagram of the structure of a variational neighborhood encoder provided by a third embodiment of the present invention;

图4是本发明第四实施例提供的终端设备的结构示意图。FIG. 4 is a schematic diagram of the structure of a terminal device provided in a fourth embodiment of the present invention.

具体实施方式DETAILED DESCRIPTION

为了使本发明的目的、技术方案及优点更加清楚明白，以下结合附图及实施例，对本发明进行进一步详细说明。应当理解，此处所描述的具体实施例仅仅用以解释本发明，并不用于限定本发明。In order to make the purpose, technical solution and advantages of the present invention more clearly understood, the present invention is further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present invention and are not intended to limit the present invention.

为了说明本发明所述的技术方案，下面通过具体实施例来进行说明。In order to illustrate the technical solution of the present invention, a specific embodiment is provided below for illustration.

实施例一Embodiment 1

请参阅图1，是本发明第一实施例提供的文档网络主题建模方法的流程图，该文档网络主题建模方法可以应用于任一终端设备或系统，该文档网络主题建模方法包括步骤：Please refer to FIG. 1 , which is a flowchart of a document network topic modeling method provided by a first embodiment of the present invention. The document network topic modeling method can be applied to any terminal device or system. The document network topic modeling method includes the following steps:

步骤S10，获取文档网络集，并分别确定所述文档网络集中各文档的文档输入表示；Step S10, obtaining a document network set, and determining the document input representation of each document in the document network set respectively;

其中，给定一个包含n个文档的文档网络集

，由D中的词构成的词典记作V，其中包含m个单词，文档网络集D可表示为文档-词矩阵

。其中，

代表单词

在文档

中的权重(例如：TF-IDF)。Among them, given a document network set containing n documents

, the dictionary composed of words in D is denoted as V , which contains m words. The document network set D can be represented as a document-word matrix

.in,

Representative words

In the documentation

The weights in (e.g. TF-IDF).

D中文档的关系用0-1邻域矩阵表示，即

，其中元素

为1表示

，

存在边，为0表示不存在边，文档网络G表示为

，其中D表示文档网络集，A表示D中文档的邻域矩阵，X表示文档-词矩阵，V表示D中的词构成的词典；The relationship between documents in D is represented by a 0-1 neighborhood matrix, that is,

, where the elements

1 means

,

There is an edge, and 0 means there is no edge. The document network G is represented as

, where D represents the document network set, A represents the neighborhood matrix of the documents in D , X represents the document-word matrix, and V represents the dictionary composed of the words in D ;

该步骤中，为了建模高阶图结构信息，邻域矩阵A中不仅要记录两个文档的直接连接关系，即一阶邻域信息，还应记录二阶甚至更高阶的邻域信息，高阶邻域矩阵

中元素的定义如公式(1)所示。其中，

表示文档

与文档

之间路径的长度；In this step, in order to model high-order graph structure information, the neighborhood matrix A should not only record the direct connection relationship between two documents, that is, the first-order neighborhood information, but also the second-order or even higher-order neighborhood information.

The definition of the elements in is shown in formula (1).

Representation Document

With Documentation

The length of the path between them;

；公式(1)

; Formula (1)

对于文档-词矩阵X，使用对数正则化的方式初始化，如公式(2)所示。其中，

为文档

中单词

出现的次数：For the document-word matrix X , logarithmic regularization is used for initialization, as shown in formula (2).

For Documentation

Chinese words

Number of occurrences:

；公式(2)

; Formula (2)

最后，如公式(3)所示将文本向量x、邻域向量a或高阶邻域向量

结合，得到各文档的文档输入表示：Finally, as shown in formula (3), the text vector x , the neighborhood vector a or the high-order neighborhood vector

Combined, we get the document input representation of each document:

；公式(3)

; Formula (3)

步骤S20，将各文档的文档输入表示输入预训练后的变分邻域编码器进行编码处理，得到各文档的隐藏层表示，并根据所述隐藏层表示确定中心文档的表示；Step S20, inputting the document input representation of each document into the pre-trained variational neighborhood encoder for encoding processing to obtain the hidden layer representation of each document, and determining the representation of the central document according to the hidden layer representation;

可选的，该步骤中，所述根据所述隐藏层表示确定中心文档的表示，包括：Optionally, in this step, determining the representation of the central document according to the hidden layer representation includes:

使用点积注意力机制聚集各文档的邻域文档和所述中心文档的主题表示，得到所述中心文档的表示；Use the dot product attention mechanism to aggregate the neighborhood documents of each document and the topic representation of the central document to obtain the representation of the central document;

其中，预训练后的变分邻域编码器（Variational Adjacent-Encoder, VADJE）通过一个全连接层对中心文档进行编码，推断出各文档的隐藏层表示，然后通过重参数化与注意力机制得到中心文档的隐藏层表示

。The pre-trained variational neighbor encoder (VADJE) encodes the central document through a fully connected layer, infers the hidden layer representation of each document, and then obtains the hidden layer representation of the central document through reparameterization and attention mechanism.

.

本实施例使用正态分布作为先验分布，变分邻域编码器使用反正切函数作为编码阶段全连接层的激活函数，并使用Xavier Glorot对训练参数进行初始化。编码阶段的全连接层与重参数化过程如公式(4)所示：This embodiment uses the normal distribution as the prior distribution, the variational neighborhood encoder uses the inverse tangent function as the activation function of the fully connected layer in the encoding stage, and uses Xavier Glorot to initialize the training parameters. The fully connected layer and reparameterization process in the encoding stage are shown in formula (4):

；公式(4)

; Formula (4)

其中，

表示激活函数，

，

，

和

，

表示全体实数空间，t为主题个数，m为词典大小，

表示对数化方差，

表示中心文档的隐藏层表示，

表示从规模与μ和σ相同的多元标准正态分布中随机生成得到的样本表示。重参数化的思想是通过从标准正态分布

中采样一个变量，再对该变量进行仿射变换得到所需隐变量，用于类VAE模型中反向传播的问题。in,

represents the activation function,

,

and

,

represents the logarithmic variance,

represents the hidden layer representation of the central document,

represents the sample representation randomly generated from a multivariate standard normal distribution of the same size as μ and σ . The idea of reparameterization is to generate

A variable is sampled from the dataset, and then an affine transformation is performed on the variable to obtain the required latent variable, which is used for the back propagation problem in the VAE-like model.

步骤S30，根据所述中心文档的表示确定文档-主题分布，并根据所述文档-主题分布确定主题-词分布；Step S30, determining a document-topic distribution according to the representation of the central document, and determining a topic-word distribution according to the document-topic distribution;

其中，在重参数化后，变分邻域编码器使用点积注意力聚集邻域文档

与中心文档的主题表示

，得到未归一化的中心文档主题表示

，随后用softmax函数将

转换为文档的文档-主题分布θ，具体过程如公式(5)所示：Among them, after reparameterization, the variational neighborhood encoder uses dot product attention to aggregate neighborhood documents

Theme representation with central document

, get the unnormalized central document topic representation

, and then use the softmax function to

Converted to the document-topic distribution θ of the document, the specific process is shown in formula (5):

；公式(5)

; Formula (5)

其中，d表示中心文档，N(d)表示与中心文档d之间存在路径的邻域文档集合，

表示中心文档d的邻域文档，

为中心文档d与邻域文档

之间的最短路径长度，

为中心文档与邻域文档的关联程度，

为中心文档隐藏层表示的转置，

为邻域文档的隐藏层表示，

为未归一化的中心文档主题表示，

表示归一化函数。Where d represents the central document, N ( d ) represents the set of neighborhood documents that have paths with the central document d ,

represents the neighborhood documents of the central document d ,

is the central document d and the neighboring documents

The shortest path length between

is the transpose of the hidden layer representation of the central document,

is the hidden layer representation of the neighborhood documents,

is the unnormalized central document topic representation,

Represents the normalization function.

该步骤中，在解码阶段，变分邻域编码器基于文档网络中存在的文档关系，不仅用中心文档的隐藏主题生成中心文档本身，还生成其邻域文档，如公式(6)所示：In this step, in the decoding stage, the variational neighborhood encoder generates not only the central document itself but also its neighborhood documents using the hidden topics of the central document based on the document relations existing in the document network, as shown in formula (6):

；公式(6)

; Formula (6)

其中，

为激活函数，

和

为解码器中对应全连接层的可训练参数，

为由隐藏主题重新生成的邻域文档，将解码器中的权重与偏置参数进行softmax变化可以得到主题-词分布β。in,

is the activation function,

and

are the trainable parameters of the corresponding fully connected layer in the decoder,

For the neighborhood documents regenerated from the hidden topics, the topic-word distribution β can be obtained by performing softmax transformation on the weights and bias parameters in the decoder.

本实施例中，基于变分邻域编码器的文档网络生成步骤包括：In this embodiment, the document network generation step based on the variational neighborhood encoder includes:

生成文档时，首先通过变分邻域编码器中的推断网络得到对应的分布参数

和

，

与

分别表示VADJE的均值推断网络和标准差推断网络；When generating a document, the corresponding distribution parameters are first obtained through the inference network in the variational neighborhood encoder

and

,

and

Represent the mean inference network and standard deviation inference network of VADJE respectively;

通过重参数化生成文档的主题分布

。对于给定文本，每个单词从对应的文本的词分布中生成，而文本的词分布可由文档的主题分布

与主题的词分布

得到，且为多项分布，即：Generating topic distribution of documents by reparameterization

For a given text, each word is generated from the word distribution of the corresponding text, and the word distribution of the text can be obtained from the topic distribution of the document

Word distribution with topic

We get , and it is a multinomial distribution, that is:

；

;

其中，

表示中心文档d的单词，

表示多项分布，文档连接生成时，将其建模为伯努利二元变量，根据文档的主题分布计算连接存在的概率，即

，其中

表示神经网络的全连接层，

表示伯努利分布。in,

represents the word of the central document d ,

Represents a multinomial distribution. When a document connection is generated, it is modeled as a Bernoulli binary variable, and the probability of the connection existing is calculated based on the topic distribution of the document, that is,

,in

represents the fully connected layer of the neural network,

represents a Bernoulli distribution.

具体的，对于每个文档

；Specifically, for each document

;

生成一个均值向量

：

；Generate a mean vector

:

;

生成对数协方差

：

；Generate log covariance

:

;

生成多元标准正态分布的样本

：

；Generate samples from a multivariate standard normal distribution

:

;

生成文本主题分布

：

；Generating text topic distribution

:

;

对于每一个单词

；For each word

;

生成单词

；Generate words

;

对于每一对文档d和

；For every pair of documents d and

;

生成连接：

。Generate a connection:

.

本实施例中，通过分别确定文档网络集中各文档的文档输入表示，基于文档输入表示能有效地对各文档起到编码效果，通过将各文档的文档输入表示输入预训练后的变分邻域编码器进行编码处理，能有效地推断出各文档对应的隐藏层表示，基于隐藏层表示能有效地确定到中心文档的表示，基于中心文档的表示能有效地确定到文档-主题分布，基于文档-主题分布能有效地确定到主题-词分布，以达到对文档的主题建模效果。In this embodiment, by respectively determining the document input representation of each document in the document network set, each document can be effectively encoded based on the document input representation, and by inputting the document input representation of each document into the pre-trained variational neighborhood encoder for encoding processing, the hidden layer representation corresponding to each document can be effectively inferred, and the representation of the central document can be effectively determined based on the hidden layer representation, and the document-topic distribution can be effectively determined based on the representation of the central document, and the topic-word distribution can be effectively determined based on the document-topic distribution, so as to achieve the topic modeling effect of the document.

实施例二Embodiment 2

请参阅图2，是本发明第二实施例提供的文档网络主题建模方法的流程图，该实施例用于对第一实施例中的步骤S20之前的步骤作进一步细化，包括步骤：Please refer to FIG. 2 , which is a flow chart of a document network topic modeling method provided by a second embodiment of the present invention. This embodiment is used to further refine the steps before step S20 in the first embodiment, including the steps of:

步骤S40，获取各样本文档的样本输入表示，并将各样本文档的样本输入表示输入所述变分邻域编码器进行编码处理，得到样本推断分布参数；Step S40, obtaining a sample input representation of each sample document, and inputting the sample input representation of each sample document into the variational neighborhood encoder for encoding processing to obtain a sample inference distribution parameter;

其中，基于公式（1）至公式（3），分别获取各样本文档的样本输入表示，并将各样本文档的样本输入表示输入变分邻域编码器进行编码处理，得到样本推断分布参数；Wherein, based on formula (1) to formula (3), the sample input representation of each sample document is obtained respectively, and the sample input representation of each sample document is input into the variational neighborhood encoder for encoding processing to obtain the sample inference distribution parameter;

步骤S50，根据所述样本推断分布参数确定样本主题表示，并根据所述样本主题表示对各样本文档进行重构，得到重构文档；Step S50, determining a sample topic representation according to the sample inference distribution parameter, and reconstructing each sample document according to the sample topic representation to obtain a reconstructed document;

其中，通过对样本主题表示进行重参数化和注意力机制处理，得到样本主题表示，在重参数化后，变分邻域编码器使用点积注意力聚集各样本文档的邻域文档和样本主题表示，得到样本表示，随后用softmax函数将样本表示转换为样本文档-主题分布，基于样本文档-主题分布确定样本主题-词分布，基于样本主题-词分布对各样本文档进行重构，得到重构文档；The sample topic representation is obtained by reparameterizing and processing the attention mechanism on the sample topic representation. After reparameterization, the variational neighborhood encoder uses dot product attention to aggregate the neighborhood documents and sample topic representations of each sample document to obtain the sample representation. Then, the sample representation is converted into a sample document-topic distribution using a softmax function. The sample topic-word distribution is determined based on the sample document-topic distribution. Based on the sample topic-word distribution, each sample document is reconstructed to obtain a reconstructed document.

步骤S60，根据所述各样本文档的样本推断分布参数和先验正态分布参数确定先验损失，并根据各样本文档和所述重构文档确定重构损失；Step S60, determining a priori loss according to the sample inference distribution parameter of each sample document and the prior normal distribution parameter, and determining reconstruction loss according to each sample document and the reconstructed document;

其中，在模型训练阶段，对于每个文档，变分邻域编码器的损失函数分为重构损失和先验损失两个部分：重构损失为重构文档与原文档的二项交叉熵，先验损失是推断网络得到的推断分布与先验正态分布之间的KL散度，如公式(7)所示：Among them, in the model training stage, for each document, the loss function of the variational neighborhood encoder is divided into two parts: reconstruction loss and prior loss: the reconstruction loss is the binomial cross entropy between the reconstructed document and the original document, and the prior loss is the KL divergence between the inferred distribution obtained by the inference network and the prior normal distribution, as shown in formula (7):

；公式(7)

; Formula (7)

其中，

为各样本文档的邻域文档（d也是自己的邻域文档之一），

为由隐藏主题重新生成的邻域文档，KL(·)表示所述样本推断分布参数和先验正态分布参数的KL散度，μ和σ为所述变分邻域编码器中推断网络推断出的推断分布的均值与方差，所述推断分布的形式为正态分布，参数包括μ和σ，

和

为所述先验正态参数的均值与方差，

为正态分布。in,

is the neighborhood document of each sample document ( d is also one of its own neighborhood documents),

is a neighborhood document regenerated from a hidden topic, KL(·) represents the KL divergence between the sample inference distribution parameter and the prior normal distribution parameter, μ and σ are the mean and variance of the inference distribution inferred by the inference network in the variational neighborhood encoder, the inference distribution is in the form of a normal distribution, and the parameters include μ and σ ,

and

are the mean and variance of the prior normal parameters,

is a normal distribution.

步骤S70，根据所述先验损失和所述重构损失对所述变分邻域编码器进行参数更新，直至所述变分邻域编码器收敛，得到预训练后的所述变分邻域编码器；Step S70, updating parameters of the variational neighborhood encoder according to the prior loss and the reconstruction loss until the variational neighborhood encoder converges to obtain the pre-trained variational neighborhood encoder;

其中，若变分邻域编码器的当前迭代次数大于或等于次数阈值，则判定该变分邻域编码器收敛，该次数阈值可以根据需求进行设置。If the current iteration number of the variational neighborhood encoder is greater than or equal to the number threshold, the variational neighborhood encoder is determined to be converged, and the number threshold can be set according to requirements.

本实施例中，通过将各样本文档的样本输入表示输入变分邻域编码器进行编码处理，能有效地得到各样本文档对应的样本推断分布参数，基于样本推断分布参数能有效地确定到样本主题表示，基于样本主题表示能有效地对各样本文档进行重构，得到重构文档，基于各样本文档的样本推断分布参数和先验正态分布参数，能有效地确定到变分邻接编码器的先验损失，基于各样本文档和重构文档能有效地确定到变分邻域编码器的重构损失，基于先验损失和重构损失能有效地对变分邻域编码器进行参数更新，提高了变分邻域编码器中参数的准确性，提高了文档网络主题建模的准确性。In this embodiment, by inputting the sample input representation of each sample document into the variational neighborhood encoder for encoding processing, the sample inference distribution parameters corresponding to each sample document can be effectively obtained, and the sample topic representation can be effectively determined based on the sample inference distribution parameters. Based on the sample topic representation, each sample document can be effectively reconstructed to obtain a reconstructed document. Based on the sample inference distribution parameters and the prior normal distribution parameters of each sample document, the prior loss of the variational neighboring encoder can be effectively determined. Based on each sample document and the reconstructed document, the reconstruction loss of the variational neighborhood encoder can be effectively determined. Based on the prior loss and the reconstruction loss, the variational neighborhood encoder parameters can be effectively updated, thereby improving the accuracy of the parameters in the variational neighborhood encoder and improving the accuracy of document network topic modeling.

实施例三Embodiment 3

请参阅图3，是本发明第三实施例提供的变分邻域编码器的结构示意图，包括：Please refer to FIG3 , which is a schematic diagram of the structure of a variational neighborhood encoder provided by a third embodiment of the present invention, including:

输入层，用于分别确定文档网络集中各文档的文档输入表示；其中，对于文档网络中的每一个文档，输入层目的是得到每一个文档对应的输入表示。The input layer is used to determine the document input representation of each document in the document network set respectively; wherein, for each document in the document network, the purpose of the input layer is to obtain the input representation corresponding to each document.

编码层，用于对各文档的文档输入表示进行编码处理，得到各文档的隐藏层表示，并对所述隐藏层表示进行重参数化与注意力机制处理，得到中心文档的主题表示。The encoding layer is used to encode the document input representation of each document to obtain the hidden layer representation of each document, and to re-parameterize and process the hidden layer representation with the attention mechanism to obtain the topic representation of the central document.

其中，编码层包括编码器10和重参数化层11，编码器用于通过一个全连接层对中心文档进行编码，推断出隐藏层表示，重参数化层11用于通过重参数化与注意力机制得到中心文档的主题表示。编码器10中使用正态分布作为先验分布。The encoding layer includes an encoder 10 and a reparameterization layer 11. The encoder is used to encode the central document through a fully connected layer to infer the hidden layer representation. The reparameterization layer 11 is used to obtain the topic representation of the central document through reparameterization and attention mechanism. Normal distribution is used as the prior distribution in the encoder 10.

注意力层12，用于使用点积注意力聚集各文档的邻域文档与所述中心文档的主题表示，得到所述中心文档的表示；其中，在重参数化后，变分邻域编码器使用点积注意力聚集邻域文档与中心文档的主题表示，得到中心文档的表示。The attention layer 12 is used to use dot product attention to aggregate the neighborhood documents of each document and the topic representation of the central document to obtain the representation of the central document; wherein, after reparameterization, the variational neighborhood encoder uses dot product attention to aggregate the topic representation of the neighborhood documents and the central document to obtain the representation of the central document.

解码器13，用于根据所述中心文档的表示确定文档-主题分布，并根据所述文档-主题分布确定主题-词分布。The decoder 13 is used to determine the document-topic distribution according to the representation of the central document, and determine the topic-word distribution according to the document-topic distribution.

本实施例，通过分别确定文档网络集中各文档的文档输入表示，基于文档输入表示能有效地对各文档起到编码效果，通过将各文档的文档输入表示输入预训练后的变分邻域编码器进行编码处理，能有效地推断出各文档对应的隐藏层表示，基于隐藏层表示能有效地确定到中心文档的表示，基于中心文档的表示能有效地确定到文档-主题分布，基于文档-主题分布能有效地确定到主题-词分布，以达到对文档的主题建模效果。In this embodiment, by respectively determining the document input representation of each document in the document network set, each document can be effectively encoded based on the document input representation, and the hidden layer representation corresponding to each document can be effectively inferred by inputting the document input representation of each document into the pre-trained variational neighborhood encoder for encoding processing. Based on the hidden layer representation, the representation of the central document can be effectively determined, based on the representation of the central document, the document-topic distribution can be effectively determined, and based on the document-topic distribution, the topic-word distribution can be effectively determined, so as to achieve the topic modeling effect of the document.

实施例四Embodiment 4

图4是本申请第四实施例提供的一种终端设备2的结构框图。如图4所示，该实施例的终端设备2包括：处理器20、存储器21以及存储在所述存储器21中并可在所述处理器20上运行的计算机程序22，例如文档网络主题建模方法的程序。处理器20执行所述计算机程序22时实现上述各个文档网络主题建模方法各实施例中的步骤。FIG4 is a block diagram of a terminal device 2 provided in the fourth embodiment of the present application. As shown in FIG4, the terminal device 2 of this embodiment includes: a processor 20, a memory 21, and a computer program 22 stored in the memory 21 and executable on the processor 20, such as a program of a document network topic modeling method. When the processor 20 executes the computer program 22, the steps in each embodiment of the above-mentioned document network topic modeling method are implemented.

示例性的，所述计算机程序22可以被分割成一个或多个模块，所述一个或者多个模块被存储在所述存储器21中，并由所述处理器20执行，以完成本申请。所述一个或多个模块可以是能够完成特定功能的一系列计算机程序指令段，该指令段用于描述所述计算机程序22在所述终端设备2中的执行过程。所述终端设备可包括，但不仅限于，处理器20、存储器21。Exemplarily, the computer program 22 may be divided into one or more modules, which are stored in the memory 21 and executed by the processor 20 to complete the present application. The one or more modules may be a series of computer program instruction segments capable of completing specific functions, which are used to describe the execution process of the computer program 22 in the terminal device 2. The terminal device may include, but is not limited to, a processor 20 and a memory 21.

所称处理器20可以是中央处理单元（Central Processing Unit，CPU），还可以是其他通用处理器、数字信号处理器（Digital Signal Processor，DSP）、专用集成电路（Application Specific Integrated Circuit，ASIC）、现成可编程门阵列（Field-Programmable Gate Array，FPGA）或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。The processor 20 may be a central processing unit (CPU), or other general-purpose processors, digital signal processors (DSP), application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor may be a microprocessor or any conventional processor, etc.

所述存储器21可以是所述终端设备2的内部存储单元，例如终端设备2的硬盘或内存。所述存储器21也可以是所述终端设备2的外部存储设备，例如所述终端设备2上配备的插接式硬盘，智能存储卡（Smart Media Card，SMC），安全数字（Secure Digital，SD）卡，闪存卡（Flash Card）等。进一步地，所述存储器21还可以既包括所述终端设备2的内部存储单元也包括外部存储设备。所述存储器21用于存储所述计算机程序以及所述终端设备所需的其他程序和数据。所述存储器21还可以用于暂时地存储已经输出或者将要输出的数据。The memory 21 may be an internal storage unit of the terminal device 2, such as a hard disk or memory of the terminal device 2. The memory 21 may also be an external storage device of the terminal device 2, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, etc. equipped on the terminal device 2. Further, the memory 21 may also include both an internal storage unit and an external storage device of the terminal device 2. The memory 21 is used to store the computer program and other programs and data required by the terminal device. The memory 21 may also be used to temporarily store data that has been output or is to be output.

另外，在本申请各个实施例中的各功能模块可以集成在一个处理单元中，也可以是各个单元单独物理存在，也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现，也可以采用软件功能单元的形式实现。In addition, each functional module in each embodiment of the present application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The above integrated unit can be implemented in the form of hardware or in the form of software functional units.

集成的模块如果以软件功能单元的形式实现并作为独立的产品销售或使用时，可以存储在一个计算机可读存储介质中。其中，计算机可读存储介质可以是非易失性的，也可以是易失性的。基于这样的理解，本申请实现上述实施例方法中的全部或部分流程，也可以通过计算机程序来指令相关的硬件来完成，计算机程序可存储于一计算机可读存储介质中，该计算机程序在被处理器执行时，可实现上述各个方法实施例的步骤。其中，计算机程序包括计算机程序代码，计算机程序代码可以为源代码形式、对象代码形式、可执行文件或某些中间形式等。计算机可读存储介质可以包括：能够携带计算机程序代码的任何实体或装置、记录介质、U盘、移动硬盘、磁碟、光盘、计算机存储器、只读存储器（ROM，Read-OnlyMemory）、随机存取存储器（RAM，Random Access Memory）、电载波信号、电信信号以及软件分发介质等。If the integrated module is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Among them, the computer-readable storage medium can be non-volatile or volatile. Based on this understanding, the present application implements all or part of the processes in the above-mentioned embodiment method, and can also be completed by instructing the relevant hardware through a computer program. The computer program can be stored in a computer-readable storage medium, and the computer program can implement the steps of the above-mentioned various method embodiments when executed by the processor. Among them, the computer program includes computer program code, and the computer program code can be in source code form, object code form, executable file or some intermediate form. The computer-readable storage medium may include: any entity or device capable of carrying computer program code, recording medium, U disk, mobile hard disk, disk, optical disk, computer memory, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), electrical carrier signal, telecommunication signal and software distribution medium, etc.

以上所述实施例仅用以说明本申请的技术方案，而非对其限制；尽管参照前述实施例对本申请进行了详细的说明，本领域的普通技术人员应当理解：其依然可以对前述各实施例所记载的技术方案进行修改，或者对其中部分技术特征进行等同替换；而这些修改或者替换，并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围，均应包含在本申请的保护范围之内。The embodiments described above are only used to illustrate the technical solutions of the present application, rather than to limit them. Although the present application has been described in detail with reference to the aforementioned embodiments, a person skilled in the art should understand that the technical solutions described in the aforementioned embodiments may still be modified, or some of the technical features may be replaced by equivalents. Such modifications or replacements do not deviate the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of the present application, and should all be included in the protection scope of the present application.

Claims

1. A document network topic modeling method, characterized in that the method comprises the following steps:

Obtaining a document network set, and determining a document input representation of each document in the document network set;

Inputting the document input representation of each document into the pre-trained variational neighborhood encoder for encoding processing to obtain a hidden layer representation of each document, and determining the representation of the central document according to the hidden layer representation;

Determining a document-topic distribution based on the representation of the central document, and determining a topic-word distribution based on the document-topic distribution;

The formula used to respectively determine the document input representation of each document in the document network set includes:

;

;

;

Among them, V represents the dictionary composed of words in the document set,

Representation Document

With Documentation

The length of the shortest path between

For Documentation

Chinese words

The number of occurrences,

is a text vector,

is a 0-1 neighborhood vector,

is a high-order neighborhood vector,

Representative words

In the documentation

The weight in

Represents the central document;

Before inputting the document input representation of each document into the pre-trained variational neighborhood encoder for encoding, the method further includes:

Obtaining a sample input representation of each sample document, and inputting the sample input representation of each sample document into the variational neighborhood encoder for encoding processing to obtain a sample inference distribution parameter;

Determining a sample topic representation according to the sample inference distribution parameter, and reconstructing each sample document according to the sample topic representation to obtain a reconstructed document;

Determine a priori loss according to the sample inference distribution parameter and the prior normal distribution parameter of each sample document, and determine reconstruction loss according to each sample document and the reconstructed document;

updating parameters of the variational neighborhood encoder according to the prior loss and the reconstruction loss until the variational neighborhood encoder converges to obtain the pre-trained variational neighborhood encoder;

The formula used for determining the prior loss according to the sample inference distribution parameters and the prior normal distribution parameters of each sample document, and determining the reconstruction loss according to each sample document and the reconstructed document includes:

;

in,

is the neighborhood document of each sample document,

is the neighborhood document regenerated from the hidden topic,

represents the total loss,

represents the reconstruction loss,

represents the prior loss,

represents the weight parameter,

are the words in the sample documents,

To reconstruct the words in the sample, KL(·) represents the KL divergence between the sample inference distribution parameter and the prior normal distribution parameter,

and

are respectively the mean and variance of the inferred distribution inferred by the inference network in the variational neighborhood encoder,

and

are the mean and variance of the prior normal distribution parameters,

is a normal distribution.

2. The document network topic modeling method according to claim 1, wherein determining the representation of the central document according to the hidden layer representation comprises:

Reparameterizing and processing the hidden layer representation with an attention mechanism to obtain a topic representation of the central document;

The dot product attention mechanism is used to aggregate the neighborhood documents of each document and the topic representation of the central document to obtain the representation of the central document.

3. The document network topic modeling method according to claim 2, characterized in that the document input representation of each document is input into the pre-trained variational neighborhood encoder for encoding processing, and the formula used to obtain the hidden layer representation of each document includes:

;

in,

represents the activation function,

,

,

and

,

represents the space of all real numbers,

is the number of topics,

is the dictionary size,

represents the logarithmic variance,

represents the hidden layer representation of the central document,

4. The document network topic modeling method according to claim 3, characterized in that the formula used to determine the document-topic distribution based on the representation of the central document includes:

;

in,

Representation and central documentation

A collection of neighboring documents with paths between them,

Representation Center Document

Neighborhood documents,

represents the standard lognormal distribution,

Indicates the degree of influence of the neighborhood document on the central document,

Central Document

With Neighborhood Documents

The shortest path length between

is the transpose of the hidden layer representation of the central document,

is the hidden layer representation of the neighborhood documents,

is the attention coefficient between the hidden layer representation of the central document and the hidden layer representation of the neighboring document,

is the document-topic distribution,

is the unnormalized central document topic representation,

Represents the normalization function.

5. A variational neighborhood encoder, characterized in that it is applied to the document network topic modeling method according to any one of claims 1 to 4, and the variational neighborhood encoder comprises:

The input layer is used to determine the document input representation of each document in the document network set respectively;

The encoding layer is used to encode the document input representation of each document to obtain the hidden layer representation of each document, and to re-parameterize and process the hidden layer representation with an attention mechanism to obtain the topic representation of the central document;

An attention layer, used to aggregate the neighborhood documents of each document and the topic representation of the central document using dot product attention to obtain the representation of the central document;

A decoder is configured to determine a document-topic distribution based on the representation of the central document, and determine a topic-word distribution based on the document-topic distribution.

6. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method according to any one of claims 1 to 4 when executing the computer program.

7. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 4.