CN113190662A - Topic segmentation method based on discourse structure diagram network - Google Patents
- Publication number: CN113190662A (application CN202110384669.5A)
- Authority
- CN
- China
- Prior art keywords: sentence, word, nodes, matrix, network
- Legal status: Pending (status as listed on the published record; not a legal conclusion)
Classifications
- G06F16/3329—Natural language query formulation or dialogue systems
- G06F16/3344—Query execution using natural language analysis
- G06F40/30—Semantic analysis
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
Abstract
The invention relates to a topic segmentation method based on a discourse structure graph network, which comprises the following steps. Step 1: construct a discourse structure graph with words and sentences as nodes, and obtain the feature matrix and adjacency matrix of the graph. Step 2: perform an integrated calculation on the feature matrix and the adjacency matrix to obtain a normalized symmetric adjacency matrix. Step 3: iterate the structure graph with a gated graph neural network, generating indirect information interaction between sentence nodes through shared adjacent word nodes to obtain a sequence of sentence vector representations. Step 4: predict segmentation points over the sentence vector sequence with a bidirectional long short-term memory network model. The invention simultaneously addresses two shortcomings of the prior art: insufficient capability to model global semantic information, and high computational complexity.
Description
Technical Field
The invention relates to the technical field of topic segmentation in data mining, in particular to a topic segmentation method based on a discourse structure diagram network.
Background
In recent years, in pursuit of a deeper understanding of natural language, research emphasis in natural language processing has gradually shifted from the level of characters, words, and sentences to larger-granularity semantic units such as paragraphs and chapters. Against this background, topic segmentation has developed rapidly and is now one of the most active research directions in the field.
In real life, many chapters are long and have no explicit topic division (news manuscripts, court records, clinical reports, and the like). Faced with such a chapter, readers find it difficult to quickly grasp its overall content or trace its overall structure. Topic segmentation addresses this problem: the task aims to segment the input chapter into paragraphs, each with a well-defined topic, so that readers can quickly grasp the chapter's topic distribution and browse the content they are interested in. The topic segmentation models in the prior art include the following two types:
1. Sequence model (TextSeg)

Consider a chapter containing n sentences. The sequence model converts each sentence into a probability value indicating whether the current sentence is a topic transition point. The first step the sequence model performs is to convert each sentence into a vector. Specifically, for a sentence containing m words, the sequence model first applies a word embedding model, widely used in natural language processing, which converts each word into a 300-dimensional vector; the sentence can then be represented as a matrix of size m x 300. Since each sentence is composed of words in a certain order, the sequence model further captures the sequential relationships between the words of the sentence. It does so with a bidirectional long short-term memory (Bi-LSTM) network; assuming the output dimension of the network is h, the sentence is now represented as a matrix of size m x h. For each column of this matrix, the sequence model retains only the maximum over the m rows, converting the m x h sentence matrix into an h-dimensional vector. This process is called max-pooling and is widely used to obtain a vector representation of a sentence. The whole process of obtaining a sentence's vector representation is called sentence encoding (sentence embedding).
Through sentence encoding, the sequence model converts each sentence into an h-dimensional vector. Within a chapter, a single sentence carries too little information to predict on its own whether it is a topic transition point, so the sequence model must further capture the sequential relationships between the sentences of the chapter. It again uses a bidirectional long short-term memory network for this; assuming its output dimension is h', each sentence is represented as an h'-dimensional vector that contains not only the information of the sentence itself but also fuses information from the other sentences of the chapter.

Finally, the sequence model uses a linear layer to convert each h'-dimensional sentence vector into a probability value representing whether the current sentence is a topic transition point.
The disadvantage of the sequence model is this: it uses bidirectional long short-term memory networks to capture the sequential relations among words within sentences and among sentences within chapters, but the relations among words are not merely sequential, since every sentence has a natural grammatical structure; likewise, the relations among sentences within a chapter are complex, not a simple sequence. In other words, the semantic information modeling capability of the sequence model is weak.
2. Global model (Hier. BERT)
The biggest difference between the global model and the sequence model is that the global model uses a transformer model in place of the bidirectional long short-term memory network. As described above, the Bi-LSTM captures the sequential relationships between words within sentences and between sentences within chapters. Specifically, for a sentence, an LSTM takes the words of the sentence as input in order from left to right; when the current input is the i-th word, the memory cells of the network retain the information of all preceding words, so the output for the current word contains not only its own information but also the fused information of all words before it. A bidirectional LSTM runs two LSTMs over the sentence, one left to right and one right to left, so the output vector of each word contains the information of the word itself, of all preceding words, and of all following words. The transformer differs: for a word in a sentence, it directly and globally fuses the information of all other words, without being restricted to left-to-right or right-to-left sequential operations.
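The contrast can be illustrated with a minimal scaled dot-product self-attention sketch, in which each word's output directly mixes all other words' vectors with no left-to-right order; the dimensions and random inputs are arbitrary illustrations, not the global model's actual configuration:

```python
import numpy as np

def self_attention(X: np.ndarray) -> np.ndarray:
    """Scaled dot-product self-attention: every word's output directly
    fuses information from all words, with no sequential scanning."""
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)                   # pairwise word-word affinities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over all words
    return weights @ X                              # each output fuses every input

rng = np.random.default_rng(1)
words = rng.standard_normal((5, 8))  # a 5-word sentence, 8-dim word vectors
out = self_attention(words)
print(out.shape)  # (5, 8)
```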
Although a global model built on the transformer can to some extent solve the sequence model's insufficient semantic modeling capability, it has high computational complexity and requires substantial computational resources when applied to the topic segmentation task. In other words, most current hardware cannot support feeding a complete chapter directly into the model; sacrifices must be made, such as truncating overlong sentences and splitting overlong chapters into multiple parts.
Disclosure of Invention
Therefore, the technical problem to be solved by the invention is to overcome the shortcomings of prior-art models on the topic segmentation task, and to provide a topic segmentation method based on a discourse structure graph network that simultaneously solves the prior art's insufficient global semantic information modeling capability and high computational complexity.
In order to solve the technical problem, the invention provides a topic segmentation method based on a discourse structure graph network, comprising the following steps:

Step 1: construct a discourse structure graph with words and sentences as nodes, and obtain the feature matrix and adjacency matrix of the graph;

Step 2: perform an integrated calculation on the feature matrix and the adjacency matrix to obtain a normalized symmetric adjacency matrix;

Step 3: iterate the structure graph with a gated graph neural network, generating indirect information interaction between sentence nodes through shared adjacent word nodes to obtain a sequence of sentence vector representations;

Step 4: predict segmentation points over the sentence vector sequence with a bidirectional long short-term memory network model.
In one embodiment of the present invention, in step 1, word nodes and sentence nodes are obtained by using words and sentences as nodes, respectively, and vectorization processing is performed on the word nodes and the sentence nodes.
In one embodiment of the present invention, in step 1, the vectorization process for word nodes employs a word embedding method.
In one embodiment of the invention, in step 1, the vectorization of sentence nodes employs the max-pooling method.
In one embodiment of the present invention, the feature matrix of the structure graph in step 1 is X ∈ R^(n+ × m), where n+ represents the total number of word and sentence nodes and m represents the feature dimension of a node. The first n rows of X store the n sentence nodes, and word nodes are stored from row n+1 onward. Each sentence is expressed as

s_i = (w_{i,1}, w_{i,2}, …, w_{i,l_i})

where s_i denotes the word sequence of the i-th sentence and l_i denotes its length.
In one embodiment of the present invention, the structure diagram in step 1 includes three adjacent edges: edges among word nodes, edges among word nodes and sentence nodes, and self-looping edges;
for the edges between the word nodes, the PMI indexes are adopted to measure the weight of the edges;
for the edge between the word node and the sentence node, in one sentence node, a connecting edge with the weight of 1 exists between the sentence node and each word node contained in the sentence node;
for the self-loop edge, all word nodes and sentence nodes are provided with a self-loop edge with the weight of 1;
the adjacency matrix is represented by:
in an embodiment of the present invention, the calculation formula of the PMI index is:
where, # W denotes the number of all the sliding windows, # W (i) denotes the number of windows in which the word i appears, # W (i, j) denotes the number of windows in which the word i and the word j appear simultaneously, # W (i), p (j) denote the frequency of appearance of the word i and the word j, respectively, in all the sliding windows, and p (i, j) denotes the frequency of appearance of the word i and the word j, respectively, in all the sliding windows.
In one embodiment of the present invention, the normalized symmetric adjacency matrix in step 2 is calculated as follows.

A degree matrix is computed from the adjacency matrix:

D_{ii} = Σ_j A_{ij}

The degree matrix D and the adjacency matrix A then yield the normalized symmetric adjacency matrix:

Ã = D^(−1/2) A D^(−1/2)
in an embodiment of the present invention, in step 3, the calculation process for iterating the structure map based on the gated map neural network is as follows:
zt=σ(Wzat+Uzht-1+bz)
rt=σ(Wrat+Urht-1+br)
wherein h is0Where σ is a sigmoid function, all W, U and b are trainable parameters, and z and r represent update gates, respectively, to determine how much adjacency information contributes to the current node.
In an embodiment of the present invention, in step 4, the sentence vector sequence is fed into a bidirectional long short-term memory network model for training; a fully connected layer and a normalization layer map each sentence's hidden vector to a probability value between 0 and 1. A probability below 0.5 indicates that the current sentence is not a topic transition point; a probability above 0.5 indicates that a topic transition occurs at the current sentence.
Compared with the prior art, the technical scheme of the invention has the following advantages:
the topic segmentation method based on the discourse structure graph network uses a graph neural network in a topic segmentation task, firstly, each discourse is independently constructed into a structure graph, and the structure graph comprises words, sentence nodes and adjacent relations among the word nodes, the words and the sentence nodes; then, the constructed structure graph is used as input, the gate control graph neural network is used for iteration, and indirect information interaction is generated among sentence nodes through common adjacent word nodes; finally, sentence vector representation with global semantic information is obtained and sent to a bidirectional long-time and short-time memory network for prediction of segmentation points, and the problems of insufficient global semantic information modeling capability and high calculation complexity in the prior art can be solved at the same time.
Drawings
In order that the present disclosure may be more readily and clearly understood, reference is now made to the following detailed description of the present disclosure taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flowchart illustrating the steps of the topic segmentation method based on the discourse structure graph network according to the present invention;
FIG. 2 is a structural flow chart of the topic segmentation method based on the discourse structure graph network of the present invention;
Detailed Description
The present invention is further described below in conjunction with the following figures and specific examples so that those skilled in the art may better understand the present invention and practice it, but the examples are not intended to limit the present invention.
Referring to FIG. 1, the topic segmentation method based on the discourse structure graph network of the present invention comprises the following steps:

Step 1: construct a discourse structure graph with words and sentences as nodes, and obtain the feature matrix and adjacency matrix of the graph;

Step 2: perform an integrated calculation on the feature matrix and the adjacency matrix to obtain a normalized symmetric adjacency matrix;

Step 3: iterate the structure graph with a gated graph neural network, generating indirect information interaction between sentence nodes through shared adjacent word nodes to obtain a sequence of sentence vector representations;

Step 4: predict segmentation points over the sentence vector sequence with a bidirectional long short-term memory network model.
Referring to FIG. 2, the input in this embodiment is a chapter containing n sentences, viewed as a sentence sequence (s_1, …, s_i, …, s_n). A discourse structure graph is constructed with words and sentences as nodes; the graph is iterated by a gated graph neural network (GGNN), indirect information interaction between sentence nodes is generated through shared adjacent word nodes, and a sentence vector representation sequence (e_1, …, e_i, …, e_n) is obtained. This sequence is fed into a bidirectional long short-term memory network model (Bi-LSTM) for training, yielding the hidden vector sequence (h_1, …, h_i, …, h_n). Finally, a fully connected layer and a normalization layer (softmax) map each sentence's hidden vector to a probability value between 0 and 1: a probability below 0.5 indicates that the current sentence is not a topic transition point, while a probability above 0.5 indicates that a topic transition occurs at the current sentence.
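The final prediction step (fully connected layer plus softmax, thresholded at 0.5) can be sketched as follows; the random weights and dimensions below are stand-ins for trained parameters, not the patented implementation:

```python
import numpy as np

def predict_boundaries(H: np.ndarray, W: np.ndarray, b: np.ndarray) -> list:
    """Map each sentence's hidden vector h_i to a boundary probability
    via a linear layer + two-class softmax; probability > 0.5 marks a
    topic transition at that sentence."""
    logits = H @ W + b                               # (n, 2) class scores
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)         # softmax normalization
    return [bool(p > 0.5) for p in probs[:, 1]]

rng = np.random.default_rng(2)
n, h = 6, 10                                         # 6 sentences, 10-dim hidden vectors
H = rng.standard_normal((n, h))                      # stand-in for Bi-LSTM hidden states
W, b = rng.standard_normal((h, 2)), np.zeros(2)      # stand-in trained parameters
flags = predict_boundaries(H, W, b)
print(flags)                                         # one boolean per sentence
```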
Specifically, in step 1, words and sentences are taken as nodes to obtain word nodes and sentence nodes, respectively, and both are vectorized. The vectorization of word nodes uses a word embedding method, which alleviates the cold-start problem and brings more accurate semantic information; the vectorization of sentence nodes uses the max-pooling method.
Specifically, the feature matrix of the structure graph in step 1 is X ∈ R^(n+ × m), where n+ represents the total number of word and sentence nodes and m represents the feature dimension of a node. The first n rows of X store the n sentence nodes, and word nodes are stored from row n+1 onward. Each sentence is expressed as

s_i = (w_{i,1}, w_{i,2}, …, w_{i,l_i})

where s_i denotes the word sequence of the i-th sentence and l_i denotes its length.
The structure graph in step 1 contains three types of edges: edges between word nodes, edges between word nodes and sentence nodes, and self-loop edges.

For edges between word nodes, the PMI (pointwise mutual information) index is used as the edge weight. Specifically, the model slides a fixed-size window across all sentences of the chapter to compute the PMI index between words. For a given word pair <i, j>, the PMI index is calculated as

PMI(i, j) = log( p(i, j) / ( p(i) p(j) ) ),  p(i) = #W(i) / #W,  p(i, j) = #W(i, j) / #W,

where #W denotes the total number of sliding windows, #W(i) denotes the number of windows in which word i appears, and #W(i, j) denotes the number of windows in which words i and j appear simultaneously; p(i) and p(j) are thus the occurrence frequencies of words i and j over all sliding windows, and p(i, j) is the frequency with which words i and j co-occur in a window. Since a negative PMI index indicates very low semantic relevance, the model only retains edges between word nodes whose PMI index is positive.
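The window counting behind the PMI index can be sketched as follows; the exact window handling (one window per position within each sentence) and the toy corpus are illustrative assumptions, not the patent's precise procedure:

```python
import numpy as np
from itertools import combinations

def pmi_scores(sentences, window=3):
    """Slide a fixed-size window over each sentence, count single and
    joint window occurrences, and return positive-PMI word pairs:
    PMI(i, j) = log( p(i, j) / (p(i) p(j)) )."""
    single, pair, total = {}, {}, 0
    for words in sentences:
        spans = [words[k:k + window] for k in range(max(1, len(words) - window + 1))]
        for span in spans:
            total += 1
            for w in set(span):
                single[w] = single.get(w, 0) + 1
            for a, b in combinations(sorted(set(span)), 2):
                pair[(a, b)] = pair.get((a, b), 0) + 1
    scores = {}
    for (a, b), n_ab in pair.items():
        # p(i,j)/(p(i)p(j)) simplifies to #W(i,j) * #W / (#W(i) * #W(j)).
        val = np.log(n_ab * total / (single[a] * single[b]))
        if val > 0:                     # only positive-PMI edges are kept
            scores[(a, b)] = val
    return scores

docs = [["graph", "network"], ["graph", "network"], ["topic", "split"]]
print(pmi_scores(docs))
```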
For an edge between a word node and a sentence node, there is a connecting edge with a weight of 1 between the sentence node and each word node contained in the sentence node.
For self-loop edges, every word node and sentence node carries a self-loop edge of weight 1, so that each node can attend to the information of its adjacent nodes while also retaining the information it has already learned.

The adjacency matrix is therefore expressed as:

A_{ij} = PMI(i, j)  if i and j are word nodes and PMI(i, j) > 0;
A_{ij} = 1          if one of i, j is a sentence node and the other is a word node it contains;
A_{ij} = 1          if i = j;
A_{ij} = 0          otherwise.
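The assembly of the three edge types can be sketched in numpy; the node indexing convention (sentence nodes first, then word nodes) follows the feature matrix described above, while the toy sizes and PMI values are illustrative assumptions:

```python
import numpy as np

def build_adjacency(n_sent, n_word, sent_words, pmi):
    """Assemble the (n+ x n+) adjacency matrix with the three edge
    types: weight-1 self-loops, weight-1 sentence-word edges, and
    positive-PMI word-word edges.  Sentence nodes occupy indices
    0..n_sent-1; word nodes follow."""
    size = n_sent + n_word
    A = np.eye(size)                              # self-loop edges
    for s, words in sent_words.items():           # sentence-word edges
        for w in words:
            A[s, n_sent + w] = A[n_sent + w, s] = 1.0
    for (i, j), val in pmi.items():               # word-word PMI edges
        A[n_sent + i, n_sent + j] = A[n_sent + j, n_sent + i] = val
    return A

# Hypothetical toy graph: 2 sentences over 3 word types.
sent_words = {0: [0, 1], 1: [1, 2]}               # sentence -> word indices
pmi = {(0, 1): 0.4, (1, 2): 0.7}                  # positive-PMI word pairs only
A = build_adjacency(2, 3, sent_words, pmi)
print(A.shape)  # (5, 5)
```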
specifically, the process of calculating the normalized symmetric adjacency matrix in step 2 is as follows:
and calculating a degree matrix according to the adjacency matrix:
and (3) calculating by adopting the degree matrix D and the adjacency matrix A to obtain a normalized symmetric adjacency matrix:
specifically, in step 3, the calculation process of iterating the structure map based on the gated graph neural network is as follows:
zt=σ(Wzat+Uzht-1+bz)
rt=σ(Wrat+Urht-1+br)
wherein h is0Where σ is a sigmoid function, all W, U and b are trainable parameters, and z and r represent update gates, respectively, to determine how much adjacency information contributes to the current node.
After one round of network iteration, a sentence node aggregates the information of its adjacent word nodes, and a word node aggregates the information of other adjacent word nodes with high semantic relevance. After two or more rounds of iteration, even sentence nodes that share no direct edge exchange information indirectly through their shared adjacent word nodes.
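One such GRU-style gated iteration can be sketched in numpy. This is a simplified sketch under stated assumptions: node states are rows (so parameter matrices multiply on the right), neighbourhood aggregation is taken as Ã·H, and the random matrices stand in for trained parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ggnn_step(A_hat, H, params):
    """One gated graph network iteration: aggregate neighbours through
    the normalized adjacency A_hat, then let the update gate z and
    reset gate r decide how much aggregated information enters each
    node's new state."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    a = A_hat @ H                                   # neighbour aggregation
    z = sigmoid(a @ Wz + H @ Uz + bz)               # update gate
    r = sigmoid(a @ Wr + H @ Ur + br)               # reset gate
    h_tilde = np.tanh(a @ Wh + (r * H) @ Uh + bh)   # candidate state
    return h_tilde * z + H * (1.0 - z)              # gated combination

rng = np.random.default_rng(3)
n_nodes, dim = 5, 8
A_hat = np.eye(n_nodes)                             # stand-in normalized adjacency
H = rng.standard_normal((n_nodes, dim))             # h_0 = feature matrix X
Ws = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(6)]
zeros = np.zeros(dim)
params = (Ws[0], Ws[1], zeros, Ws[2], Ws[3], zeros, Ws[4], Ws[5], zeros)
H1 = ggnn_step(A_hat, H, params)
print(H1.shape)  # (5, 8)
```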
Specifically, the model stores all sentence nodes of the chapter in the first n rows of the feature matrix X. After t iterations of the network, the model takes the first n rows of h_t as the sentence vector representation sequence (e_1, …, e_i, …, e_n):

e_i = (h_t)_i,  i = 1, …, n
To verify the feasibility of the topic segmentation method based on the discourse structure graph network, this embodiment further sets up a comparison experiment. A chapter titled "Pasteur park", comprising 24 sentences and introducing a park named Pasteur, is selected; it is divided into 5 segments of sentences, 0-4, 5-12, 13-14, 15-19, and 20-24, whose annotated topic labels are, respectively: history, Badweil parks and Vorixi valleys, commercial areas, traffic, and demographic data. Topic segmentation is compared among the topic segmentation method of this embodiment (DSG-SEG), the prior-art sequence model (TextSeg), and the prior-art global model (Hier.BERT); the segmentation results are shown in Table 1.
TABLE 1
Table 1 lists, from top to bottom: the correct segment division of the chapter, the segmentation predictions of the prior-art global model (Hier.BERT) and sequence model (TextSeg), and those of the topic segmentation method based on the discourse structure graph network of this embodiment (DSG-SEG).

As can be seen from the table, the segmentation produced by the method of this embodiment is closest to the actual paragraph division.
In addition, the experiment also records the parameter counts and running times of the three methods, shown in Table 2:
TABLE 2
As can be seen from the table: owing to its two-layer Bi-LSTM, the sequence model (TextSeg) has a large number of parameters and trains slowly; the global model (Hier.BERT), owing to the quadratic computational complexity of the transformer, still shows poor time performance on the topic segmentation task; in contrast, the topic segmentation method based on the discourse structure graph network (DSG-SEG) has the fewest parameters and the fastest speed, training 1.6 times faster than TextSeg and 4.6 times faster than Hier.BERT.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be understood that the above examples are only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. And obvious variations or modifications therefrom are within the scope of the invention.
Claims (10)
1. A topic segmentation method based on a discourse structure diagram network is characterized in that: the method comprises the following steps:
step 1: constructing a discourse structure graph with words and sentences as nodes, and obtaining the feature matrix and adjacency matrix of the graph;

step 2: performing an integrated calculation on the feature matrix and the adjacency matrix to obtain a normalized symmetric adjacency matrix;

step 3: iterating the structure graph with a gated graph neural network, and generating indirect information interaction between sentence nodes through shared adjacent word nodes to obtain a sequence of sentence vector representations;

step 4: predicting segmentation points over the sentence vector sequence based on a bidirectional long short-term memory network model.
2. The topic segmentation method based on the discourse structure graph network as claimed in claim 1, wherein: in step 1, the word and sentence are taken as nodes to respectively obtain word nodes and sentence nodes, and the vectorization processing is carried out on the word nodes and the sentence nodes.
3. The topic segmentation method based on the discourse structure graph network as claimed in claim 2, wherein: in step 1, the vectorization process for word nodes adopts a word embedding method.
4. The topic segmentation method based on the discourse structure graph network as claimed in claim 2, wherein: in step 1, the vectorization of sentence nodes adopts the max-pooling method.
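As an illustrative sketch of the max-pooling vectorization in claim 4 (the embeddings and their dimension are hypothetical, not taken from the patent):

```python
import numpy as np

def max_pool_sentence(word_vectors):
    # Sentence-node vector: element-wise maximum over the sentence's word vectors.
    return np.max(np.stack(word_vectors), axis=0)

# Hypothetical 4-dimensional word embeddings for a two-word sentence.
w1 = np.array([0.1, -0.5, 0.3, 0.0])
w2 = np.array([0.4, -0.2, 0.1, 0.2])
s = max_pool_sentence([w1, w2])  # -> [0.4, -0.2, 0.3, 0.2]
```

Each coordinate of the sentence vector keeps the largest value that any of its words produced in that dimension.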
5. The topic segmentation method based on the discourse structure graph network as claimed in claim 1, wherein: the feature matrix of the structure graph in step 1 is $X\in\mathbb{R}^{n^{+}\times m}$, wherein $n^{+}$ denotes the total number of word and sentence nodes and $m$ denotes the feature dimension of a node; the first $n$ rows of $X$ store the $n$ sentence nodes, and word nodes are stored from row $n+1$ onward. The feature matrix $X$ is expressed as:

$$X=\left[\,s_1;\ \dots;\ s_n;\ w_1;\ \dots;\ w_{n^{+}-n}\,\right]$$

where $s_i$ denotes the vector of the $i$-th sentence node and $w_j$ the vector of the $j$-th word node.
6. The topic segmentation method based on the discourse structure graph network as claimed in claim 1, wherein: the structure graph in step 1 includes three types of edges: edges between word nodes, edges between word nodes and sentence nodes, and self-loop edges;
for the edges between word nodes, the PMI index is adopted to measure the edge weight;
for the edges between word nodes and sentence nodes, each sentence node is connected by an edge of weight 1 to every word node contained in that sentence;
for the self-loop edge, all word nodes and sentence nodes are provided with a self-loop edge with the weight of 1;
the adjacency matrix is represented by:

$$A_{ij}=\begin{cases}\mathrm{PMI}(i,j), & i,j \text{ are both word nodes}\\ 1, & i \text{ is a sentence node and } j \text{ a word node it contains (or vice versa)}\\ 1, & i=j\\ 0, & \text{otherwise}\end{cases}$$
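A minimal sketch of how claim 6's three edge types could populate an adjacency matrix, with sentence nodes stored first as in claim 5 (the function name, toy graph, and PMI value are illustrative, not from the patent):

```python
import numpy as np

def build_adjacency(n_sent, n_word, sent_words, pmi):
    # Claim 6's three edge types: word-word edges weighted by PMI,
    # sentence-word edges weighted 1, and self-loop edges weighted 1.
    n = n_sent + n_word
    A = np.eye(n)                                   # self-loop edges, weight 1
    for s, words in enumerate(sent_words):
        for w in words:                             # sentence-word edges, weight 1
            A[s, n_sent + w] = A[n_sent + w, s] = 1.0
    for (i, j), v in pmi.items():                   # word-word edges, PMI weight
        A[n_sent + i, n_sent + j] = A[n_sent + j, n_sent + i] = v
    return A

# Toy graph: 2 sentences over a 3-word vocabulary; the PMI value is illustrative.
A = build_adjacency(2, 3, [[0, 1], [1, 2]], {(0, 1): 0.7})
```

The matrix is kept symmetric because every edge type in the claim is undirected.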
7. The topic segmentation method based on the discourse structure graph network as claimed in claim 6, wherein: the calculation formula of the PMI index is as follows:

$$\mathrm{PMI}(i,j)=\log\frac{p(i,j)}{p(i)\,p(j)},\qquad p(i)=\frac{\#W(i)}{\#W},\qquad p(i,j)=\frac{\#W(i,j)}{\#W}$$

where $\#W$ denotes the number of all sliding windows, $\#W(i)$ denotes the number of windows in which word $i$ appears, $\#W(i,j)$ denotes the number of windows in which words $i$ and $j$ appear simultaneously, $p(i)$ and $p(j)$ denote the frequencies with which word $i$ and word $j$ appear over all sliding windows, and $p(i,j)$ denotes the frequency with which words $i$ and $j$ co-occur over all sliding windows.
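The sliding-window statistics of claim 7 can be sketched as follows (the tokenization and window size are illustrative choices, not specified by the patent):

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(tokens, window=3):
    # Claim 7's statistics: p(i) = #W(i)/#W, p(i,j) = #W(i,j)/#W,
    # PMI(i,j) = log(p(i,j) / (p(i) * p(j))).
    windows = [tokens[k:k + window]
               for k in range(max(1, len(tokens) - window + 1))]
    n_w = len(windows)                        # #W: number of sliding windows
    single, joint = Counter(), Counter()
    for w in windows:
        uniq = sorted(set(w))
        single.update(uniq)                   # #W(i)
        joint.update(combinations(uniq, 2))   # #W(i, j)
    return {(i, j): math.log((c / n_w) / ((single[i] / n_w) * (single[j] / n_w)))
            for (i, j), c in joint.items()}

# Toy example over a 4-token text with window size 2.
scores = pmi_scores(["a", "b", "a", "c"], window=2)
```

Each window contributes at most one count per word (and per pair), which is what the `#W(i)` window counts in the claim require.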
8. The topic segmentation method based on the discourse structure graph network as claimed in claim 1, wherein: the process of calculating the normalized symmetric adjacency matrix in step 2 is:
calculating a degree matrix from the adjacency matrix: $D_{ii}=\sum_{j} A_{ij}$, with all off-diagonal entries zero;
then computing the normalized symmetric adjacency matrix from the degree matrix $D$ and the adjacency matrix $A$: $\tilde{A}=D^{-1/2} A\, D^{-1/2}$.
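Claim 8's two-step normalization, as a minimal sketch (the toy matrix is illustrative):

```python
import numpy as np

def normalize_adjacency(A):
    # Claim 8: D_ii = sum_j A_ij, then A_norm = D^{-1/2} A D^{-1/2}.
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    return d_inv_sqrt @ A @ d_inv_sqrt

# A 2-node all-ones adjacency: every degree is 2, so every entry becomes 0.5.
A = np.array([[1.0, 1.0],
              [1.0, 1.0]])
A_norm = normalize_adjacency(A)
```

Because self-loops give every node degree at least 1, the inverse square root is always well defined here.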
9. the topic segmentation method based on the discourse structure graph network as claimed in claim 1, wherein: in step 3, the calculation process of iterating the structure diagram based on the gated graph neural network is as follows:
$$z_t=\sigma(W_z a_t+U_z h_{t-1}+b_z)$$

$$r_t=\sigma(W_r a_t+U_r h_{t-1}+b_r)$$
wherein $h_0$ denotes the initial node state, $\sigma$ is the sigmoid function, all $W$, $U$ and $b$ are trainable parameters, and $z_t$ and $r_t$ denote the update gate and the reset gate respectively, which determine how much of the adjacency information contributes to the current node.
10. The topic segmentation method based on the discourse structure graph network as claimed in claim 1, wherein: in step 4, the sentence vector representation sequence is fed into a bidirectional long short-term memory network model for training; the model maps the hidden-layer vector of each sentence to a probability value between 0 and 1 through a fully connected layer and a normalization layer. If the probability value is less than 0.5, the current sentence is not a topic transition point; if the probability value is greater than 0.5, a topic transition point occurs at the current sentence.
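Claim 10's thresholding rule, sketched with a hypothetical fully connected layer over illustrative hidden states (the BiLSTM itself is omitted; all weights and states are invented for the example):

```python
import numpy as np

def predict_boundaries(hidden, W, b):
    # Claim 10's decision rule: a fully connected layer plus a sigmoid maps each
    # sentence's hidden vector to a probability; p > 0.5 marks a topic transition.
    logits = hidden @ W + b
    probs = 1.0 / (1.0 + np.exp(-logits))
    return probs, probs > 0.5

# Hypothetical hidden states for a 3-sentence sequence (2-dimensional).
H = np.array([[0.2, -0.1], [1.5, 2.0], [-0.5, -0.4]])
W, b = np.array([1.0, 1.0]), 0.0
probs, boundary = predict_boundaries(H, W, b)  # boundary -> [True, True, False]
```

The threshold 0.5 is the one stated in the claim; in practice it corresponds to taking the more probable of the two classes.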
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110384669.5A CN113190662A (en) | 2021-04-09 | 2021-04-09 | Topic segmentation method based on discourse structure diagram network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113190662A true CN113190662A (en) | 2021-07-30 |
Family
ID=76975411
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110384669.5A Pending CN113190662A (en) | 2021-04-09 | 2021-04-09 | Topic segmentation method based on discourse structure diagram network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113190662A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170278510A1 (en) * | 2016-03-22 | 2017-09-28 | Sony Corporation | Electronic device, method and training method for natural language processing |
CN111695341A (en) * | 2020-06-16 | 2020-09-22 | 北京理工大学 | Implicit discourse relation analysis method and system based on discourse structure diagram convolution |
CN112487189A (en) * | 2020-12-08 | 2021-03-12 | 武汉大学 | Implicit discourse text relation classification method for graph-volume network enhancement |
Non-Patent Citations (1)
Title |
---|
XU Shaoyang et al.: "Topic Segmentation Based on a Discourse Structure Graph Network", GITHUB *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114972366A (en) * | 2022-07-27 | 2022-08-30 | 山东大学 | Full-automatic segmentation method and system for cerebral cortex surface based on graph network |
CN114972366B (en) * | 2022-07-27 | 2022-11-18 | 山东大学 | Full-automatic segmentation method and system for cerebral cortex surface based on graph network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113392651B (en) | Method, device, equipment and medium for training word weight model and extracting core words | |
CN110619121B (en) | Entity relation extraction method based on improved depth residual error network and attention mechanism | |
CN113360646B (en) | Text generation method, device and storage medium based on dynamic weight | |
CN109471944A (en) | Training method, device and the readable storage medium storing program for executing of textual classification model | |
CN112100397A (en) | Electric power plan knowledge graph construction method and system based on bidirectional gating circulation unit | |
CN110516034A (en) | Blog management method, device, the network equipment and readable storage medium storing program for executing | |
CN113378573A (en) | Content big data oriented small sample relation extraction method and device | |
CN111191825A (en) | User default prediction method and device and electronic equipment | |
CN116245107B (en) | Electric power audit text entity identification method, device, equipment and storage medium | |
CN116383399A (en) | Event public opinion risk prediction method and system | |
CN113761868A (en) | Text processing method and device, electronic equipment and readable storage medium | |
CN113920379B (en) | Zero sample image classification method based on knowledge assistance | |
CN112417890B (en) | Fine granularity entity classification method based on diversified semantic attention model | |
US20220156489A1 (en) | Machine learning techniques for identifying logical sections in unstructured data | |
CN113190662A (en) | Topic segmentation method based on discourse structure diagram network | |
CN113268985A (en) | Relationship path-based remote supervision relationship extraction method, device and medium | |
CN116431816B (en) | Document classification method, apparatus, device and computer readable storage medium | |
CN113836308B (en) | Network big data long text multi-label classification method, system, device and medium | |
WO2023137918A1 (en) | Text data analysis method and apparatus, model training method, and computer device | |
CN115640399A (en) | Text classification method, device, equipment and storage medium | |
CN115906846A (en) | Document-level named entity identification method based on double-graph hierarchical feature fusion | |
CN113987536A (en) | Method and device for determining security level of field in data table, electronic equipment and medium | |
CN112528015B (en) | Method and device for judging rumor in message interactive transmission | |
CN115081609A (en) | Acceleration method in intelligent decision, terminal equipment and storage medium | |
CN112926340A (en) | Semantic matching model for knowledge point positioning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20210730 |