CN113190662A - Topic segmentation method based on discourse structure graph network

Topic segmentation method based on discourse structure graph network

Info

Publication number
CN113190662A
Authority
CN
China
Prior art keywords
sentence
word
nodes
matrix
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110384669.5A
Other languages
Chinese (zh)
Inventor
徐邵洋 (Xu Shaoyang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University
Priority to CN202110384669.5A
Publication of CN113190662A
Legal status: Pending

Classifications

    • G - PHYSICS
        • G06 - COMPUTING; CALCULATING OR COUNTING
            • G06F - ELECTRIC DIGITAL DATA PROCESSING
                • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
                    • G06F 16/30 - Information retrieval of unstructured textual data
                        • G06F 16/33 - Querying
                            • G06F 16/332 - Query formulation
                                • G06F 16/3329 - Natural language query formulation or dialogue systems
                            • G06F 16/3331 - Query processing
                                • G06F 16/334 - Query execution
                                    • G06F 16/3344 - Query execution using natural language analysis
                • G06F 40/00 - Handling natural language data
                    • G06F 40/30 - Semantic analysis
            • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N 3/00 - Computing arrangements based on biological models
                    • G06N 3/02 - Neural networks
                        • G06N 3/04 - Architecture, e.g. interconnection topology
                            • G06N 3/044 - Recurrent networks, e.g. Hopfield networks
                            • G06N 3/045 - Combinations of networks
                        • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a topic segmentation method based on a discourse structure graph network, which comprises the following steps: step 1: constructing a discourse structure graph, taking words and sentences as nodes, to obtain the feature matrix and adjacency matrix of the graph; step 2: performing a joint calculation on the feature matrix and the adjacency matrix to obtain a normalized symmetric adjacency matrix; step 3: iterating the structure graph with a gated graph neural network, generating indirect information interaction between sentence nodes through shared neighboring word nodes, to obtain a sequence of sentence vector representations; step 4: predicting the segmentation points of the sentence vector representation sequence with a bidirectional long short-term memory (Bi-LSTM) network model. The invention can simultaneously solve the problems of insufficient global semantic modeling capability and high computational complexity in the prior art.

Description

Topic segmentation method based on discourse structure graph network
Technical Field
The invention relates to the technical field of topic segmentation in data mining, and in particular to a topic segmentation method based on a discourse structure graph network.
Background
In recent years, in order to obtain a deeper understanding of natural language, research emphasis in the field of natural language processing has gradually shifted from the level of characters, words, and sentences to larger-granularity semantic units such as paragraphs and discourses. Against this background, topic segmentation has developed rapidly and has become one of the most active research directions in natural language processing.
In real life, many documents are long and lack explicit topic divisions (such as news manuscripts, court records, and clinical reports). Faced with such documents, it is difficult for readers to quickly grasp the overall content or trace the overall structure. Topic segmentation addresses this problem: the task aims to segment an input document into paragraphs with well-defined topics, and the segmented document makes it convenient for readers to quickly grasp the topic distribution and browse the content of interest. Topic segmentation models in the prior art include the following two types:
1. Sequence model (TextSeg)
Consider a document containing n sentences. The sequence model converts each sentence into a probability value indicating whether the current sentence is a topic transition point. The first step of the sequence model is to convert each sentence into a vector. Specifically, consider a sentence containing m words: the sequence model first applies a word embedding model widely used in natural language processing, which converts each word into a 300-dimensional vector, so that the sentence can be represented as a matrix of size m × 300. Since the words of a sentence appear in a particular order, the sequence model further needs to capture the sequential relations between words within the sentence. It does so with a bidirectional long short-term memory (Bi-LSTM) network; assuming the output dimension of the network is h, the sentence is then represented as a matrix of size m × h. For each column of this matrix, the sequence model retains only the maximum of the m elements in that column, thereby converting the m × h sentence matrix into an h-dimensional vector. This process is called max-pooling and is widely used to obtain vector representations of sentences. The whole process of obtaining a sentence's vector representation is called sentence encoding (sentence embedding).
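For illustration only, the max-pooling step can be sketched as follows (a minimal sketch, not the patented implementation; the Bi-LSTM outputs are stubbed with random values):

```python
import numpy as np

def max_pool_sentence(word_states: np.ndarray) -> np.ndarray:
    """Collapse an (m, h) matrix of per-word Bi-LSTM outputs into an
    h-dimensional sentence vector by taking the column-wise maximum."""
    return word_states.max(axis=0)

# Toy example: a sentence of m = 7 words, Bi-LSTM output dimension h = 4.
rng = np.random.default_rng(0)
word_states = rng.normal(size=(7, 4))   # stand-in for real Bi-LSTM outputs
sentence_vec = max_pool_sentence(word_states)
print(sentence_vec.shape)               # (4,)
```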
Through sentence encoding, the sequence model converts each sentence into an h-dimensional vector. Within a document, a single sentence does not contain enough information to predict whether it is a topic transition point, so the sequence model further captures the sequential relations between sentences within the document. Again it uses a bidirectional LSTM; assuming the output dimension of this network is h', each sentence is represented as an h'-dimensional vector that contains not only the information of the sentence itself but also fuses information from the other sentences in the document.
Finally, the sequence model uses a linear layer to convert each h'-dimensional sentence vector into a probability value indicating whether the current sentence is a topic transition point.
The disadvantage of the sequence model is that it relies on bidirectional LSTMs to capture the sequential relations between words within sentences and between sentences within documents, but the relations between words are not merely sequential, since each sentence usually has a natural grammatical structure; likewise, the context within a document is complex, not a simple sequence. In other words, the semantic modeling capability of the sequence model is weak.
2. Global model (Hier.BERT)
The biggest difference between the global model and the sequence model is that the global model replaces the bidirectional LSTM with a Transformer model. As described above, the bidirectional LSTM can capture sequential relations between words within sentences and between sentences within documents. Specifically, for a sentence, an LSTM takes each word as input in order from left to right; when the current input is the i-th word, the memory cells of the network retain the information of all preceding words, so the output for the current word contains not only that word's information but also fuses the information of all words before it. A bidirectional LSTM uses two LSTMs to read the sentence from left to right and from right to left respectively, so that the output vector of each word contains the word's own information, the information of all preceding words, and the information of all following words. The Transformer model differs from the bidirectional LSTM in that, for a word in a sentence, it directly and globally fuses the information of all other words, without being limited to left-to-right and right-to-left sequential operations.
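The contrast can be made concrete with a minimal self-attention step (a generic sketch of the Transformer's global fusion, with identity query/key/value projections for brevity; not the Hier.BERT implementation):

```python
import numpy as np

def self_attention(X: np.ndarray) -> np.ndarray:
    """One self-attention pass: every word directly mixes information
    from all other words, regardless of their distance or order."""
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)                    # pairwise word-word affinities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax
    return weights @ X                               # each row fuses all words at once

X = np.random.default_rng(1).normal(size=(5, 8))     # 5 words, dimension 8
print(self_attention(X).shape)                       # (5, 8)
```

The single matrix product over all word pairs is precisely where the quadratic cost discussed below comes from.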
Although the global model built on the Transformer can, to some extent, solve the insufficient semantic modeling capability of the sequence model, it has high computational complexity and requires a large amount of computational resources in the topic segmentation task. In other words, most current hardware cannot support feeding complete documents directly into the model for calculation, and one must resort to truncating overlong sentences and splitting overlong documents into multiple pieces.
Disclosure of Invention
Therefore, the technical problem to be solved by the invention is to overcome the inadequacy of prior-art algorithm models for topic segmentation, and to provide a topic segmentation method based on a discourse structure graph network that simultaneously solves the problems of insufficient global semantic modeling capability and high computational complexity in the prior art.
In order to solve the technical problem, the invention provides a topic segmentation method based on a discourse structure graph network, which comprises the following steps:
step 1: constructing a discourse structure graph, taking words and sentences as nodes, to obtain the feature matrix and adjacency matrix of the graph;
step 2: performing a joint calculation on the feature matrix and the adjacency matrix to obtain a normalized symmetric adjacency matrix;
step 3: iterating the structure graph with a gated graph neural network, generating indirect information interaction between sentence nodes through shared neighboring word nodes, to obtain a sequence of sentence vector representations;
step 4: predicting the segmentation points of the sentence vector representation sequence with a bidirectional long short-term memory (Bi-LSTM) network model.
In one embodiment of the present invention, in step 1, word nodes and sentence nodes are obtained by using words and sentences as nodes, respectively, and vectorization processing is performed on the word nodes and the sentence nodes.
In one embodiment of the present invention, in step 1, the vectorization process for word nodes employs a word embedding method.
In one embodiment of the invention, in step 1, the vectorization processing of sentence nodes adopts the max-pooling method.
In one embodiment of the present invention, the feature matrix of the structure graph in step 1 is

$$X \in \mathbb{R}^{n_+ \times m}$$

where $n_+$ denotes the total number of word and sentence nodes and $m$ denotes the node feature dimension; the first $n$ rows of $X$ store the $n$ sentence nodes, and word nodes are stored from row $n+1$ onward. The feature matrix is expressed as:

$$X = \left[ x_{s_1}; \dots; x_{s_n}; x_{w_1}; \dots; x_{w_{n_+ - n}} \right]$$

where $s_i = (w^i_1, \dots, w^i_{l_i})$ denotes the word sequence of the $i$-th sentence and $l_i$ denotes its length.
In one embodiment of the present invention, the structure graph in step 1 includes three types of edges: edges between word nodes, edges between word nodes and sentence nodes, and self-loop edges;

for edges between word nodes, the PMI index is used to measure the edge weight;

for edges between word nodes and sentence nodes, a connecting edge with weight 1 exists between a sentence node and each word node it contains;

for self-loop edges, every word node and sentence node carries a self-loop edge with weight 1;

the adjacency matrix is expressed as:

$$A_{ij} = \begin{cases} \mathrm{PMI}(i,j), & i,\ j \text{ are word nodes and } \mathrm{PMI}(i,j) > 0 \\ 1, & \text{word node } j \text{ belongs to sentence node } i \text{ (or vice versa)} \\ 1, & i = j \\ 0, & \text{otherwise} \end{cases}$$
in an embodiment of the present invention, the calculation formula of the PMI index is:
$$\mathrm{PMI}(i,j) = \log \frac{p(i,j)}{p(i)\,p(j)}$$

$$p(i,j) = \frac{\#W(i,j)}{\#W}$$

$$p(i) = \frac{\#W(i)}{\#W}$$

where $\#W$ denotes the total number of sliding windows, $\#W(i)$ denotes the number of windows in which word $i$ appears, $\#W(i,j)$ denotes the number of windows in which word $i$ and word $j$ appear together, $p(i)$ and $p(j)$ denote the frequencies with which word $i$ and word $j$ appear across all sliding windows, and $p(i,j)$ denotes the frequency with which word $i$ and word $j$ co-occur within a window.
In one embodiment of the present invention, the process of calculating the normalized symmetric adjacency matrix in step 2 is:

the degree matrix is calculated from the adjacency matrix:

$$D_{ii} = \sum_{j} A_{ij}$$

and the normalized symmetric adjacency matrix is obtained using the degree matrix $D$ and the adjacency matrix $A$:

$$\tilde{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$$
in an embodiment of the present invention, in step 3, the calculation process for iterating the structure map based on the gated map neural network is as follows:
Figure RE-GDA0003125219020000056
zt=σ(Wzat+Uzht-1+bz)
rt=σ(Wrat+Urht-1+br)
Figure RE-GDA0003125219020000057
Figure RE-GDA0003125219020000058
wherein h is0Where σ is a sigmoid function, all W, U and b are trainable parameters, and z and r represent update gates, respectively, to determine how much adjacency information contributes to the current node.
In an embodiment of the present invention, in step 4, the sentence vector representation sequence is fed into a bidirectional long short-term memory (Bi-LSTM) network model for training; the model uses a fully connected layer and a normalization layer to map the hidden vector of each sentence to a probability value between 0 and 1; if the probability value is less than 0.5, the current sentence is not a topic transition point; if the probability value is greater than 0.5, a topic transition point occurs at the current sentence.
Compared with the prior art, the technical scheme of the invention has the following advantages:
the topic segmentation method based on the discourse structure graph network uses a graph neural network in a topic segmentation task, firstly, each discourse is independently constructed into a structure graph, and the structure graph comprises words, sentence nodes and adjacent relations among the word nodes, the words and the sentence nodes; then, the constructed structure graph is used as input, the gate control graph neural network is used for iteration, and indirect information interaction is generated among sentence nodes through common adjacent word nodes; finally, sentence vector representation with global semantic information is obtained and sent to a bidirectional long-time and short-time memory network for prediction of segmentation points, and the problems of insufficient global semantic information modeling capability and high calculation complexity in the prior art can be solved at the same time.
Drawings
In order that the present disclosure may be more readily and clearly understood, reference is now made to the following detailed description of the present disclosure taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flowchart illustrating the steps of the topic segmentation method based on the discourse structure graph network according to the present invention;
FIG. 2 is a structural flow chart of the topic segmentation method based on the discourse structure graph network of the present invention.
Detailed Description
The present invention is further described below in conjunction with the following figures and specific examples so that those skilled in the art may better understand the present invention and practice it, but the examples are not intended to limit the present invention.
Referring to fig. 1, the topic segmentation method based on the discourse structure graph network of the present invention includes the following steps:
step 1: constructing a discourse structure graph, taking words and sentences as nodes, to obtain the feature matrix and adjacency matrix of the graph;
step 2: performing a joint calculation on the feature matrix and the adjacency matrix to obtain a normalized symmetric adjacency matrix;
step 3: iterating the structure graph with a gated graph neural network, generating indirect information interaction between sentence nodes through shared neighboring word nodes, to obtain a sequence of sentence vector representations;
step 4: predicting the segmentation points of the sentence vector representation sequence with a bidirectional long short-term memory (Bi-LSTM) network model.
Referring to FIG. 2, the input in this embodiment is a document containing n sentences, regarded as a sentence sequence $(s_1, \dots, s_i, \dots, s_n)$, from which a discourse structure graph is constructed with words and sentences as nodes. The graph is iterated by a gated graph neural network (GGNN), generating indirect information interaction between sentence nodes through shared neighboring word nodes, to obtain the sentence vector representation sequence $(e_1, \dots, e_i, \dots, e_n)$. This sequence is fed into a bidirectional long short-term memory (Bi-LSTM) network model for training, yielding the hidden vector sequence $(h_1, \dots, h_i, \dots, h_n)$ of the sentences. Finally, a fully connected layer and a normalization layer (softmax) map each hidden vector $h_i$ to a probability value between 0 and 1; if the probability value is less than 0.5, the current sentence is not a topic transition point; if the probability value is greater than 0.5, a topic transition point occurs at the current sentence.
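A minimal sketch of this final prediction stage is given below (in PyTorch, with illustrative dimensions; the GGNN outputs are stubbed with random tensors, whereas the real model is trained end to end):

```python
import torch
import torch.nn as nn

n, d, h = 24, 256, 128                     # sentences, e_i dim, LSTM size (illustrative)
e = torch.randn(1, n, d)                   # stand-in for GGNN sentence vectors e_1..e_n
bilstm = nn.LSTM(d, h, bidirectional=True, batch_first=True)
fc = nn.Linear(2 * h, 2)                   # fully connected layer
hidden, _ = bilstm(e)                      # hidden vectors h_1..h_n, shape (1, n, 2h)
probs = torch.softmax(fc(hidden), dim=-1)  # normalization layer (softmax)
boundaries = (probs[0, :, 1] > 0.5).nonzero().squeeze(-1)
print(boundaries)                          # sentences predicted as transition points
```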
Specifically, in step 1, words and sentences are taken as nodes, yielding word nodes and sentence nodes respectively, and both are vectorized. Word nodes are vectorized with a word embedding method, which alleviates the cold-start problem and brings more accurate semantic information; sentence nodes are vectorized with the max-pooling method.
Specifically, the feature matrix of the structure graph in step 1 is

$$X \in \mathbb{R}^{n_+ \times m}$$

where $n_+$ denotes the total number of word and sentence nodes and $m$ denotes the node feature dimension; the first $n$ rows of $X$ store the $n$ sentence nodes, and word nodes are stored from row $n+1$ onward. The feature matrix is expressed as:

$$X = \left[ x_{s_1}; \dots; x_{s_n}; x_{w_1}; \dots; x_{w_{n_+ - n}} \right]$$

where $s_i = (w^i_1, \dots, w^i_{l_i})$ denotes the word sequence of the $i$-th sentence and $l_i$ denotes its length.
The structure graph in step 1 includes three types of edges: edges between word nodes, edges between word nodes and sentence nodes, and self-loop edges.
for edges between word nodes, a PMI (co-occurrence information) index is adopted to measure the weight of the edges; specifically, the model uses a fixed-size sliding window to slide across all sentences of a chapter to compute the PMI index between words. For a given word pair < i, j >, the PMI index is calculated as follows:
$$\mathrm{PMI}(i,j) = \log \frac{p(i,j)}{p(i)\,p(j)}$$

$$p(i,j) = \frac{\#W(i,j)}{\#W}$$

$$p(i) = \frac{\#W(i)}{\#W}$$
where $\#W$ denotes the total number of sliding windows, $\#W(i)$ denotes the number of windows in which word $i$ appears, $\#W(i,j)$ denotes the number of windows in which word $i$ and word $j$ appear together, $p(i)$ and $p(j)$ denote the frequencies with which word $i$ and word $j$ appear across all sliding windows, and $p(i,j)$ denotes the frequency with which word $i$ and word $j$ co-occur within a window. Since a negative PMI index indicates a very low degree of semantic relevance, the model only preserves edges between word nodes whose PMI index is positive.
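The window counting and PMI filtering can be sketched as follows (a minimal sketch over a flattened token stream; the window size in the toy call is illustrative, not the value used by the model):

```python
import numpy as np
from collections import Counter
from itertools import combinations

def pmi_edges(tokens, window=10):
    """Slide a fixed-size window over the token stream and keep only
    word pairs whose PMI index is positive, as described above."""
    windows = [tokens[i:i + window]
               for i in range(max(1, len(tokens) - window + 1))]
    n_w = len(windows)                               # #W
    occ, cooc = Counter(), Counter()
    for w in windows:
        uniq = set(w)
        occ.update(uniq)                             # #W(i)
        cooc.update(combinations(sorted(uniq), 2))   # #W(i, j)
    edges = {}
    for (i, j), nij in cooc.items():
        pmi = np.log((nij / n_w) / ((occ[i] / n_w) * (occ[j] / n_w)))
        if pmi > 0:                                  # drop non-positive PMI edges
            edges[(i, j)] = pmi
    return edges

print(pmi_edges("the park has a lake the lake has fish".split(), window=4))
```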
For edges between word nodes and sentence nodes, a connecting edge with weight 1 exists between a sentence node and each word node it contains.
For self-loop edges, every word node and sentence node carries a self-loop edge with weight 1, so that each node can attend to its neighbors' information while retaining what it has already learned.
Therefore, the adjacency matrix is expressed as:

$$A_{ij} = \begin{cases} \mathrm{PMI}(i,j), & i,\ j \text{ are word nodes and } \mathrm{PMI}(i,j) > 0 \\ 1, & \text{word node } j \text{ belongs to sentence node } i \text{ (or vice versa)} \\ 1, & i = j \\ 0, & \text{otherwise} \end{cases}$$
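Assembling this adjacency matrix can be sketched as follows (a minimal sketch with illustrative word ids; it mirrors the node layout in which sentence rows precede word rows):

```python
import numpy as np

def build_adjacency(n_sent, n_word, sent_words, pmi_edges):
    """Assemble the (n_+, n_+) adjacency matrix: sentence nodes occupy
    rows 0..n_sent-1 and word nodes follow, mirroring the feature matrix."""
    n_total = n_sent + n_word
    A = np.eye(n_total)                        # self-loop edges, weight 1
    for s, words in enumerate(sent_words):     # sentence-word edges, weight 1
        for w in words:
            A[s, n_sent + w] = A[n_sent + w, s] = 1.0
    for (i, j), weight in pmi_edges.items():   # word-word edges, PMI weight
        A[n_sent + i, n_sent + j] = A[n_sent + j, n_sent + i] = weight
    return A

# Toy document: 2 sentences over a 3-word vocabulary (word ids 0, 1, 2).
A = build_adjacency(2, 3, sent_words=[[0, 1], [1, 2]], pmi_edges={(0, 1): 0.7})
print(A.shape)   # (5, 5)
```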
specifically, the process of calculating the normalized symmetric adjacency matrix in step 2 is as follows:
the degree matrix is calculated from the adjacency matrix:

$$D_{ii} = \sum_{j} A_{ij}$$

and the normalized symmetric adjacency matrix is obtained using the degree matrix $D$ and the adjacency matrix $A$:

$$\tilde{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$$
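In code this normalization is a few lines (a sketch; the self-loops added during graph construction guarantee that every node degree is positive):

```python
import numpy as np

def normalize_adjacency(A):
    """Return D^(-1/2) A D^(-1/2) for a symmetric adjacency matrix A."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))  # diagonal of D^(-1/2)
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

A_hat = normalize_adjacency(A)   # A from the construction sketch above
```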
specifically, in step 3, the calculation process of iterating the structure map based on the gated graph neural network is as follows:
Figure RE-GDA0003125219020000093
zt=σ(Wzat+Uzht-1+bz)
rt=σ(Wrat+Urht-1+br)
Figure RE-GDA0003125219020000094
Figure RE-GDA0003125219020000095
wherein h is0Where σ is a sigmoid function, all W, U and b are trainable parameters, and z and r represent update gates, respectively, to determine how much adjacency information contributes to the current node.
After one round of network iteration, a sentence node aggregates the information of its neighboring word nodes, and a word node aggregates the information of other neighboring word nodes with high semantic relevance. After two or more iterations, even sentence nodes with no direct edge between them exchange information indirectly through shared neighboring word nodes.
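The iteration can be sketched as follows (a minimal NumPy sketch of the GRU-style update above; parameter shapes, initialization, and the row-major matrix orientation are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ggnn_step(A_hat, h, params):
    """One gated iteration: aggregate neighbor features through the
    normalized adjacency A_hat, then apply the GRU-style gated update."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    a = A_hat @ h                                  # a_t: neighbor aggregation
    z = sigmoid(a @ Wz + h @ Uz + bz)              # update gate z_t
    r = sigmoid(a @ Wr + h @ Ur + br)              # reset gate r_t
    h_cand = np.tanh(a @ Wh + (r * h) @ Uh + bh)   # candidate state
    return z * h_cand + (1 - z) * h                # h_t

rng = np.random.default_rng(0)
n_nodes, m = 5, 16                                 # illustrative sizes
A_hat = np.eye(n_nodes)                            # stand-in normalized adjacency
h = rng.normal(size=(n_nodes, m))                  # h_0 = feature matrix X
W = lambda: rng.normal(scale=0.1, size=(m, m))
params = [W(), W(), np.zeros(m), W(), W(), np.zeros(m), W(), W(), np.zeros(m)]
for _ in range(2):                                 # two or more iterations let
    h = ggnn_step(A_hat, h, params)                # sentences interact indirectly
print(h.shape)                                     # (5, 16)
```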
Specifically, the model stores all sentence nodes of the document in the first n rows of the feature matrix X. After t iterations of the network, the model takes the first n rows of $h_t$ as the sentence vector representation sequence $(e_1, \dots, e_i, \dots, e_n)$:

$$e_i = h_t[i], \quad 1 \le i \le n$$
In order to verify the feasibility of the topic segmentation method based on the discourse structure graph network, this embodiment further sets up a comparison experiment. A document titled "Pasteur Park", consisting of 24 sentences and introducing a park named Pasteur, is selected; it is divided into 5 segments: 0-4, 5-12, 13-14, 15-19 and 20-24, whose labeled topics are: history, Badweil Park and Vorixi Valley, commercial area, traffic, and demographic data. Topic segmentation is compared among the topic segmentation method of this embodiment (DSG-SEG), the prior-art sequence model (TextSeg), and the prior-art global model (Hier.BERT); the segmentation results are shown in Table 1:
TABLE 1 (image not reproduced in this text)
Table 1 lists, from top to bottom: the correct paragraph segmentation of the document; the predictions of the prior-art global model (Hier.BERT) and sequence model (TextSeg); and the prediction of the topic segmentation method (DSG-SEG) based on the discourse structure graph network of this embodiment.
As can be seen from the table, the segmentation produced by the method of this embodiment is closest to the actual paragraph segmentation.
In addition, the experiment also records the parameter counts and running times of the three methods, as shown in Table 2:
TABLE 2 (image not reproduced in this text)
As can be seen from the table: due to its double-layer Bi-LSTM, the sequence model (TextSeg) has a large number of parameters and trains slowly; in the global model (Hier.BERT), the quadratic computational complexity of the Transformer model leads to poor time performance on the topic segmentation task; in contrast, the topic segmentation method based on the discourse structure graph network (DSG-SEG) has the fewest parameters and the fastest speed, training 1.6 times faster than TextSeg and 4.6 times faster than Hier.BERT.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be understood that the above examples are only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to exhaust all embodiments here, and obvious variations or modifications derived therefrom remain within the scope of the invention.

Claims (10)

1. A topic segmentation method based on a discourse structure graph network, characterized in that the method comprises the following steps:
step 1: constructing a discourse structure graph, taking words and sentences as nodes, to obtain the feature matrix and adjacency matrix of the graph;
step 2: performing a joint calculation on the feature matrix and the adjacency matrix to obtain a normalized symmetric adjacency matrix;
step 3: iterating the structure graph with a gated graph neural network, generating indirect information interaction between sentence nodes through shared neighboring word nodes, to obtain a sequence of sentence vector representations;
step 4: predicting the segmentation points of the sentence vector representation sequence with a bidirectional long short-term memory (Bi-LSTM) network model.
2. The topic segmentation method based on the discourse structure graph network as claimed in claim 1, wherein: in step 1, words and sentences are taken as nodes to obtain word nodes and sentence nodes, respectively, and vectorization processing is performed on the word nodes and the sentence nodes.
3. The topic segmentation method based on the discourse structure graph network as claimed in claim 2, wherein: in step 1, the vectorization process for word nodes adopts a word embedding method.
4. The topic segmentation method based on the discourse structure graph network as claimed in claim 2, wherein: in step 1, the vectorization processing of sentence nodes adopts the max-pooling method.
5. The topic segmentation method based on the discourse structure graph network as claimed in claim 1, wherein: the feature matrix of the structure graph in step 1 is

$$X \in \mathbb{R}^{n_+ \times m}$$

where $n_+$ denotes the total number of word and sentence nodes and $m$ denotes the node feature dimension; the first $n$ rows of $X$ store the $n$ sentence nodes, and word nodes are stored from row $n+1$ onward; the feature matrix is expressed as:

$$X = \left[ x_{s_1}; \dots; x_{s_n}; x_{w_1}; \dots; x_{w_{n_+ - n}} \right]$$

where $s_i = (w^i_1, \dots, w^i_{l_i})$ denotes the word sequence of the $i$-th sentence and $l_i$ denotes its length.
6. The topic segmentation method based on the discourse structure graph network as claimed in claim 1, wherein: the structure graph in step 1 includes three types of edges: edges between word nodes, edges between word nodes and sentence nodes, and self-loop edges;

for edges between word nodes, the PMI index is used to measure the edge weight;

for edges between word nodes and sentence nodes, a connecting edge with weight 1 exists between a sentence node and each word node it contains;

for self-loop edges, every word node and sentence node carries a self-loop edge with weight 1;

the adjacency matrix is expressed as:

$$A_{ij} = \begin{cases} \mathrm{PMI}(i,j), & i,\ j \text{ are word nodes and } \mathrm{PMI}(i,j) > 0 \\ 1, & \text{word node } j \text{ belongs to sentence node } i \text{ (or vice versa)} \\ 1, & i = j \\ 0, & \text{otherwise} \end{cases}$$
7. the topic segmentation method based on the discourse structure graph network as claimed in claim 6, wherein: the calculation formula of the PMI index is as follows:
$$\mathrm{PMI}(i,j) = \log \frac{p(i,j)}{p(i)\,p(j)}$$

$$p(i,j) = \frac{\#W(i,j)}{\#W}$$

$$p(i) = \frac{\#W(i)}{\#W}$$

where $\#W$ denotes the total number of sliding windows, $\#W(i)$ denotes the number of windows in which word $i$ appears, $\#W(i,j)$ denotes the number of windows in which word $i$ and word $j$ appear together, $p(i)$ and $p(j)$ denote the frequencies with which word $i$ and word $j$ appear across all sliding windows, and $p(i,j)$ denotes the frequency with which word $i$ and word $j$ co-occur within a window.
8. The topic segmentation method based on the discourse structure graph network as claimed in claim 1, wherein: the process of calculating the normalized symmetric adjacency matrix in step 2 is:
the degree matrix is calculated from the adjacency matrix:

$$D_{ii} = \sum_{j} A_{ij}$$

and the normalized symmetric adjacency matrix is obtained using the degree matrix $D$ and the adjacency matrix $A$:

$$\tilde{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$$
9. The topic segmentation method based on the discourse structure graph network as claimed in claim 1, wherein: in step 3, the calculation process of iterating the structure graph based on the gated graph neural network is as follows:

$$a_t = \tilde{A} h_{t-1}$$

$$z_t = \sigma(W_z a_t + U_z h_{t-1} + b_z)$$

$$r_t = \sigma(W_r a_t + U_r h_{t-1} + b_r)$$

$$\tilde{h}_t = \tanh(W_h a_t + U_h (r_t \odot h_{t-1}) + b_h)$$

$$h_t = \tilde{h}_t \odot z_t + h_{t-1} \odot (1 - z_t)$$

where $h_0 = X$, $\sigma$ is the sigmoid function, all $W$, $U$ and $b$ are trainable parameters, and $z_t$ and $r_t$ denote the update gate and the reset gate, respectively, which determine how much neighbor information contributes to the current node.
10. The topic segmentation method based on the discourse structure graph network as claimed in claim 1, wherein: in step 4, the sentence vector representation sequence is fed into a bidirectional long short-term memory (Bi-LSTM) network model for training; the model uses a fully connected layer and a normalization layer to map the hidden vector of each sentence to a probability value between 0 and 1; if the probability value is less than 0.5, the current sentence is not a topic transition point; if the probability value is greater than 0.5, a topic transition point occurs at the current sentence.
CN202110384669.5A 2021-04-09 2021-04-09 Topic segmentation method based on discourse structure graph network Pending CN113190662A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110384669.5A CN113190662A (en) Topic segmentation method based on discourse structure graph network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110384669.5A CN113190662A (en) Topic segmentation method based on discourse structure graph network

Publications (1)

Publication Number Publication Date
CN113190662A 2021-07-30

Family

ID=76975411

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110384669.5A Pending CN113190662A (en) 2021-04-09 2021-04-09 Topic segmentation method based on discourse structure diagram network

Country Status (1)

Country Link
CN (1) CN113190662A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114972366A (en) * 2022-07-27 2022-08-30 山东大学 Full-automatic segmentation method and system for cerebral cortex surface based on graph network


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170278510A1 (en) * 2016-03-22 2017-09-28 Sony Corporation Electronic device, method and training method for natural language processing
CN111695341A (en) * 2020-06-16 2020-09-22 北京理工大学 Implicit discourse relation analysis method and system based on discourse structure diagram convolution
CN112487189A (en) * 2020-12-08 2021-03-12 武汉大学 Implicit discourse text relation classification method for graph-volume network enhancement

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
徐邵洋 (Xu Shaoyang) et al.: "Topic Segmentation Based on Discourse Structure Graph Network" (基于篇章结构图网络的话题分割), GITHUB *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114972366A (en) * 2022-07-27 2022-08-30 山东大学 Full-automatic segmentation method and system for cerebral cortex surface based on graph network
CN114972366B (en) * 2022-07-27 2022-11-18 山东大学 Full-automatic segmentation method and system for cerebral cortex surface based on graph network

Similar Documents

Publication Publication Date Title
CN113392651B (en) Method, device, equipment and medium for training word weight model and extracting core words
CN110619121B (en) Entity relation extraction method based on improved depth residual error network and attention mechanism
CN113360646B (en) Text generation method, device and storage medium based on dynamic weight
CN109471944A (en) Training method, device and the readable storage medium storing program for executing of textual classification model
CN112100397A (en) Electric power plan knowledge graph construction method and system based on bidirectional gating circulation unit
CN110516034A (en) Blog management method, device, the network equipment and readable storage medium storing program for executing
CN113378573A (en) Content big data oriented small sample relation extraction method and device
CN111191825A (en) User default prediction method and device and electronic equipment
CN116245107B (en) Electric power audit text entity identification method, device, equipment and storage medium
CN116383399A (en) Event public opinion risk prediction method and system
CN113761868A (en) Text processing method and device, electronic equipment and readable storage medium
CN113920379B (en) Zero sample image classification method based on knowledge assistance
CN112417890B (en) Fine granularity entity classification method based on diversified semantic attention model
US20220156489A1 (en) Machine learning techniques for identifying logical sections in unstructured data
CN113190662A (en) Topic segmentation method based on discourse structure diagram network
CN113268985A (en) Relationship path-based remote supervision relationship extraction method, device and medium
CN116431816B (en) Document classification method, apparatus, device and computer readable storage medium
CN113836308B (en) Network big data long text multi-label classification method, system, device and medium
WO2023137918A1 (en) Text data analysis method and apparatus, model training method, and computer device
CN115640399A (en) Text classification method, device, equipment and storage medium
CN115906846A (en) Document-level named entity identification method based on double-graph hierarchical feature fusion
CN113987536A (en) Method and device for determining security level of field in data table, electronic equipment and medium
CN112528015B (en) Method and device for judging rumor in message interactive transmission
CN115081609A (en) Acceleration method in intelligent decision, terminal equipment and storage medium
CN112926340A (en) Semantic matching model for knowledge point positioning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210730)