CN113190662A - Topic segmentation method based on discourse structure diagram network - Google Patents
- Publication number: CN113190662A (application CN202110384669.5A)
- Authority
- CN
- China
- Prior art keywords: sentence, word, nodes, matrix, network
- Legal status: Pending (status as listed on the published record; not a legal conclusion)
Classifications
- G06F16/3329—Natural language query formulation or dialogue systems
- G06F16/3344—Query execution using natural language analysis
- G06F40/30—Semantic analysis
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
Abstract
The invention relates to a topic segmentation method based on a discourse structure graph network, which comprises the following steps. Step 1: construct a discourse structure graph with words and sentences as nodes, and obtain the feature matrix and adjacency matrix of the graph. Step 2: perform an integrated calculation on the feature matrix and the adjacency matrix to obtain a normalized symmetric adjacency matrix. Step 3: iterate the structure graph with a gated graph neural network, generating indirect information interaction between sentence nodes through shared adjacent word nodes to obtain a sequence of sentence vector representations. Step 4: predict segmentation points over the sentence vector sequence with a bidirectional long short-term memory network model. The invention simultaneously addresses two shortcomings of the prior art: insufficient capability to model global semantic information, and high computational complexity.
Description
Technical Field
The invention relates to the technical field of topic segmentation in data mining, in particular to a topic segmentation method based on a discourse structure diagram network.
Background
In recent years, in pursuit of a deeper understanding of natural language, research emphasis in natural language processing has gradually shifted from the level of characters, words, and sentences to larger-granularity semantic units such as paragraphs and chapters. Against this background, topic segmentation has developed rapidly and is now one of the most active research directions in the field.
In real life, many chapters are long and have no explicit topic division (news manuscripts, court records, clinical reports, and the like). Faced with such a chapter, readers find it difficult to quickly grasp its overall content or trace its overall structure. Topic segmentation addresses this problem: the task aims to segment the input chapter into paragraphs, each with a well-defined topic, so that readers can quickly grasp the chapter's topic distribution and browse the content they are interested in. The topic segmentation models in the prior art include the following two types:
1. Sequence model (TextSeg)

Consider a chapter containing n sentences. The sequence model converts each sentence into a probability value indicating whether the current sentence is a topic transition point. The first step the sequence model performs is to convert each sentence into a vector. Specifically, for a sentence containing m words, the sequence model first applies a word embedding model, widely used in natural language processing, which converts each word into a 300-dimensional vector; the sentence can then be represented as a matrix of size m x 300. Since each sentence is composed of words in a certain order, the sequence model further captures the sequential relationships between the words of the sentence. It does so with a bidirectional long short-term memory (Bi-LSTM) network; assuming the output dimension of the network is h, the sentence is now represented as a matrix of size m x h. For each column of this matrix, the sequence model retains only the maximum over the m rows, converting the m x h sentence matrix into an h-dimensional vector. This process is called max-pooling and is widely used to obtain a vector representation of a sentence. The whole process of obtaining a sentence's vector representation is called sentence encoding (sentence embedding).
Through sentence encoding, the sequence model converts each sentence into an h-dimensional vector. Within a chapter, a single sentence carries too little information to predict on its own whether it is a topic transition point, so the sequence model must further capture the sequential relationships between the sentences of the chapter. It again uses a bidirectional long short-term memory network for this; assuming its output dimension is h', each sentence is represented as an h'-dimensional vector that contains not only the information of the sentence itself but also fuses information from the other sentences of the chapter.

Finally, the sequence model uses a linear layer to convert each h'-dimensional sentence vector into a probability value representing whether the current sentence is a topic transition point.
The disadvantage of the sequence model is this: it uses bidirectional long short-term memory networks to capture the sequential relations among words within sentences and among sentences within chapters, but the relations among words are not merely sequential, since every sentence has a natural grammatical structure; likewise, the relations among sentences within a chapter are complex, not a simple sequence. In other words, the semantic information modeling capability of the sequence model is weak.
2. Global model (Hier. BERT)
The biggest difference between the global model and the sequence model is that the global model uses a transformer model in place of the bidirectional long short-term memory network. As described above, the Bi-LSTM captures the sequential relationships between words within sentences and between sentences within chapters. Specifically, for a sentence, an LSTM takes the words of the sentence as input in order from left to right; when the current input is the i-th word, the memory cells of the network retain the information of all preceding words, so the output for the current word contains not only its own information but also the fused information of all words before it. A bidirectional LSTM runs two LSTMs over the sentence, one left to right and one right to left, so the output vector of each word contains the information of the word itself, of all preceding words, and of all following words. The transformer differs: for a word in a sentence, it directly and globally fuses the information of all other words, without being restricted to left-to-right or right-to-left sequential operations.
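The contrast can be illustrated with a minimal scaled dot-product self-attention sketch, in which each word's output directly mixes all other words' vectors with no left-to-right order; the dimensions and random inputs are arbitrary illustrations, not the global model's actual configuration:

```python
import numpy as np

def self_attention(X: np.ndarray) -> np.ndarray:
    """Scaled dot-product self-attention: every word's output directly
    fuses information from all words, with no sequential scanning."""
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)                   # pairwise word-word affinities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over all words
    return weights @ X                              # each output fuses every input

rng = np.random.default_rng(1)
words = rng.standard_normal((5, 8))  # a 5-word sentence, 8-dim word vectors
out = self_attention(words)
print(out.shape)  # (5, 8)
```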
Although a global model built on the transformer can to some extent solve the sequence model's insufficient semantic modeling capability, it has high computational complexity and requires substantial computational resources when applied to the topic segmentation task. In other words, most current hardware cannot support feeding a complete chapter directly into the model; sacrifices must be made, such as truncating overlong sentences and splitting overlong chapters into multiple parts.
Disclosure of Invention
Therefore, the technical problem to be solved by the invention is to overcome the shortcomings of prior-art models on the topic segmentation task, and to provide a topic segmentation method based on a discourse structure graph network that simultaneously solves the prior art's insufficient global semantic information modeling capability and high computational complexity.
In order to solve the technical problem, the invention provides a topic segmentation method based on a discourse structure graph network, comprising the following steps:

Step 1: construct a discourse structure graph with words and sentences as nodes, and obtain the feature matrix and adjacency matrix of the graph;

Step 2: perform an integrated calculation on the feature matrix and the adjacency matrix to obtain a normalized symmetric adjacency matrix;

Step 3: iterate the structure graph with a gated graph neural network, generating indirect information interaction between sentence nodes through shared adjacent word nodes to obtain a sequence of sentence vector representations;

Step 4: predict segmentation points over the sentence vector sequence with a bidirectional long short-term memory network model.
In one embodiment of the present invention, in step 1, word nodes and sentence nodes are obtained by using words and sentences as nodes, respectively, and vectorization processing is performed on the word nodes and the sentence nodes.
In one embodiment of the present invention, in step 1, the vectorization process for word nodes employs a word embedding method.
In one embodiment of the invention, in step 1, the vectorization of sentence nodes employs the max-pooling method.
In one embodiment of the present invention, the feature matrix of the structure graph in step 1 is X ∈ R^(n+ × m), where n+ represents the total number of word and sentence nodes and m represents the feature dimension of a node. The first n rows of X store the n sentence nodes, and word nodes are stored from row n+1 onward. Each sentence is expressed as

s_i = (w_{i,1}, w_{i,2}, …, w_{i,l_i})

where s_i denotes the word sequence of the i-th sentence and l_i denotes its length.
In one embodiment of the present invention, the structure diagram in step 1 includes three adjacent edges: edges among word nodes, edges among word nodes and sentence nodes, and self-looping edges;
for the edges between the word nodes, the PMI indexes are adopted to measure the weight of the edges;
for the edge between the word node and the sentence node, in one sentence node, a connecting edge with the weight of 1 exists between the sentence node and each word node contained in the sentence node;
for the self-loop edge, all word nodes and sentence nodes are provided with a self-loop edge with the weight of 1;
the adjacency matrix is represented by:
in an embodiment of the present invention, the calculation formula of the PMI index is:
where, # W denotes the number of all the sliding windows, # W (i) denotes the number of windows in which the word i appears, # W (i, j) denotes the number of windows in which the word i and the word j appear simultaneously, # W (i), p (j) denote the frequency of appearance of the word i and the word j, respectively, in all the sliding windows, and p (i, j) denotes the frequency of appearance of the word i and the word j, respectively, in all the sliding windows.
In one embodiment of the present invention, the normalized symmetric adjacency matrix in step 2 is calculated as follows.

A degree matrix is computed from the adjacency matrix:

D_{ii} = Σ_j A_{ij}

The degree matrix D and the adjacency matrix A then yield the normalized symmetric adjacency matrix:

Ã = D^(−1/2) A D^(−1/2)
in an embodiment of the present invention, in step 3, the calculation process for iterating the structure map based on the gated map neural network is as follows:
zt=σ(Wzat+Uzht-1+bz)
rt=σ(Wrat+Urht-1+br)
wherein h is0Where σ is a sigmoid function, all W, U and b are trainable parameters, and z and r represent update gates, respectively, to determine how much adjacency information contributes to the current node.
In an embodiment of the present invention, in step 4, the sentence vector sequence is fed into a bidirectional long short-term memory network model for training; a fully connected layer and a normalization layer map each sentence's hidden vector to a probability value between 0 and 1. A probability below 0.5 indicates that the current sentence is not a topic transition point; a probability above 0.5 indicates that a topic transition occurs at the current sentence.
Compared with the prior art, the technical scheme of the invention has the following advantages:
the topic segmentation method based on the discourse structure graph network uses a graph neural network in a topic segmentation task, firstly, each discourse is independently constructed into a structure graph, and the structure graph comprises words, sentence nodes and adjacent relations among the word nodes, the words and the sentence nodes; then, the constructed structure graph is used as input, the gate control graph neural network is used for iteration, and indirect information interaction is generated among sentence nodes through common adjacent word nodes; finally, sentence vector representation with global semantic information is obtained and sent to a bidirectional long-time and short-time memory network for prediction of segmentation points, and the problems of insufficient global semantic information modeling capability and high calculation complexity in the prior art can be solved at the same time.
Drawings
In order that the present disclosure may be more readily and clearly understood, reference is now made to the following detailed description of the present disclosure taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flowchart illustrating the steps of the topic segmentation method based on the discourse structure graph network according to the present invention;
FIG. 2 is a structural flow chart of the topic segmentation method based on the discourse structure graph network of the present invention;
Detailed Description
The present invention is further described below in conjunction with the following figures and specific examples so that those skilled in the art may better understand the present invention and practice it, but the examples are not intended to limit the present invention.
Referring to FIG. 1, the topic segmentation method based on the discourse structure graph network of the present invention comprises the following steps:

Step 1: construct a discourse structure graph with words and sentences as nodes, and obtain the feature matrix and adjacency matrix of the graph;

Step 2: perform an integrated calculation on the feature matrix and the adjacency matrix to obtain a normalized symmetric adjacency matrix;

Step 3: iterate the structure graph with a gated graph neural network, generating indirect information interaction between sentence nodes through shared adjacent word nodes to obtain a sequence of sentence vector representations;

Step 4: predict segmentation points over the sentence vector sequence with a bidirectional long short-term memory network model.
Referring to FIG. 2, the input in this embodiment is a chapter containing n sentences, viewed as a sentence sequence (s_1, …, s_i, …, s_n). A discourse structure graph is constructed with words and sentences as nodes; the graph is iterated by a gated graph neural network (GGNN), indirect information interaction between sentence nodes is generated through shared adjacent word nodes, and a sentence vector representation sequence (e_1, …, e_i, …, e_n) is obtained. This sequence is fed into a bidirectional long short-term memory network model (Bi-LSTM) for training, yielding the hidden vector sequence (h_1, …, h_i, …, h_n). Finally, a fully connected layer and a normalization layer (softmax) map each sentence's hidden vector to a probability value between 0 and 1: a probability below 0.5 indicates that the current sentence is not a topic transition point, while a probability above 0.5 indicates that a topic transition occurs at the current sentence.
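The final prediction step (fully connected layer plus softmax, thresholded at 0.5) can be sketched as follows; the random weights and dimensions below are stand-ins for trained parameters, not the patented implementation:

```python
import numpy as np

def predict_boundaries(H: np.ndarray, W: np.ndarray, b: np.ndarray) -> list:
    """Map each sentence's hidden vector h_i to a boundary probability
    via a linear layer + two-class softmax; probability > 0.5 marks a
    topic transition at that sentence."""
    logits = H @ W + b                               # (n, 2) class scores
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)         # softmax normalization
    return [bool(p > 0.5) for p in probs[:, 1]]

rng = np.random.default_rng(2)
n, h = 6, 10                                         # 6 sentences, 10-dim hidden vectors
H = rng.standard_normal((n, h))                      # stand-in for Bi-LSTM hidden states
W, b = rng.standard_normal((h, 2)), np.zeros(2)      # stand-in trained parameters
flags = predict_boundaries(H, W, b)
print(flags)                                         # one boolean per sentence
```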
Specifically, in step 1, words and sentences are taken as nodes to obtain word nodes and sentence nodes, respectively, and both are vectorized. The vectorization of word nodes uses a word embedding method, which alleviates the cold-start problem and brings more accurate semantic information; the vectorization of sentence nodes uses the max-pooling method.
Specifically, the feature matrix of the structure graph in step 1 is X ∈ R^(n+ × m), where n+ represents the total number of word and sentence nodes and m represents the feature dimension of a node. The first n rows of X store the n sentence nodes, and word nodes are stored from row n+1 onward. Each sentence is expressed as

s_i = (w_{i,1}, w_{i,2}, …, w_{i,l_i})

where s_i denotes the word sequence of the i-th sentence and l_i denotes its length.
The structure graph in step 1 contains three types of edges: edges between word nodes, edges between word nodes and sentence nodes, and self-loop edges.

For edges between word nodes, the PMI (pointwise mutual information) index is used as the edge weight. Specifically, the model slides a fixed-size window across all sentences of the chapter to compute the PMI index between words. For a given word pair <i, j>, the PMI index is calculated as

PMI(i, j) = log( p(i, j) / ( p(i) p(j) ) ),  p(i) = #W(i) / #W,  p(i, j) = #W(i, j) / #W,

where #W denotes the total number of sliding windows, #W(i) denotes the number of windows in which word i appears, and #W(i, j) denotes the number of windows in which words i and j appear simultaneously; p(i) and p(j) are thus the occurrence frequencies of words i and j over all sliding windows, and p(i, j) is the frequency with which words i and j co-occur in a window. Since a negative PMI index indicates very low semantic relevance, the model only retains edges between word nodes whose PMI index is positive.
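The window counting behind the PMI index can be sketched as follows; the exact window handling (one window per position within each sentence) and the toy corpus are illustrative assumptions, not the patent's precise procedure:

```python
import numpy as np
from itertools import combinations

def pmi_scores(sentences, window=3):
    """Slide a fixed-size window over each sentence, count single and
    joint window occurrences, and return positive-PMI word pairs:
    PMI(i, j) = log( p(i, j) / (p(i) p(j)) )."""
    single, pair, total = {}, {}, 0
    for words in sentences:
        spans = [words[k:k + window] for k in range(max(1, len(words) - window + 1))]
        for span in spans:
            total += 1
            for w in set(span):
                single[w] = single.get(w, 0) + 1
            for a, b in combinations(sorted(set(span)), 2):
                pair[(a, b)] = pair.get((a, b), 0) + 1
    scores = {}
    for (a, b), n_ab in pair.items():
        # p(i,j)/(p(i)p(j)) simplifies to #W(i,j) * #W / (#W(i) * #W(j)).
        val = np.log(n_ab * total / (single[a] * single[b]))
        if val > 0:                     # only positive-PMI edges are kept
            scores[(a, b)] = val
    return scores

docs = [["graph", "network"], ["graph", "network"], ["topic", "split"]]
print(pmi_scores(docs))
```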
For an edge between a word node and a sentence node, there is a connecting edge with a weight of 1 between the sentence node and each word node contained in the sentence node.
For self-loop edges, every word node and sentence node carries a self-loop edge of weight 1, so that each node can attend to the information of its adjacent nodes while also retaining the information it has already learned.

The adjacency matrix is therefore expressed as:

A_{ij} = PMI(i, j)  if i and j are word nodes and PMI(i, j) > 0;
A_{ij} = 1          if one of i, j is a sentence node and the other is a word node it contains;
A_{ij} = 1          if i = j;
A_{ij} = 0          otherwise.
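The assembly of the three edge types can be sketched in numpy; the node indexing convention (sentence nodes first, then word nodes) follows the feature matrix described above, while the toy sizes and PMI values are illustrative assumptions:

```python
import numpy as np

def build_adjacency(n_sent, n_word, sent_words, pmi):
    """Assemble the (n+ x n+) adjacency matrix with the three edge
    types: weight-1 self-loops, weight-1 sentence-word edges, and
    positive-PMI word-word edges.  Sentence nodes occupy indices
    0..n_sent-1; word nodes follow."""
    size = n_sent + n_word
    A = np.eye(size)                              # self-loop edges
    for s, words in sent_words.items():           # sentence-word edges
        for w in words:
            A[s, n_sent + w] = A[n_sent + w, s] = 1.0
    for (i, j), val in pmi.items():               # word-word PMI edges
        A[n_sent + i, n_sent + j] = A[n_sent + j, n_sent + i] = val
    return A

# Hypothetical toy graph: 2 sentences over 3 word types.
sent_words = {0: [0, 1], 1: [1, 2]}               # sentence -> word indices
pmi = {(0, 1): 0.4, (1, 2): 0.7}                  # positive-PMI word pairs only
A = build_adjacency(2, 3, sent_words, pmi)
print(A.shape)  # (5, 5)
```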
specifically, the process of calculating the normalized symmetric adjacency matrix in step 2 is as follows:
and calculating a degree matrix according to the adjacency matrix:
and (3) calculating by adopting the degree matrix D and the adjacency matrix A to obtain a normalized symmetric adjacency matrix:
specifically, in step 3, the calculation process of iterating the structure map based on the gated graph neural network is as follows:
zt=σ(Wzat+Uzht-1+bz)
rt=σ(Wrat+Urht-1+br)
wherein h is0Where σ is a sigmoid function, all W, U and b are trainable parameters, and z and r represent update gates, respectively, to determine how much adjacency information contributes to the current node.
After one round of network iteration, a sentence node aggregates the information of its adjacent word nodes, and a word node aggregates the information of other adjacent word nodes with high semantic relevance. After two or more rounds of iteration, even sentence nodes that share no direct edge exchange information indirectly through their shared adjacent word nodes.
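One such GRU-style gated iteration can be sketched in numpy. This is a simplified sketch under stated assumptions: node states are rows (so parameter matrices multiply on the right), neighbourhood aggregation is taken as Ã·H, and the random matrices stand in for trained parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ggnn_step(A_hat, H, params):
    """One gated graph network iteration: aggregate neighbours through
    the normalized adjacency A_hat, then let the update gate z and
    reset gate r decide how much aggregated information enters each
    node's new state."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    a = A_hat @ H                                   # neighbour aggregation
    z = sigmoid(a @ Wz + H @ Uz + bz)               # update gate
    r = sigmoid(a @ Wr + H @ Ur + br)               # reset gate
    h_tilde = np.tanh(a @ Wh + (r * H) @ Uh + bh)   # candidate state
    return h_tilde * z + H * (1.0 - z)              # gated combination

rng = np.random.default_rng(3)
n_nodes, dim = 5, 8
A_hat = np.eye(n_nodes)                             # stand-in normalized adjacency
H = rng.standard_normal((n_nodes, dim))             # h_0 = feature matrix X
Ws = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(6)]
zeros = np.zeros(dim)
params = (Ws[0], Ws[1], zeros, Ws[2], Ws[3], zeros, Ws[4], Ws[5], zeros)
H1 = ggnn_step(A_hat, H, params)
print(H1.shape)  # (5, 8)
```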
Specifically, the model stores all sentence nodes of the chapter in the first n rows of the feature matrix X. After t iterations of the network, the model takes the first n rows of h_t as the sentence vector representation sequence (e_1, …, e_i, …, e_n):

e_i = (h_t)_i,  i = 1, …, n
To verify the feasibility of the topic segmentation method based on the discourse structure graph network, this embodiment further sets up a comparison experiment. A chapter titled "Pasteur park", comprising 24 sentences and introducing a park named Pasteur, is selected; it is divided into 5 segments of sentences, 0-4, 5-12, 13-14, 15-19, and 20-24, whose annotated topic labels are, respectively: history, Badweil parks and Vorixi valleys, commercial areas, traffic, and demographic data. Topic segmentation is compared among the topic segmentation method of this embodiment (DSG-SEG), the prior-art sequence model (TextSeg), and the prior-art global model (Hier.BERT); the segmentation results are shown in Table 1.
TABLE 1
Table 1 lists, from top to bottom: the correct segment division of the chapter, the segmentation predictions of the prior-art global model (Hier.BERT) and sequence model (TextSeg), and those of the topic segmentation method based on the discourse structure graph network of this embodiment (DSG-SEG).

As can be seen from the table, the segmentation produced by the method of this embodiment is closest to the actual paragraph division.
In addition, the experiment also records the parameter counts and running times of the three methods, shown in Table 2:
TABLE 2
As can be seen from the table: owing to its two-layer Bi-LSTM, the sequence model (TextSeg) has a large number of parameters and trains slowly; the global model (Hier.BERT), owing to the quadratic computational complexity of the transformer, still shows poor time performance on the topic segmentation task; in contrast, the topic segmentation method based on the discourse structure graph network (DSG-SEG) has the fewest parameters and the fastest speed, training 1.6 times faster than TextSeg and 4.6 times faster than Hier.BERT.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be understood that the above examples are only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. And obvious variations or modifications therefrom are within the scope of the invention.
Claims (10)
1. A topic segmentation method based on a discourse structure diagram network is characterized in that: the method comprises the following steps:
step 1: constructing a discourse structure graph with words and sentences as nodes, and obtaining the feature matrix and adjacency matrix of the graph;

step 2: performing an integrated calculation on the feature matrix and the adjacency matrix to obtain a normalized symmetric adjacency matrix;

step 3: iterating the structure graph with a gated graph neural network, and generating indirect information interaction between sentence nodes through shared adjacent word nodes to obtain a sequence of sentence vector representations;

step 4: predicting segmentation points over the sentence vector sequence based on a bidirectional long short-term memory network model.
2. The topic segmentation method based on the discourse structure graph network as claimed in claim 1, wherein: in step 1, the word and sentence are taken as nodes to respectively obtain word nodes and sentence nodes, and the vectorization processing is carried out on the word nodes and the sentence nodes.
3. The topic segmentation method based on the discourse structure graph network as claimed in claim 2, wherein: in step 1, the vectorization process for word nodes adopts a word embedding method.
4. The topic segmentation method based on the discourse structure graph network as claimed in claim 2, wherein: in step 1, the vectorization of sentence nodes adopts the max-pooling method.
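As an illustrative sketch of the max-pooling vectorization in claim 4 (the embeddings and their dimension are hypothetical, not taken from the patent):

```python
import numpy as np

def max_pool_sentence(word_vectors):
    # Sentence-node vector: element-wise maximum over the sentence's word vectors.
    return np.max(np.stack(word_vectors), axis=0)

# Hypothetical 4-dimensional word embeddings for a two-word sentence.
w1 = np.array([0.1, -0.5, 0.3, 0.0])
w2 = np.array([0.4, -0.2, 0.1, 0.2])
s = max_pool_sentence([w1, w2])  # -> [0.4, -0.2, 0.3, 0.2]
```

Each coordinate of the sentence vector keeps the largest value that any of its words produced in that dimension.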
5. The topic segmentation method based on the discourse structure graph network as claimed in claim 1, wherein: the feature matrix of the structure graph in step 1 is $X\in\mathbb{R}^{n^{+}\times m}$, wherein $n^{+}$ denotes the total number of word and sentence nodes and $m$ denotes the feature dimension of a node; the first $n$ rows of $X$ store the $n$ sentence nodes, and word nodes are stored from row $n+1$ onward. The feature matrix $X$ is expressed as:

$$X=\left[\,s_1;\ \dots;\ s_n;\ w_1;\ \dots;\ w_{n^{+}-n}\,\right]$$

where $s_i$ denotes the vector of the $i$-th sentence node and $w_j$ the vector of the $j$-th word node.
6. The topic segmentation method based on the discourse structure graph network as claimed in claim 1, wherein: the structure graph in step 1 includes three types of edges: edges between word nodes, edges between word nodes and sentence nodes, and self-loop edges;
for the edges between word nodes, the PMI index is adopted to measure the edge weight;
for the edges between word nodes and sentence nodes, each sentence node is connected by an edge of weight 1 to every word node contained in that sentence;
for the self-loop edge, all word nodes and sentence nodes are provided with a self-loop edge with the weight of 1;
the adjacency matrix is represented by:

$$A_{ij}=\begin{cases}\mathrm{PMI}(i,j), & i,j \text{ are both word nodes}\\ 1, & i \text{ is a sentence node and } j \text{ a word node it contains (or vice versa)}\\ 1, & i=j\\ 0, & \text{otherwise}\end{cases}$$
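A minimal sketch of how claim 6's three edge types could populate an adjacency matrix, with sentence nodes stored first as in claim 5 (the function name, toy graph, and PMI value are illustrative, not from the patent):

```python
import numpy as np

def build_adjacency(n_sent, n_word, sent_words, pmi):
    # Claim 6's three edge types: word-word edges weighted by PMI,
    # sentence-word edges weighted 1, and self-loop edges weighted 1.
    n = n_sent + n_word
    A = np.eye(n)                                   # self-loop edges, weight 1
    for s, words in enumerate(sent_words):
        for w in words:                             # sentence-word edges, weight 1
            A[s, n_sent + w] = A[n_sent + w, s] = 1.0
    for (i, j), v in pmi.items():                   # word-word edges, PMI weight
        A[n_sent + i, n_sent + j] = A[n_sent + j, n_sent + i] = v
    return A

# Toy graph: 2 sentences over a 3-word vocabulary; the PMI value is illustrative.
A = build_adjacency(2, 3, [[0, 1], [1, 2]], {(0, 1): 0.7})
```

The matrix is kept symmetric because every edge type in the claim is undirected.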
7. The topic segmentation method based on the discourse structure graph network as claimed in claim 6, wherein: the calculation formula of the PMI index is as follows:

$$\mathrm{PMI}(i,j)=\log\frac{p(i,j)}{p(i)\,p(j)},\qquad p(i)=\frac{\#W(i)}{\#W},\qquad p(i,j)=\frac{\#W(i,j)}{\#W}$$

where $\#W$ denotes the number of all sliding windows, $\#W(i)$ denotes the number of windows in which word $i$ appears, $\#W(i,j)$ denotes the number of windows in which words $i$ and $j$ appear simultaneously, $p(i)$ and $p(j)$ denote the frequencies with which word $i$ and word $j$ appear over all sliding windows, and $p(i,j)$ denotes the frequency with which words $i$ and $j$ co-occur over all sliding windows.
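The sliding-window statistics of claim 7 can be sketched as follows (the tokenization and window size are illustrative choices, not specified by the patent):

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(tokens, window=3):
    # Claim 7's statistics: p(i) = #W(i)/#W, p(i,j) = #W(i,j)/#W,
    # PMI(i,j) = log(p(i,j) / (p(i) * p(j))).
    windows = [tokens[k:k + window]
               for k in range(max(1, len(tokens) - window + 1))]
    n_w = len(windows)                        # #W: number of sliding windows
    single, joint = Counter(), Counter()
    for w in windows:
        uniq = sorted(set(w))
        single.update(uniq)                   # #W(i)
        joint.update(combinations(uniq, 2))   # #W(i, j)
    return {(i, j): math.log((c / n_w) / ((single[i] / n_w) * (single[j] / n_w)))
            for (i, j), c in joint.items()}

# Toy example over a 4-token text with window size 2.
scores = pmi_scores(["a", "b", "a", "c"], window=2)
```

Each window contributes at most one count per word (and per pair), which is what the `#W(i)` window counts in the claim require.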
8. The topic segmentation method based on the discourse structure graph network as claimed in claim 1, wherein: the process of calculating the normalized symmetric adjacency matrix in step 2 is:
calculating a degree matrix from the adjacency matrix: $D_{ii}=\sum_{j} A_{ij}$, with all off-diagonal entries zero;
then computing the normalized symmetric adjacency matrix from the degree matrix $D$ and the adjacency matrix $A$: $\tilde{A}=D^{-1/2} A\, D^{-1/2}$.
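Claim 8's two-step normalization, as a minimal sketch (the toy matrix is illustrative):

```python
import numpy as np

def normalize_adjacency(A):
    # Claim 8: D_ii = sum_j A_ij, then A_norm = D^{-1/2} A D^{-1/2}.
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    return d_inv_sqrt @ A @ d_inv_sqrt

# A 2-node all-ones adjacency: every degree is 2, so every entry becomes 0.5.
A = np.array([[1.0, 1.0],
              [1.0, 1.0]])
A_norm = normalize_adjacency(A)
```

Because self-loops give every node degree at least 1, the inverse square root is always well defined here.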
9. the topic segmentation method based on the discourse structure graph network as claimed in claim 1, wherein: in step 3, the calculation process of iterating the structure diagram based on the gated graph neural network is as follows:
$$z_t=\sigma(W_z a_t+U_z h_{t-1}+b_z)$$

$$r_t=\sigma(W_r a_t+U_r h_{t-1}+b_r)$$
wherein $h_0$ denotes the initial node state, $\sigma$ is the sigmoid function, all $W$, $U$ and $b$ are trainable parameters, and $z_t$ and $r_t$ denote the update gate and the reset gate respectively, which determine how much of the adjacency information contributes to the current node.
10. The topic segmentation method based on the discourse structure graph network as claimed in claim 1, wherein: in step 4, the sentence vector representation sequence is fed into a bidirectional long short-term memory network model for training; the model maps the hidden-layer vector of each sentence to a probability value between 0 and 1 through a fully connected layer and a normalization layer. If the probability value is less than 0.5, the current sentence is not a topic transition point; if the probability value is greater than 0.5, a topic transition point occurs at the current sentence.
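Claim 10's thresholding rule, sketched with a hypothetical fully connected layer over illustrative hidden states (the BiLSTM itself is omitted; all weights and states are invented for the example):

```python
import numpy as np

def predict_boundaries(hidden, W, b):
    # Claim 10's decision rule: a fully connected layer plus a sigmoid maps each
    # sentence's hidden vector to a probability; p > 0.5 marks a topic transition.
    logits = hidden @ W + b
    probs = 1.0 / (1.0 + np.exp(-logits))
    return probs, probs > 0.5

# Hypothetical hidden states for a 3-sentence sequence (2-dimensional).
H = np.array([[0.2, -0.1], [1.5, 2.0], [-0.5, -0.4]])
W, b = np.array([1.0, 1.0]), 0.0
probs, boundary = predict_boundaries(H, W, b)  # boundary -> [True, True, False]
```

The threshold 0.5 is the one stated in the claim; in practice it corresponds to taking the more probable of the two classes.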
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110384669.5A CN113190662A (en) | 2021-04-09 | 2021-04-09 | Topic segmentation method based on discourse structure diagram network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113190662A true CN113190662A (en) | 2021-07-30 |
Family
ID=76975411
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110384669.5A Pending CN113190662A (en) | 2021-04-09 | 2021-04-09 | Topic segmentation method based on discourse structure diagram network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113190662A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170278510A1 (en) * | 2016-03-22 | 2017-09-28 | Sony Corporation | Electronic device, method and training method for natural language processing |
CN111695341A (en) * | 2020-06-16 | 2020-09-22 | 北京理工大学 | Implicit discourse relation analysis method and system based on discourse structure diagram convolution |
CN112487189A (en) * | 2020-12-08 | 2021-03-12 | 武汉大学 | Implicit discourse text relation classification method for graph-volume network enhancement |
Non-Patent Citations (1)
Title |
---|
XU Shaoyang et al.: "Topic Segmentation Based on a Discourse Structure Graph Network", GITHUB *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114972366A (en) * | 2022-07-27 | 2022-08-30 | 山东大学 | Full-automatic segmentation method and system for cerebral cortex surface based on graph network |
CN114972366B (en) * | 2022-07-27 | 2022-11-18 | 山东大学 | Full-automatic segmentation method and system for cerebral cortex surface based on graph network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113392651B (en) | Method, device, equipment and medium for training word weight model and extracting core words | |
CN110619121B (en) | Entity relation extraction method based on improved depth residual error network and attention mechanism | |
CN113360646B (en) | Text generation method, device and storage medium based on dynamic weight | |
CN109471944A (en) | Training method, device and the readable storage medium storing program for executing of textual classification model | |
CN112100397A (en) | Electric power plan knowledge graph construction method and system based on bidirectional gating circulation unit | |
CN110516034A (en) | Blog management method, device, the network equipment and readable storage medium storing program for executing | |
CN113378573A (en) | Content big data oriented small sample relation extraction method and device | |
CN111191825A (en) | User default prediction method and device and electronic equipment | |
CN116245107B (en) | Electric power audit text entity identification method, device, equipment and storage medium | |
CN116383399A (en) | Event public opinion risk prediction method and system | |
CN113761868A (en) | Text processing method and device, electronic equipment and readable storage medium | |
CN113920379B (en) | Zero sample image classification method based on knowledge assistance | |
CN112417890B (en) | Fine granularity entity classification method based on diversified semantic attention model | |
US20220156489A1 (en) | Machine learning techniques for identifying logical sections in unstructured data | |
CN113190662A (en) | Topic segmentation method based on discourse structure diagram network | |
CN113268985A (en) | Relationship path-based remote supervision relationship extraction method, device and medium | |
CN116431816B (en) | Document classification method, apparatus, device and computer readable storage medium | |
CN113836308B (en) | Network big data long text multi-label classification method, system, device and medium | |
WO2023137918A1 (en) | Text data analysis method and apparatus, model training method, and computer device | |
CN115640399A (en) | Text classification method, device, equipment and storage medium | |
CN115906846A (en) | Document-level named entity identification method based on double-graph hierarchical feature fusion | |
CN113987536A (en) | Method and device for determining security level of field in data table, electronic equipment and medium | |
CN112528015B (en) | Method and device for judging rumor in message interactive transmission | |
CN115081609A (en) | Acceleration method in intelligent decision, terminal equipment and storage medium | |
CN112926340A (en) | Semantic matching model for knowledge point positioning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20210730 |