CN114398867B - Two-stage long text similarity calculation method - Google Patents
- Publication number: CN114398867B (application CN202210298133.6A)
- Authority: CN (China)
- Prior art keywords: sentence, similarity, text, long, similar
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F40/194 — Calculation of difference between files (G: Physics; G06: Computing, calculating or counting; G06F: Electric digital data processing; G06F40: Handling natural language data; G06F40/10: Text processing)
- G06F40/30 — Semantic analysis
Abstract
The invention discloses a two-stage long text similarity calculation method. In the first stage, the similar sentence detection stage, a sentence vector extraction model is constructed based on a deep learning model and converts text into sentence vectors, and similar sentence pairs of multiple similar types between the long texts are detected. In the second stage, the graph structure calculation stage, a basic similarity is calculated, and the long text similar sentence pairs and the basic similarity are represented as a similar sentence relation graph in which each node represents one long text. Operations on the graph yield high-level node representations that fuse group information, and the node feature information is updated so that the value of each dimension of a node feature vector is the text similarity between the corresponding long texts; the text similarity between the long texts is thereby obtained. The method makes long text similarity strongly interpretable and improves the effectiveness and precision of text processing.
Description
Technical Field
The invention relates to a text similarity calculation method, in particular to a two-stage long text similarity calculation method based on a deep learning model and a graph algorithm.
Background
Text similarity calculation is an important task in natural language processing, aiming to measure the similarity between texts by technical means. Texts of different lengths require different similarity calculation methods. Calculating the similarity of long texts requires extracting, compressing and matching a large amount of text information, and has important applications in news recommendation, article recommendation, citation recommendation, document clustering and the like.
In the prior art, methods based on keyword extraction are mostly adopted: a few keywords are extracted as representatives of a long text, and similarity is then calculated on them. Since the result depends on only a few keywords, a great deal of semantic information is lost and robustness is poor.
Methods based on deep learning models encode the full text with a deep learning model and then calculate full-text similarity. However, existing deep learning models achieve good encoding only on text sequences up to a few hundred words, whereas long texts such as books often run to tens or even hundreds of thousands of characters, which existing models cannot encode well. Moreover, since the similarity is calculated in a hidden space, interpretability is poor.
In addition, both technologies consider only information between the compared long texts; the calculation is relatively isolated and lacks utilization of group information.
Disclosure of Invention
The invention provides a two-stage long text similarity calculation method based on a deep learning model and a graph algorithm.
The principle of the invention is as follows: for a group of long texts, the first stage uses multiple detection methods to detect similar sentence pairs between the long texts; the second stage merges and summarizes the similar sentences according to the long texts in which they occur, abstractly represents each long text as a node on a graph, and performs inference and interaction operations on the graph so that information is transmitted between nodes, obtaining high-level node representations that fuse group information; finally, the text similarity between the long texts is obtained from the node features.
The technical scheme provided by the invention is as follows:
a two-stage long text similarity calculation method comprises the following steps:
in the first stage, the similar sentence detection stage:
constructing a sentence vector extraction model based on a deep learning model, the sentence vector extraction model comprising a semantic similarity detection model and a transcription similarity detection model;
converting the text into sentence vectors through the sentence vector extraction model;
detecting similar sentence pairs of multiple similar types between the long texts by using multiple detection methods;
in the second stage, the graph structure calculation stage:
calculating the basic similarity;
representing, based on a graph algorithm, the long text similar sentence pairs and the basic similarity as a similar sentence relation graph, each node of which represents one long text;
obtaining high-level node representations fusing group information through inference and interaction operations on the similar sentence relation graph;
updating the node feature information, wherein the value of each dimension of a node feature vector is the text similarity between the corresponding long texts;
obtaining, according to the node features, the text similarity between the long texts.
Further, before the similar sentence detection stage, the two-stage long text similarity calculation method first divides each long text into sentences; the sentence vector extraction model is obtained by fine-tuning a pre-trained language representation model (a BERT model or a RoBERTa model) through contrastive learning; sentence vectors of the long text sentences and clauses are then extracted respectively with the semantic similarity detection model and the transcription similarity detection model comprised in the sentence vector extraction model, so that the long texts are converted into sentence vectors.
Further, the sentence vector extraction model is obtained through the following steps:
11) performing sentence semantic similarity contrastive learning training and fine-tuning a BERT model to obtain the semantic similarity detection model; specifically:
processing the extracted sentence vectors with a discarding method (Dropout) to construct positive examples for contrastive learning;
taking the other sentence vectors in a training batch as negative examples for contrastive learning;
adopting as the training loss function a loss function computed from the sentence vectors and the constructed positive and negative examples;
naming the trained model the semantic similarity detection model;
12) performing sentence transcription similarity contrastive learning training and fine-tuning a BERT model to obtain the transcription similarity detection model; specifically:
extracting sentence vectors from the sentence texts;
dividing each sentence into clauses according to commas, and randomly selecting and shuffling the clauses in the sentence text to obtain a new sentence text; processing the sentence vectors extracted from the new sentence texts with the discarding method to construct positive examples for contrastive learning; taking the vectors extracted from the other sentence texts in a training batch as negative examples for contrastive learning;
the loss function for fine-tuning the BERT model comprises L1 and L2; L1 is the same loss function as that adopted in step 11); L2 is the loss function computed from the sentence vectors and the constructed positive and negative examples;
the final loss function L is: L = L1 + λ·L2; wherein λ is a hyperparameter that needs to be set, used to adjust the model's emphasis between sentence structure recombination and semantic differences;
the obtained model is named the transcription similarity detection model.
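The contrastive objective described in steps 11) and 12) follows the SimCSE design of in-batch positives and negatives. The following NumPy sketch shows only the loss computation; the function name `info_nce_loss`, the temperature `tau` and its default value are illustrative assumptions, and the actual method fine-tunes a BERT model rather than operating on fixed vectors.

```python
import numpy as np

def info_nce_loss(h, h_pos, tau=0.05):
    """SimCSE-style in-batch contrastive loss (illustrative sketch).

    h     : (B, D) anchor sentence vectors.
    h_pos : (B, D) dropout-augmented positives; for row i, every
            h_pos[j] with j != i serves as an in-batch negative.
    """
    # cosine similarities between anchors and all candidates
    h_n = h / np.linalg.norm(h, axis=1, keepdims=True)
    p_n = h_pos / np.linalg.norm(h_pos, axis=1, keepdims=True)
    sim = (h_n @ p_n.T) / tau                    # (B, B) scaled similarities
    # cross-entropy with the diagonal (the true positive) as the label
    sim = sim - sim.max(axis=1, keepdims=True)   # numerical stability
    log_softmax = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))
```

For step 12), the same loss form would be applied with the clause-shuffled sentence as the positive, and the two parts combined as L1 + λ·L2.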
Furthermore, the multiple detection methods of the first stage comprise detection methods for three types of similar sentence pairs: semantic similar sentence pairs, rephrase similar sentence pairs and local similar sentence pairs.
A. When detecting semantic similar sentence pairs, the following operations are executed:
A1. dividing each long text T_i into sentences according to punctuation marks representing sentence division;
A2. extracting the feature vectors of all sentences by using the semantic similarity detection model, denoted V_sem;
A3. deduplicating the sentence feature vectors V_sem to obtain V'_sem; for each feature vector, finding its TOPK similar vectors; recording all the obtained vector pairs as P_sem;
A4. computing the t-th percentile of the vector distances in P_sem as the similarity threshold d_sem;
A5. filtering out the pairs in P_sem whose feature vector distance is less than d_sem; the corresponding sentence pairs are the semantic similar sentence pairs;
B. when detecting rephrase similar sentence pairs, the following operations are executed:
B1. dividing each long text T_i into sentences according to punctuation marks representing sentence division;
B2. extracting the feature vectors of all sentences by using the transcription similarity detection model, denoted V_tra;
B3. deduplicating V_tra to obtain V'_tra; for each feature vector, finding its TOPK similar vectors; recording all the obtained vector pairs as P_tra;
B4. computing the t-th percentile of the vector distances in P_tra as the similarity threshold d_tra;
B5. filtering out the pairs in P_tra whose feature vector distance is less than d_tra; the corresponding sentence pairs are the rephrase similar sentence pairs;
C. when detecting local similar sentence pairs, the following operations are executed:
C1. dividing each long text T_i into sentences according to punctuation marks representing sentence division, and dividing the sentences into clauses according to commas;
C2. extracting the feature vectors of all clauses by using the semantic similarity detection model, denoted V_loc;
C3. deduplicating V_loc to obtain V'_loc; for each feature vector, finding its TOPK similar vectors; recording all the obtained vector pairs as P_loc;
C4. computing the t-th percentile of the vector distances in P_loc as the similarity threshold d_loc;
C5. filtering out the clause pairs in P_loc whose feature vector distance is less than d_loc;
C6. tracing the successfully matched clause pairs back to their corresponding sentence pairs, which are the local similar sentence pairs.
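Steps A3 to A5 (and analogously B3 to B5 and C3 to C5) can be sketched as below. This is a minimal NumPy illustration under stated assumptions: Euclidean distance between feature vectors, with `detect_similar_pairs`, `topk` and `t` as hypothetical names and parameters.

```python
import numpy as np

def detect_similar_pairs(vecs, topk=3, t=10.0):
    """TOPK candidate search plus t-th percentile threshold filtering.

    vecs : (n, d) deduplicated sentence (or clause) feature vectors.
    Returns index pairs whose distance falls below the t-th percentile
    of the candidate-pair distances.
    """
    diff = vecs[:, None, :] - vecs[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)          # pairwise distances
    np.fill_diagonal(dist, np.inf)                # exclude self-pairs

    cand = set()
    for i in range(len(vecs)):                    # TOPK neighbours per vector
        for j in np.argsort(dist[i])[:topk]:
            cand.add((min(i, int(j)), max(i, int(j))))
    cand = sorted(cand)

    d_thr = np.percentile([dist[i, j] for i, j in cand], t)
    return [(i, j) for i, j in cand if dist[i, j] < d_thr]
```

For the local similar type, matched clause pairs would then be traced back to their parent sentences as in step C6.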
Furthermore, after merging and summarizing the detection results of the three types of similar sentence pairs, the counts are normalized by the total length of the texts to obtain the basic similarity of the long texts.
Further, the basic similarity is calculated as follows:
given two long texts T_i and T_j, suppose m similar sentence pairs are detected between T_i and T_j; the basic similarity s_ij of the two long texts is calculated by normalizing m by the total number of sentences of the two texts.
Further, the long texts and their basic similarities are represented as a similar sentence relation graph G; each node v_i in the similar sentence relation graph represents one long text T_i, i = 1, …, N; the node feature is a one-hot vector whose dimensionality is the total number N of long texts; if similar sentences exist between two long texts T_i and T_j, an edge exists between the nodes corresponding to those long texts.
Further, information transmission and aggregation operations are performed twice on the relation graph to obtain new node feature information, which is then used to update the nodes. In the calculation, x_i and x_j denote the initial feature vector values of nodes v_i and v_j on graph G; w_1 and w_2 are user-defined weights of the first and second operations respectively, used to adjust the proportion of information aggregated in the two passes; x'_i and x'_j denote the feature vector values after the first update. Finally, the node feature vectors are obtained, wherein the j-th dimension of the feature vector of node v_i represents the text similarity between long text T_i and long text T_j.
Compared with the prior art, the invention has the following beneficial effects:
With the technical scheme provided by the invention, when calculating the similarity of long texts, each long text is divided into fine-grained sentences for encoding and comparison, making full use of the semantic information of the compared texts; the long texts are abstracted to nodes on a graph, and information propagation and aggregation on the graph let the node representations fuse group information; meanwhile, the similar sentences can be inspected visually. The method therefore makes long text similarity strongly interpretable and improves the effectiveness and precision of text processing.
Drawings
Fig. 1 is a block diagram of a two-stage process for calculating similarity of long texts according to the present invention.
FIG. 2 is a block flow diagram of the similar sentence detection stage of the method of the present invention.
FIG. 3 is a flow diagram of the graph structure calculation phase of the method of the present invention.
Detailed Description
The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.
The invention provides a two-stage long text similarity calculation method based on a deep learning model and a graph algorithm. For a group of long texts, the first stage detects similar sentence pairs between the long texts using multiple detection paths; the second stage aggregates the matched sentence pairs into a graph according to their sources, abstractly represents each long text as a node on the graph, and performs inference and interaction operations on the graph so that information is transmitted between nodes, obtaining high-level node representations that fuse group information; finally, the text similarity between the long texts is obtained from the node features.
Fig. 1 shows a process of calculating similarity of a long text based on two stages of a deep learning model and a graph algorithm according to the present invention. The method comprises the following steps:
the first stage is a similar sentence detection stage:
1) Construct a sentence vector extraction model based on a deep learning model (a BERT model or a RoBERTa model can be used), and extract sentence vectors for the sentences and clauses in the long texts.
2) Detect similar sentence pairs of the various types according to the similarity of the sentence vectors.
The second phase is the graph structure calculation phase:
3) Construct the similar sentence pairs into a graph structure according to their sources (the long texts in which they occur).
4) Perform information transmission and aggregation operations on the graph and update the node feature information.
The value of each dimension on the node feature vector corresponds to the text similarity between the corresponding long texts.
The method comprises the following concrete implementation steps:
1) The long texts are divided into sentences according to punctuation marks representing sentence division, and the sentences are divided into clauses according to the commas inside them. Sentence vectors of the sentences and clauses are extracted respectively with the sentence vector extraction models fine-tuned by contrastive learning (comprising the semantic similarity detection model and the transcription similarity detection model).
2) For the three types of sentence similarity (semantic similar, rephrase similar and local similar), detect similar sentence vectors according to the distance between sentence vectors to obtain the corresponding similar sentence pairs.
3) After merging and counting the detection results of the three types of similar sentence pairs, normalize the counts by the number of sentences in the long texts. Similar sentence pairs are aggregated into a graph according to their sources: each long text is represented as a node on the graph, and the weight of an edge reflects the number of similar sentence pairs between the two long texts.
4) Perform information transmission and aggregation operations twice on the graph, and update the node features after fusing group information.
The value of each dimension on the node feature vector corresponds to the text similarity between the corresponding long texts.
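The sentence and clause splitting in step 1) above can be sketched with regular expressions. The exact punctuation sets (Chinese and Western sentence-ending marks, and both comma forms for clauses) are assumptions, since the text only names "punctuation representing sentence division" and commas.

```python
import re

def split_sentences(text):
    """Split a long text into sentences on sentence-ending punctuation."""
    parts = re.split(r'(?<=[。！？.!?])\s*', text)
    return [p for p in parts if p.strip()]

def split_clauses(sentence):
    """Split a sentence into clauses on commas."""
    return [c.strip() for c in re.split(r'[，,]', sentence) if c.strip()]
```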
The invention is further illustrated by the following examples.
Example 1
Treat N electronic books in text format as N long texts T = {T_1, …, T_N}, and use the method provided by the invention to calculate the text similarity between every two long texts. The method comprises two stages, a similar sentence detection stage and a graph structure calculation stage (as shown in fig. 1).
1) Before detecting similar sentences, a sentence vector extraction model is constructed to convert text into sentence vectors. First, all long texts are segmented into sentences; then the sentence vector extraction model is obtained by fine-tuning, through contrastive learning, the pre-trained language representation model BERT (Bidirectional Encoder Representations from Transformers), or the RoBERTa model; the text sentences are then converted into sentence vectors through the sentence vector extraction model.
11) Fine-tune the BERT model through sentence semantic similarity contrastive learning training to obtain the semantic similarity detection model.
For each segmented sentence, a sentence vector is first extracted. The extracted sentence vector is processed with a drop method (Dropout) to construct the positive example for contrastive learning of that sentence, and the vectors extracted from the other sentences in the training batch are taken as its negative examples. The training loss function is computed from the sentence vector and the constructed positive and negative examples, with the same design as SimCSE (Simple Contrastive Learning of Sentence Embeddings). The trained model is named the semantic similarity detection model, denoted M_sem.
12) Fine-tune the BERT model through sentence transcription similarity contrastive learning training to obtain the transcription similarity detection model.
For each sentence, a sentence vector is first extracted. The loss function for fine-tuning the BERT model comprises two parts, L1 and L2; the loss function L1 is the same as in step 11). When computing L2, each sentence is divided into clauses according to commas, and clauses in the sentence text are randomly selected and shuffled to obtain a new sentence text. The sentence vector extracted from the new sentence text is processed with Dropout to construct the positive example for contrastive learning of that sentence, and vectors extracted from the other sentences in the training batch are taken as its negative examples. L2 is the loss function computed from the sentence vector and the constructed positive and negative examples, with the same design as SimCSE. The final training loss function is L = L1 + λ·L2,
where λ is a hyperparameter that needs to be set and adjusts the model's emphasis between sentence structure recombination and semantic differences. The obtained model is named the transcription similarity detection model, denoted M_tra.
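The structure-perturbing augmentation used when computing L2 (split a sentence into clauses, shuffle them, rejoin) can be sketched as follows. `shuffle_clauses` is an illustrative name, the seeded generator is only for reproducibility, and the random clause selection mentioned above is omitted from this sketch.

```python
import random
import re

def shuffle_clauses(sentence, rng=None):
    """Build the new sentence text of step 12): clauses split on commas
    are shuffled and rejoined, perturbing structure but not content."""
    rng = rng or random.Random(0)
    clauses = [c.strip() for c in re.split(r'[，,]', sentence) if c.strip()]
    rng.shuffle(clauses)
    return '，'.join(clauses)
```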
2) Detect similar sentence pairs among the long texts T (as shown in fig. 2; in a specific implementation, a detection method is designed for each of the three types of similar sentence pairs).
A. When detecting semantic similar sentence pairs, the following operations are executed:
A1. dividing each long text T_i into sentences according to punctuation marks representing sentence division;
A2. extracting the feature vectors of all sentences by using the semantic similarity detection model, denoted V_sem;
A3. deduplicating the sentence feature vectors V_sem to obtain V'_sem; for each feature vector, finding its TOPK similar vectors; recording all the obtained vector pairs as P_sem;
A4. computing the t-th percentile of the vector distances in P_sem as the similarity threshold d_sem;
A5. filtering out the pairs in P_sem whose feature vector distance is less than d_sem; the corresponding sentence pairs are the semantic similar sentence pairs;
B. when detecting rephrase similar sentence pairs, the following operations are executed:
B1. dividing each long text T_i into sentences according to punctuation marks representing sentence division;
B2. extracting the feature vectors of all sentences by using the transcription similarity detection model, denoted V_tra;
B3. deduplicating V_tra to obtain V'_tra; for each feature vector, finding its TOPK similar vectors; recording all the obtained vector pairs as P_tra;
B4. computing the t-th percentile of the vector distances in P_tra as the similarity threshold d_tra;
B5. filtering out the pairs in P_tra whose feature vector distance is less than d_tra; the corresponding sentence pairs are the rephrase similar sentence pairs;
C. when detecting local similar sentence pairs, the following operations are executed:
C1. dividing each long text T_i into sentences according to punctuation marks representing sentence division, and dividing the sentences into clauses according to commas;
C2. extracting the feature vectors of all clauses by using the semantic similarity detection model, denoted V_loc;
C3. deduplicating V_loc to obtain V'_loc; for each feature vector, finding its TOPK similar vectors; recording all the obtained vector pairs as P_loc;
C4. computing the t-th percentile of the vector distances in P_loc as the similarity threshold d_loc;
C5. filtering out the clause pairs in P_loc whose feature vector distance is less than d_loc;
C6. tracing the successfully matched clause pairs back to their corresponding sentence pairs, which are the local similar sentence pairs.
After the detected similar sentence pair result is obtained, the graph structure calculation stage is entered (as shown in fig. 3).
3) Merge and summarize the detection results of the three types of similar sentence pairs, then normalize the counts by the total length of the texts to obtain the basic similarity of the long texts. Specifically, suppose there are two long texts T_i and T_j, that m similar sentence pairs (covering the three similarity types) are detected between T_i and T_j, and that the total sentence counts of the two long texts are n_i and n_j; the basic similarity s_ij of the two long texts is calculated by normalizing m by the sentence counts n_i and n_j.
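The exact normalization formula is not reproduced above; the sketch below assumes one plausible choice, dividing the pair count m by the average sentence count of the two texts, and is not necessarily the patent's exact formula.

```python
def basic_similarity(m, n_i, n_j):
    """Assumed normalization: 2m / (n_i + n_j), i.e. the number of
    detected similar sentence pairs over the average sentence count."""
    return 2.0 * m / (n_i + n_j) if (n_i + n_j) else 0.0
```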
4) Abstractly represent the long texts and their basic similarities as a similar sentence relation graph G. Each node v_i in the similar sentence relation graph represents one long text T_i, i = 1, …, N; the node feature is a one-hot vector whose dimensionality is the total number N of long texts, so the long text T_i has the feature vector x_i = onehot(i). If similar sentences exist between two long texts T_i and T_j, an edge exists between their corresponding nodes, and the weight of the edge is given by the basic similarity s_ij.
5) Perform information transmission and aggregation operations twice on the relation graph to obtain new node feature information and update the nodes. In the calculation, w_1 and w_2 are user-defined weights of the first and second operations respectively, used to adjust the proportion of information aggregated in the two passes. Finally, the node feature vectors are obtained, wherein the j-th dimension of the feature vector of node v_i represents the text similarity between long text T_i and long text T_j.
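The two rounds of transmission and aggregation can be sketched as below; the residual-style update rule and the default weights are illustrative assumptions rather than the patent's exact formulas.

```python
import numpy as np

def propagate(A, w1=0.5, w2=0.5):
    """Two weighted transmission-and-aggregation passes on the similar
    sentence relation graph.

    A : (N, N) symmetric edge-weight matrix of basic similarities.
    Node features start as one-hot vectors (the identity matrix).
    """
    X = np.eye(A.shape[0])       # one-hot initial node features
    X1 = X + w1 * (A @ X)        # first pass: aggregate neighbour features
    X2 = X1 + w2 * (A @ X1)      # second pass: fuse group information
    return X2                    # X2[i, j]: similarity of texts i and j
```

The second pass lets a node see its neighbours' neighbours, which is how group information beyond the directly compared pair enters the final similarity.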
Calculating long text similarity with this method divides the long texts into fine-grained sentences for encoding and comparison, making full use of the semantic information of the compared texts; abstracts the long texts to nodes on a graph, where information propagation and aggregation let the node representations fuse group information; and allows similar sentences to be inspected visually, so that the long text similarity is strongly interpretable and the effectiveness and precision of text processing are improved.
It is noted that the disclosed embodiments are intended to aid further understanding of the invention, but those skilled in the art will appreciate that various alternatives and modifications are possible without departing from the spirit of the invention and the scope of the appended claims. Therefore, the invention should not be limited to the disclosed embodiments; the scope of protection is defined by the appended claims.
Claims (9)
1. A two-stage method for calculating similarity of long text is characterized in that,
in a first stage, a similar sentence detection stage, comprising:
11) constructing a sentence vector extraction model based on a deep learning model, wherein the sentence vector extraction model comprises a semantic similarity detection model and a transcription similarity detection model;
12) converting the text into sentence vectors through the sentence vector extraction model, and detecting, with multiple detection methods, similar sentence pairs of multiple similar types between the long texts, the similar types comprising: semantic similar sentence pairs, rephrase similar sentence pairs and local similar sentence pairs;
in the second stage, the graph structure calculation stage includes:
21) calculating to obtain basic similarity;
22) constructing a similar sentence relation graph structure according to the long text similar sentence pair and the basic similarity; each node on the similar sentence relation graph represents a long text; edges between the nodes represent that similar sentences exist between two long texts corresponding to the nodes;
23) performing information transmission and aggregation operations twice on the similar sentence relation graph to obtain high-level node representations fusing group information, thereby obtaining new node feature information and updating the nodes;
the value of each dimension of a node feature vector is the text similarity between the corresponding long texts; and the text similarity between the long texts is obtained according to the node features.
2. The two-stage long text similarity calculation method according to claim 1, wherein, before the similar sentence detection stage, each long text is first divided into sentences; the sentence vector extraction model is obtained by fine-tuning a pre-trained language representation model (a BERT model or a RoBERTa model) through contrastive learning; and sentence vectors of the long text sentences and clauses are extracted respectively through the semantic similarity detection model and the transcription similarity detection model comprised in the sentence vector extraction model, so that the long texts are converted into sentence vectors.
3. The two-stage long text similarity calculation method according to claim 2, further comprising obtaining a sentence vector extraction model by:
11) performing sentence semantic similarity comparison learning training, and fine-tuning a BERT model to obtain a semantic similarity detection model; the method comprises the following steps:
processing the sentence vectors obtained by extraction by adopting a discarding method, and constructing to obtain a positive example of comparative learning;
taking other sentence vectors in a training batch as negative examples of comparative learning;
The loss function used for training adopts a loss function calculated on the basis of a positive example and a negative example of a sentence vector sum structure;
naming the trained model as a semantic similarity detection model;
12) performing contrastive learning training on sentence rephrase similarity and fine-tuning a BERT model to obtain the transfer similarity detection model, as follows:
extracting sentence vectors from the sentence texts;
dividing each sentence into clauses at commas, then randomly reordering the clauses of the sentence text to obtain a new sentence text; applying dropout to the sentence vectors extracted from the new sentence texts to construct the positive examples for contrastive learning; taking the vectors extracted from the other sentence texts in the same training batch as the negative examples for contrastive learning;
the loss function for fine-tuning the BERT model comprises two terms: the same loss function as adopted in step 11), and a second loss computed from the sentence vectors and the positive and negative examples constructed above;
the final loss function is the sum of the step-11) loss and the second loss weighted by a hyper-parameter; the hyper-parameter is set as needed and adjusts how strongly the model emphasizes sentence structure reordering versus semantic difference;
the resulting model is named the transfer similarity detection model.
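The clause-reordering augmentation of step 12) can be illustrated as below. The function and parameter names are assumptions for illustration; the additive weighting in `combined_loss` follows the claim's description of one hyper-parameter balancing the two loss terms:

```python
import random

def shuffle_clauses(sentence, rng=None):
    # Split the sentence into clauses at commas, randomly reorder them,
    # and rejoin: the result keeps the content but changes the structure,
    # serving as the source of a positive example in step 12).
    rng = rng or random.Random()
    clauses = [c.strip() for c in sentence.split(",") if c.strip()]
    rng.shuffle(clauses)
    return ", ".join(clauses)

def combined_loss(semantic_loss, structural_loss, lam=0.5):
    # lam plays the role of the claim's hyper-parameter that balances
    # structural-reordering emphasis against semantic-difference emphasis.
    return semantic_loss + lam * structural_loss
```

Because the shuffled sentence contains exactly the same clauses as the original, a model trained to pull the two together learns to treat reordered sentences as similar.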
4. The two-stage long text similarity calculation method according to claim 1, wherein the plurality of detection methods in the first stage comprise three kinds of similar sentence pair detection: semantically similar sentence pairs, rephrase-similar sentence pairs, and locally similar sentence pairs.
5. The two-stage long text similarity calculation method according to claim 4, further characterized in that:
A. when detecting semantically similar sentence pairs, the following operations are performed:
A1. dividing each long text into sentences according to the punctuation marks that mark sentence boundaries;
A2. extracting the feature vectors of all sentences with the semantic similarity detection model;
A3. deduplicating the sentence feature vectors; for each feature vector, finding its TOPK most similar vectors, and recording all resulting vector pairs;
A4. computing the t-th percentile of the vector distances among the recorded pairs as the similarity threshold;
A5. filtering out the sentence pairs whose feature-vector distance is below the threshold; these are the semantically similar sentence pairs;
B. when detecting rephrase-similar sentence pairs, the following operations are performed:
B1. dividing each long text into sentences according to the punctuation marks that mark sentence boundaries;
B2. extracting the feature vectors of all sentences with the transfer similarity detection model;
B3. deduplicating the sentence feature vectors; for each feature vector, finding its TOPK most similar vectors, and recording all resulting vector pairs;
B4. computing the t-th percentile of the vector distances among the recorded pairs as the similarity threshold;
B5. filtering out the sentence pairs whose feature-vector distance is below the threshold; these are the rephrase-similar sentence pairs;
C. when detecting locally similar sentence pairs, the following operations are performed:
C1. dividing each long text into sentences according to the punctuation marks that mark sentence boundaries, then dividing the sentences into clauses at commas;
C2. extracting the feature vectors of all clauses with the semantic similarity detection model;
C3. deduplicating the clause feature vectors; for each feature vector, finding its TOPK most similar vectors, and recording all resulting vector pairs;
C5. filtering out the clause pairs whose feature-vector distance is below the threshold, giving the successfully matched clause pairs;
C6. tracing the successfully matched clause pairs back to the sentence pairs they belong to; these are the locally similar sentence pairs.
C4. computing the t-th percentile of the vector distances among the recorded pairs as the similarity threshold;
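Branches A, B, and C above share one candidate routine: deduplicate the vectors, take TOPK neighbours for each, set the threshold at the t-th percentile of candidate distances, and keep the pairs below it. A minimal sketch under the assumption of Euclidean distance (the claim does not fix the metric); all names are illustrative:

```python
import math

def top_k_pairs(vectors, k=2):
    # For each (deduplicated) vector, find its k nearest neighbours by
    # Euclidean distance and collect the candidate pairs (i, j, distance).
    pairs = []
    for i, v in enumerate(vectors):
        nearest = sorted(
            (math.dist(v, u), j) for j, u in enumerate(vectors) if j != i
        )[:k]
        pairs.extend((i, j, d) for d, j in nearest)
    return pairs

def percentile_threshold(pairs, t=50):
    # The t-th percentile of the candidate distances serves as the
    # similarity threshold (simple nearest-rank percentile here).
    ds = sorted(d for _, _, d in pairs)
    idx = min(len(ds) - 1, int(round(t / 100 * (len(ds) - 1))))
    return ds[idx]

def filter_similar(pairs, threshold):
    # Keep only the candidate pairs whose distance is below the threshold.
    return [(i, j) for i, j, d in pairs if d < threshold]
```

Because the threshold is a percentile of the candidate distances themselves, it adapts to each document collection rather than requiring a fixed absolute cutoff.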
6. The two-stage long text similarity calculation method according to claim 5, wherein after the detection results of the three types of similar sentence pairs are merged, the count is normalized by the total length of the texts to obtain the basic similarity of the long texts.
7. The two-stage long text similarity calculation method according to claim 6, wherein the basic similarity is calculated as follows:
given two long texts, the sentences in each text that are similar to a sentence of the other are detected; the basic similarity of the two long texts is then computed by normalizing the number of matched sentences by the total length of the two texts.
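The exact normalization formula of claim 7 was not recoverable from the extracted text. One plausible reading of "normalized by the total length", shown purely as an assumption with illustrative names, divides the number of matched sentences by the combined sentence count of the two texts:

```python
def basic_similarity(matched_a, matched_b, len_a, len_b):
    # Hypothetical normalization (the claim's formula did not survive
    # extraction): matched sentence counts in each text, divided by the
    # total number of sentences across both texts, giving a value in [0, 1].
    return (matched_a + matched_b) / (len_a + len_b)
```

Under this reading, two texts with no matched sentences score 0, and two texts in which every sentence has a similar counterpart score 1.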
8. The two-stage long text similarity calculation method according to claim 7, further comprising representing the long texts and their basic similarities as a similar sentence relation graph; each node in the similar sentence relation graph represents a long text; the node feature is a one-hot vector whose dimensionality is the total number N of the long texts; if similar sentences exist between two long texts, there is an edge between the nodes corresponding to those long texts.
9. The two-stage long text similarity calculation method according to claim 8, further comprising performing two rounds of information transmission and aggregation on the relation graph to obtain and update new node feature information, calculated as follows:
in each of the two rounds, every node aggregates the feature vectors of its neighbouring nodes; the aggregation weights of the first and the second round are set separately and adjust the proportion of information aggregated on the graph in each round; after the two updates, the final node feature vectors are obtained, in which the entry of one long text's feature vector at the position of another long text represents the text similarity between the two long texts.
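The two-round propagation of claim 9 can be sketched as follows, assuming simple additive neighbour aggregation with per-round weights (the claim does not specify the aggregation operator); all names are illustrative:

```python
def propagate(features, edges, weight):
    # One round of message passing: each node adds the weighted sum of its
    # neighbours' feature vectors to its own feature vector.
    n = len(features)
    neighbours = {i: [] for i in range(n)}
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    new = []
    for i in range(n):
        agg = [0.0] * n
        for j in neighbours[i]:
            for d in range(n):
                agg[d] += features[j][d]
        new.append([features[i][d] + weight * agg[d] for d in range(n)])
    return new

def two_stage_aggregation(n, edges, w1=0.5, w2=0.5):
    # Start from one-hot node features (dimension = number of long texts);
    # after two rounds, entry j of node i's feature vector can be read as
    # a similarity score between text i and text j.
    feats = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    return propagate(propagate(feats, edges, w1), edges, w2)
```

Starting from one-hot features means each dimension is tied to one text, so after propagation the value that text i accumulates in dimension j directly measures how strongly i and j are connected through shared similar sentences.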
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210298133.6A CN114398867B (en) | 2022-03-25 | 2022-03-25 | Two-stage long text similarity calculation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114398867A CN114398867A (en) | 2022-04-26 |
CN114398867B true CN114398867B (en) | 2022-06-28 |
Family
ID=81234598
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210298133.6A Active CN114398867B (en) | 2022-03-25 | 2022-03-25 | Two-stage long text similarity calculation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114398867B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117688138B (en) * | 2024-02-02 | 2024-04-09 | 中船凌久高科(武汉)有限公司 | Long text similarity comparison method based on paragraph division |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110196906A (en) * | 2019-01-04 | 2019-09-03 | 华南理工大学 | Deep-learning-based text similarity detection method for the financial industry
CN113486645A (en) * | 2021-06-08 | 2021-10-08 | 浙江华巽科技有限公司 | Text similarity detection method based on deep learning |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9892111B2 (en) * | 2006-10-10 | 2018-02-13 | Abbyy Production Llc | Method and device to estimate similarity between documents having multiple segments |
Non-Patent Citations (2)
Title |
---|
Miguel Feria et al. Constructing a Word Similarity Graph from Vector based Word Representation for Named Entity Recognition. arXiv, 2018, pp. 1-6. *
Wang Shuai et al. TP-AS: A Two-Stage Automatic Summarization Method for Long Texts. Journal of Chinese Information Processing, 2018, Vol. 32, No. 6, pp. 71-79. *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||