CN114398867A - Two-stage long text similarity calculation method - Google Patents
Two-stage long text similarity calculation method
- Publication number
- CN114398867A (application CN202210298133.6A)
- Authority
- CN
- China
- Prior art keywords
- sentence
- similarity
- text
- long
- similar
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a two-stage long text similarity calculation method. In the first stage, the similar sentence detection stage, a sentence vector extraction model is constructed based on a deep learning model and the texts are converted into sentence vectors; several types of similar sentence pairs are then detected between the long texts. In the second stage, the graph structure calculation stage, the basic similarity is calculated; the long-text similar sentence pairs and the basic similarity are expressed as a similar-sentence relation graph in which each node represents a long text; high-level node representations fusing group information are obtained through operations on the graph; and the node feature information is updated, the value of each dimension of a node feature vector being the text similarity between the corresponding long texts, so that the text similarity between the long texts is obtained. The method of the invention gives long-text similarity stronger interpretability and improves the effectiveness and precision of text processing.
Description
Technical Field
The invention relates to a text similarity calculation method, in particular to a two-stage long text similarity calculation method based on a deep learning model and a graph algorithm.
Background
Text similarity calculation is an important task of natural language processing, and related technologies aim to measure the degree of similarity between texts by using technical means. For texts with different lengths, different text similarity calculation methods need to be adapted. When the similarity of long texts is calculated, a large amount of text information needs to be extracted, compressed and matched for calculation, and the method has important application in the aspects of news recommendation, article recommendation, quotation recommendation, document clustering and the like.
In the prior art, methods based on keyword extraction are most common: a few keywords are extracted to represent a long text and then take part in further similarity calculation. Because the result depends on only a few keywords, such methods lose a large amount of semantic information and have poor robustness.
Methods based on deep learning models encode the full text with a deep learning model and then calculate the similarity of the encodings. However, existing deep learning models only encode well on text sequences of up to a few hundred words, while long texts such as books often run to tens or even hundreds of thousands of characters, which existing models cannot encode well. Moreover, since the similarity is computed in a hidden space, the interpretability is poor.
In addition, both kinds of technology consider only the information of the two long texts being compared; the calculation is relatively isolated and makes no use of group information.
Disclosure of Invention
The invention provides a two-stage long text similarity calculation method based on a deep learning model and a graph algorithm.
The principle of the invention is as follows: for a group of long texts, detecting to obtain similar sentence pairs between each long text by using a plurality of detection methods in a first stage; in the second stage, the similar sentence pairs are combined and summarized according to the long texts where the similar sentences are located, each long text is abstractly represented as a node on the graph, inference interactive operation on the graph is carried out, information is transmitted among the nodes, and high-level node representation fused with group information is obtained; and finally, according to the node characteristics, obtaining the text similarity between the long texts.
The technical scheme provided by the invention is as follows:
a two-stage long text similarity calculation method comprises the following steps:
in a first stage, a similar sentence detection stage, comprising:
constructing a sentence vector extraction model based on the deep learning model, wherein the sentence vector extraction model comprises a semantic similarity detection model and a transfer similarity detection model;
converting the text into sentence vectors through a sentence vector extraction model;
detecting, by using a plurality of detection methods, a plurality of similar sentence pairs of several similarity types between the long texts;
in a second stage graph structure computation stage, comprising:
calculating to obtain basic similarity;
based on a graph algorithm, expressing the long text similar sentence pairs and the basic similarity into a similar sentence relation graph; each node on the similar sentence relation graph represents a long text;
obtaining high-level node representation of the fusion group information through inference interactive operation of a similar sentence relation graph;
updating node feature information, wherein the value of each dimension on a node feature vector is the text similarity between corresponding long texts;
and obtaining the text similarity between the long texts according to the node characteristics.
Further, before the similar sentence detection stage, the two-stage long text similarity calculation method first divides each long text into sentences; the sentence vector extraction model is obtained by contrastive-learning fine-tuning of a pre-trained language representation model, BERT or RoBERTa; and sentence vectors of the long-text sentences and clauses are respectively extracted through the semantic similarity detection model and the transfer similarity detection model included in the sentence vector extraction model, so that the long texts are converted into sentence vectors.
Further, a sentence vector extraction model is obtained by the following steps:
11) performing sentence semantic similarity contrastive learning training and fine-tuning a BERT model to obtain the semantic similarity detection model; the method comprises:
applying dropout to each extracted sentence vector to construct its positive example for contrastive learning;
taking the sentence vectors of the other sentences in the training batch as negative examples for contrastive learning;
training with a loss function computed from the sentence vector and the constructed positive and negative examples;
the trained model is named the semantic similarity detection model;
12) performing sentence transfer similarity contrastive learning training and fine-tuning a BERT model to obtain the transfer similarity detection model; the method comprises:
extracting a sentence vector from each sentence text;
dividing each sentence into clauses at commas, then randomly selecting and shuffling the clauses in the sentence text to obtain a new sentence text; applying dropout to the sentence vector extracted from the new sentence text to construct the positive example for contrastive learning; taking the vectors extracted from the other sentence texts in the training batch as negative examples for contrastive learning;
the loss function for fine-tuning the BERT model comprises two parts, denoted here L_sem and L_trans; L_sem is the same loss function as that adopted in step 11); L_trans is the loss function computed from the sentence vector and the constructed positive and negative examples;
the final loss function is L = L_sem + λ·L_trans, where λ is a hyperparameter to be set that adjusts how much the model emphasizes sentence-structure reorganization versus semantic difference;
the resulting model is named the transfer similarity detection model.
Furthermore, the multiple detection methods of the first stage comprise three methods for detecting similar sentence pairs, corresponding to three types of similar sentence pairs: semantic similar sentence pairs, rephrase similar sentence pairs, and local similar sentence pairs.
A. when detecting semantically similar sentence pairs, the following operations are executed:
A1. dividing each long text into sentences at the punctuation marks that delimit sentences;
A2. extracting the feature vectors of all sentences with the semantic similarity detection model;
A3. deduplicating the sentence feature vectors; for each feature vector, finding its TOPK similar vectors, and recording all resulting vector pairs;
A4. computing the t-th percentile of the vector distances over the recorded pairs as the similarity threshold;
A5. retaining the pairs whose feature-vector distance is below the threshold; the corresponding sentence pairs are the semantic similar sentence pairs;
B. when detecting sentence pairs of the rephrase similarity type, the following operations are executed:
B1. dividing each long text into sentences at the punctuation marks that delimit sentences;
B2. extracting the feature vectors of all sentences with the transfer similarity detection model;
B3. deduplicating the sentence feature vectors; for each feature vector, finding its TOPK similar vectors, and recording all resulting vector pairs;
B4. computing the t-th percentile of the vector distances over the recorded pairs as the similarity threshold;
B5. retaining the pairs whose feature-vector distance is below the threshold; the corresponding sentence pairs are the rephrase similar sentence pairs;
C. when detecting sentence pairs of the local similarity type, the following operations are executed:
C1. dividing each long text into sentences at the punctuation marks that delimit sentences, and dividing each sentence into clauses at commas;
C2. extracting the feature vectors of all clauses with the semantic similarity detection model;
C3. deduplicating the clause feature vectors; for each feature vector, finding its TOPK similar vectors, and recording all resulting vector pairs;
C4. computing the t-th percentile of the vector distances over the recorded pairs as the similarity threshold;
C5. retaining the pairs whose feature-vector distance is below the threshold to obtain the successfully matched clause pairs;
C6. tracing the successfully matched clause pairs back to the sentence pairs they belong to, which are the local similar sentence pairs.
Further, the detection results of the three types of similar sentences are merged and summarized, and the resulting counts are normalized according to the total length of the texts to obtain the basic similarity of the long texts.
Further, the basic similarity is calculated as follows:
given two long texts between which m similar sentence pairs have been detected, the basic similarity of the two long texts is obtained by normalizing m by the total numbers of sentences n1 and n2 of the two texts.
Further, the long texts and their basic similarities are expressed as a similar-sentence relation graph; each node in the graph represents one long text; the node feature is a one-hot vector whose dimensionality is the total number N of long texts; if similar sentences exist between two long texts, an edge exists between the corresponding nodes.
Further, two rounds of information propagation and aggregation are carried out on the relation graph to obtain new node feature information, which is then used to update the nodes. In each round, a node aggregates the feature vectors of its neighbors, starting from the initial one-hot features; user-defined weights for the first and second rounds adjust the proportion of information aggregated in each round, yielding the feature vectors after the first update and, finally, the node feature vectors in which the j-th component of the feature vector of node i represents the text similarity between long text i and long text j.
Compared with the prior art, the invention has the beneficial effects that:
With the technical scheme provided by the invention, the long texts are divided into fine-grained sentences for encoding and comparison when computing their similarity, so that the semantic information of the compared texts is fully utilized; each long text is abstracted to a node on a graph, and information propagation and aggregation on the graph let the node representations fuse group information; moreover, the detected similar sentences can be inspected visually. The similarity calculation method for long texts provided by the invention therefore gives long-text similarity stronger interpretability and improves the effectiveness and precision of text processing.
Drawings
Fig. 1 is a block diagram of a two-stage process for calculating similarity of long texts according to the present invention.
FIG. 2 is a block flow diagram of the similar sentence detection stage of the method of the present invention.
FIG. 3 is a flow diagram of the graph structure calculation phase of the method of the present invention.
Detailed Description
The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.
The invention provides a two-stage long text similarity calculation method based on a deep learning model and a graph algorithm. For a group of long texts, in a first stage, detecting similar sentence pairs between each long text by using a plurality of detection paths; in the second stage, the matched sentence pairs are aggregated into a graph according to the source of the sentence pairs, each long text is abstractly represented as a node on the graph, inference interactive operation on the graph is carried out, information is transmitted among the nodes, and high-level node representation fused with group information is obtained; and finally, according to the node characteristics, obtaining the text similarity between the long texts.
Fig. 1 shows a process of calculating similarity of a long text based on two stages of a deep learning model and a graph algorithm according to the present invention. The method comprises the following steps:
the first stage is a similar sentence detection stage:
1) and constructing a sentence vector extraction model based on a deep learning model (a BERT model or a RoBERTA model can be used), and extracting a sentence vector for sentences and clauses in the long text.
2) And detecting to obtain various types of similar sentence pairs according to the similarity of the sentence vectors.
The second phase is the graph structure calculation phase:
3) The similar sentence pairs are constructed into a graph structure according to their source (the long texts in which they occur).
4) And carrying out information transmission and aggregation operation on the graph, and updating the node characteristic information.
The value of each dimension on the node feature vector corresponds to the text similarity between the corresponding long texts.
The method comprises the following concrete implementation steps:
1) The long text is divided into sentences at the punctuation marks that delimit sentences, and each sentence is divided into clauses at commas. Sentence vectors of the sentences and clauses are extracted with the sentence vector extraction models (comprising the semantic similarity detection model and the transfer similarity detection model) fine-tuned by contrastive learning.
2) For the three sentence-similarity modes (semantic similarity, rephrase similarity and local similarity), similar sentence vectors are detected according to the distances between sentence vectors, yielding the corresponding similar sentence pairs.
3) The detection results of the three types of similar sentences are merged and counted, and the counts are normalized by the sentence numbers of the long texts. The similar sentence pairs are then aggregated into a graph according to their source texts: each long text is represented as a node on the graph, and the weight of an edge reflects the number of similar sentence pairs between the two long texts.
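The segmentation rule of step 1) can be sketched as follows, assuming both Chinese and Western punctuation; the helper `split_sentences` is a hypothetical illustration, not part of the patent:

```python
import re

def split_sentences(text):
    """Toy splitter matching the described rule: sentence-ending punctuation
    delimits sentences; commas delimit clauses within a sentence."""
    sentences = [s for s in re.split(r"[。！？.!?]", text) if s.strip()]
    clauses = [c for s in sentences for c in re.split(r"[，,]", s) if c.strip()]
    return sentences, clauses
```

A production implementation would also handle quotation marks, ellipses and abbreviations; the sketch only shows the two granularities the method operates on.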
4) And performing information transmission and aggregation operation twice on the graph, and updating the node characteristics after fusing group information.
The value of each dimension on the node feature vector corresponds to the text similarity between the corresponding long texts.
The invention is further illustrated by the following examples.
Example 1
Treat N electronic books in text format as N long texts, and use the method provided by the invention to calculate the text similarity between every pair of long texts. The method comprises two stages, a similar sentence detection stage and a graph structure calculation stage (as shown in fig. 1).
1) Before similar sentence detection, a sentence vector extraction model is constructed for converting text into sentence vectors. First, all long texts are divided into sentences; then the sentence vector extraction model is obtained by contrastive-learning fine-tuning of a pre-trained language representation model, BERT (Bidirectional Encoder Representations from Transformers) or RoBERTa; the text sentences are converted into sentence vectors through the sentence vector extraction model.
11) The BERT model is fine-tuned through sentence semantic similarity contrastive learning training to obtain the semantic similarity detection model.
For each segmented sentence, a sentence vector is first extracted. The extracted sentence vector is processed with dropout to construct the positive example for contrastive learning of that sentence, and the vectors extracted from the other sentence texts in the same training batch serve as its negative examples. The training loss function is computed from the sentence vector and the constructed positive and negative examples, with the same design as SimCSE (Simple Contrastive Learning of Sentence Embeddings). The trained model is named the semantic similarity detection model.
12) The BERT model is fine-tuned through sentence transfer similarity contrastive learning training to obtain the transfer similarity detection model.
For each sentence, a sentence vector is first extracted. The loss function for fine-tuning the BERT model comprises two parts, denoted here L_sem and L_trans, whose loss designs are the same. When computing L_trans, each sentence is divided into clauses at commas, and the clauses in the sentence text are randomly selected and shuffled to obtain a new sentence text. The sentence vector extracted from the new sentence text is processed with dropout to construct the positive example for contrastive learning of the sentence, and the vectors extracted from the other sentence texts in the training batch serve as negative examples. L_trans is computed from the sentence vector and the constructed positive and negative examples, again with the same design as SimCSE. The final training loss is:
L = L_sem + λ·L_trans
where λ is a hyperparameter to be set that adjusts how much the model emphasizes sentence-structure reorganization versus semantic difference. The resulting model is named the transfer similarity detection model.
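As an illustration, the SimCSE-style contrastive objective described above can be sketched in plain NumPy. This is a sketch under stated assumptions, not the patent's reference implementation: the function names are hypothetical, and the combination L = L_sem + λ·L_trans follows the reconstruction above.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.05):
    """SimCSE-style contrastive loss: each anchor's positive example is its
    dropout-augmented view; the other in-batch positives act as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sim = (a @ p.T) / temperature               # cosine similarity matrix
    sim -= sim.max(axis=1, keepdims=True)       # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_prob)))   # diagonal = matching pairs

def combined_loss(l_sem, l_trans, lam=0.5):
    """Assumed final objective: L = L_sem + lambda * L_trans."""
    return l_sem + lam * l_trans
```

In practice the vectors would come from the BERT encoder with dropout active, and the loss would be backpropagated through the encoder; the NumPy form only shows the shape of the computation.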
2) Similar sentence pairs are detected among the long texts (as shown in fig. 2; in the specific implementation, a detection method is designed for each of the three types of similar sentence pairs).
A. when detecting semantically similar sentence pairs, the following operations are executed:
A1. dividing each long text into sentences at the punctuation marks that delimit sentences;
A2. extracting the feature vectors of all sentences with the semantic similarity detection model;
A3. deduplicating the sentence feature vectors; for each feature vector, finding its TOPK similar vectors, and recording all resulting vector pairs;
A4. computing the t-th percentile of the vector distances over the recorded pairs as the similarity threshold;
A5. retaining the pairs whose feature-vector distance is below the threshold; the corresponding sentence pairs are the semantic similar sentence pairs;
B. when detecting sentence pairs of the rephrase similarity type, the following operations are executed:
B1. dividing each long text into sentences at the punctuation marks that delimit sentences;
B2. extracting the feature vectors of all sentences with the transfer similarity detection model;
B3. deduplicating the sentence feature vectors; for each feature vector, finding its TOPK similar vectors, and recording all resulting vector pairs;
B4. computing the t-th percentile of the vector distances over the recorded pairs as the similarity threshold;
B5. retaining the pairs whose feature-vector distance is below the threshold; the corresponding sentence pairs are the rephrase similar sentence pairs;
C. when detecting sentence pairs of the local similarity type, the following operations are executed:
C1. dividing each long text into sentences at the punctuation marks that delimit sentences, and dividing each sentence into clauses at commas;
C2. extracting the feature vectors of all clauses with the semantic similarity detection model;
C3. deduplicating the clause feature vectors; for each feature vector, finding its TOPK similar vectors, and recording all resulting vector pairs;
C4. computing the t-th percentile of the vector distances over the recorded pairs as the similarity threshold;
C5. retaining the pairs whose feature-vector distance is below the threshold to obtain the successfully matched clause pairs;
C6. tracing the successfully matched clause pairs back to the sentence pairs they belong to, which are the local similar sentence pairs.
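The common skeleton shared by the A, B and C detection passes (deduplicate, find TOPK neighbours, threshold at the t-th percentile, keep the closer pairs) can be sketched as follows. The Euclidean metric and the helper name `detect_similar_pairs` are assumptions for illustration; the patent does not fix the distance metric.

```python
import numpy as np

def detect_similar_pairs(vectors, k=3, t=10):
    """Sketch of steps A3 to A5: deduplicate the vectors, pair each vector
    with its top-K nearest neighbours, take the t-th percentile of the pair
    distances as the similarity threshold, and keep the closer pairs."""
    vecs = np.unique(np.asarray(vectors, dtype=float), axis=0)  # deduplicate
    n = len(vecs)
    if n < 2:
        return [], 0.0
    # Pairwise Euclidean distances (assumed metric).
    dists = np.linalg.norm(vecs[:, None, :] - vecs[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                 # never pair with itself
    candidates = []
    for i in range(n):
        for j in np.argsort(dists[i])[: min(k, n - 1)]:  # top-K neighbours
            candidates.append((i, int(j), float(dists[i, j])))
    threshold = float(np.percentile([d for _, _, d in candidates], t))
    kept = [(i, j) for i, j, d in candidates if d < threshold]
    return kept, threshold
```

For book-scale inputs, the brute-force distance matrix would be replaced by an approximate nearest-neighbour index; the logic of the percentile threshold stays the same.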
After the detected similar sentence pair result is obtained, the graph structure calculation stage is entered (as shown in fig. 3).
3) The detection results of the three types of similar sentences are merged and counted, and the value is normalized by the total text length to obtain the basic similarity of the long texts. Specifically, suppose there are two long texts between which m similar sentence pairs are detected (covering all three similarity types), and the total numbers of sentences in the two texts are n1 and n2 respectively; the basic similarity of the two long texts is computed by normalizing m by n1 and n2.
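The patent's normalization formula is not reproduced in the source text; a symmetric normalization of the matched-pair count by the combined sentence count is one plausible reading, sketched here as an assumption:

```python
def basic_similarity(m, n1, n2):
    """Hypothetical basic similarity: matched-pair count m normalized by the
    total sentence counts n1, n2 of the two long texts (assumed form)."""
    return 2.0 * m / (n1 + n2) if (n1 + n2) > 0 else 0.0
```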
4) The long texts and their basic similarities are abstractly represented as a similar-sentence relation graph. Each node in the graph represents one long text; the node feature is a one-hot vector whose dimensionality is the total number N of long texts, so each long text has its own feature vector.
If similar sentences exist between two long texts, an edge exists between the corresponding nodes, and the edge carries a weight.
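Building the relation graph from the basic similarities might look as follows. This is a sketch: `build_relation_graph` is a hypothetical helper, and using the basic similarity directly as the edge weight is an assumption, since the weight formula is not reproduced in the source text.

```python
import numpy as np

def build_relation_graph(base_sim):
    """base_sim: dict {(i, j): similarity} over pairs of the N long texts.
    Returns one-hot node features H (N x N) and a weighted adjacency W."""
    n = 1 + max(max(i, j) for i, j in base_sim)
    H = np.eye(n)                      # node feature: one-hot, dimension N
    W = np.zeros((n, n))
    for (i, j), s in base_sim.items():
        W[i, j] = W[j, i] = s          # assumed: edge weight = basic similarity
    return H, W
```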
5) Two rounds of information propagation and aggregation are carried out on the relation graph to obtain new node feature information, which is then used to update the nodes. User-defined weights for the first and second rounds adjust the proportion of information aggregated in each round. Finally, the node feature vectors are obtained, in which the j-th component of the feature vector of node i represents the text similarity between long text i and long text j.
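The two rounds of propagation and aggregation can be sketched as follows. Since the patent's update equations are not reproduced in the source text, the residual form H plus a weighted neighbourhood aggregation W @ H is an assumption, with the per-round weights playing the role of the user-defined weights described above.

```python
import numpy as np

def propagate(H, W, alpha=0.5, beta=0.5):
    """Two assumed rounds of propagation/aggregation over the relation graph:
    each round adds a weighted sum of neighbour features to each node."""
    H1 = H + alpha * (W @ H)           # first round of aggregation
    H2 = H1 + beta * (W @ H1)          # second round of aggregation
    return H2
```

After propagation, entry (i, j) of the result is read as the text similarity between long texts i and j (up to normalization).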
With the method described above, the long texts are divided into fine-grained sentences for encoding and comparison, so that the semantic information of the compared texts is fully utilized; each long text is abstracted to a node on the graph, and information propagation and aggregation on the graph let the node representations fuse group information; the detected similar sentences can also be inspected visually, giving the long-text similarity strong interpretability and improving the effectiveness and precision of text processing.
It is noted that the disclosed embodiments are intended to aid in further understanding of the invention, but those skilled in the art will appreciate that: various alternatives and modifications are possible without departing from the invention and scope of the appended claims. Therefore, the invention should not be limited to the embodiments disclosed, but the scope of the invention is defined by the appended claims.
Claims (9)
1. A two-stage method for calculating the similarity of long texts is characterized in that,
in a first stage, a similar sentence detection stage, comprising:
11) constructing a sentence vector extraction model based on a deep learning model, wherein the sentence vector extraction model comprises a semantic similarity detection model and a transfer similarity detection model;
12) converting the text into sentence vectors through the sentence vector extraction model, and detecting multiple similar sentence pairs of several similarity types between the long texts by adopting a plurality of detection methods, the types comprising: semantic similar sentence pairs, rephrase similar sentence pairs and local similar sentence pairs;
in a second stage graph structure computation stage, comprising:
21) calculating to obtain basic similarity;
22) constructing a similar sentence relation graph structure according to the long text similar sentence pairs and the basic similarity; each node on the similar sentence relation graph represents a long text; edges between the nodes represent that similar sentences exist between two long texts corresponding to the nodes;
23) performing information transmission and aggregation operation twice on the similar sentence relation graph through the operation of the similar sentence relation graph to obtain high-level node representation of the fusion group information, thereby obtaining and updating new node characteristic information;
the value of each dimension on the node feature vector corresponds to the text similarity between the long texts; and obtaining the text similarity between the long texts according to the node characteristics.
2. The two-stage method for calculating similarity of long texts according to claim 1, wherein, before the similar sentence detection stage, each long text is first divided into sentences; the sentence vector extraction model is obtained by contrastive-learning fine-tuning of a pre-trained language representation model, BERT or RoBERTa; and sentence vectors of the long-text sentences and clauses are respectively extracted through the semantic similarity detection model and the transfer similarity detection model included in the sentence vector extraction model, so that the long texts are converted into sentence vectors.
3. The two-stage long text similarity calculation method according to claim 2, further comprising obtaining a sentence vector extraction model by:
11) performing sentence semantic similarity contrastive learning training and fine-tuning a BERT model to obtain the semantic similarity detection model; the method comprises:
applying dropout to each extracted sentence vector to construct its positive example for contrastive learning;
taking the sentence vectors of the other sentences in the training batch as negative examples for contrastive learning;
training with a loss function computed from the sentence vector and the constructed positive and negative examples;
the trained model is named the semantic similarity detection model;
12) performing sentence transfer similarity contrastive learning training and fine-tuning a BERT model to obtain the transfer similarity detection model; the method comprises:
extracting a sentence vector from each sentence text;
dividing each sentence into clauses at commas, then randomly selecting and shuffling the clauses in the sentence text to obtain a new sentence text; applying dropout to the sentence vector extracted from the new sentence text to construct the positive example for contrastive learning; taking the vectors extracted from the other sentence texts in the training batch as negative examples for contrastive learning;
the loss function for fine-tuning the BERT model comprises two parts, denoted here L_sem and L_trans; L_sem is the same loss function as that adopted in step 11); L_trans is the loss function computed from the sentence vector and the constructed positive and negative examples;
the final loss function is L = L_sem + λ·L_trans, where λ is a hyperparameter to be set that adjusts how much the model emphasizes sentence-structure reorganization versus semantic difference;
the resulting model is named the transfer similarity detection model.
4. The two-stage method for calculating similarity of long texts according to claim 1, wherein the plurality of detection methods in the first stage comprise three methods for detecting similar sentence pairs, namely semantic similar sentence pairs, rephrase similar sentence pairs and local similar sentence pairs.
5. The two-stage long text similarity calculation method according to claim 4, further comprising,
A. when detecting semantically similar sentence pairs, the following operations are executed:
A1. split each long text into sentences at the punctuation marks that delimit sentences;
A2. extract the feature vectors of all sentences with the semantic similarity detection model;
A3. de-duplicate the sentence feature vectors; for each feature vector, find its top-K most similar vectors, and record all resulting vector pairs;
A4. compute the t-th percentile of the vector distances among these pairs and take it as the similarity threshold;
A5. keep the pairs whose feature-vector distance is below the threshold; the corresponding sentence pairs are the semantically similar sentence pairs;
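Steps A3–A5 can be sketched as follows; the cosine-distance metric and the function signature are assumptions for illustration, as the patent does not reproduce its distance formula here:

```python
import numpy as np

def similar_pairs(vecs, top_k=3, t=10.0):
    """Sketch of A3-A5: de-duplicate vectors, take each vector's top-K
    nearest neighbours by cosine distance, then keep only pairs whose
    distance falls below the t-th percentile of candidate distances."""
    vecs = np.unique(np.asarray(vecs, dtype=float), axis=0)   # A3: de-duplicate
    normed = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    dist = 1.0 - normed @ normed.T                            # cosine distance
    np.fill_diagonal(dist, np.inf)                            # exclude self-matches
    pairs = set()
    for i in range(len(vecs)):
        for j in np.argsort(dist[i])[:top_k]:                 # top-K neighbours
            if int(j) != i:
                pairs.add((min(i, int(j)), max(i, int(j))))
    dists = {p: dist[p] for p in pairs}
    thresh = np.percentile(list(dists.values()), t)           # A4: percentile threshold
    return [p for p, d in dists.items() if d <= thresh]       # A5: filter
```

Note that `np.unique` sorts the rows, so returned indices refer to the de-duplicated, sorted vector set rather than the original sentence order; real bookkeeping would map indices back to sentences.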
B. when detecting paraphrase-similar sentence pairs, the following operations are executed:
B1. split each long text into sentences at the punctuation marks that delimit sentences;
B2. extract the feature vectors of all sentences with the paraphrase similarity detection model;
B3. de-duplicate the sentence feature vectors; for each feature vector, find its top-K most similar vectors, and record all resulting vector pairs;
B4. compute the t-th percentile of the vector distances among these pairs and take it as the similarity threshold;
B5. keep the pairs whose feature-vector distance is below the threshold; the corresponding sentence pairs are the paraphrase-similar sentence pairs;
C. when detecting locally similar sentence pairs, the following operations are executed:
C1. split each long text into sentences at the punctuation marks that delimit sentences, and further split each sentence into clauses at commas;
C2. extract the feature vectors of all clauses with the semantic similarity detection model;
C3. de-duplicate the clause feature vectors; for each feature vector, find its top-K most similar vectors, and record all resulting vector pairs;
C4. compute the t-th percentile of the vector distances among these pairs and take it as the similarity threshold;
C5. keep the clause pairs whose feature-vector distance is below the threshold;
C6. trace each successfully matched clause pair back to its corresponding sentence pair; these sentence pairs are the locally similar sentence pairs.
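The tracing step in C6 can be sketched as follows; `clause_owner` is a hypothetical bookkeeping structure (not named in the patent) recording which sentence each clause came from during splitting:

```python
def clause_to_sentence_pairs(clause_pairs, clause_owner):
    """Lift matched clause pairs to sentence pairs (step C6).
    clause_owner[i] gives the (text_id, sentence_id) of clause i."""
    sentence_pairs = set()
    for a, b in clause_pairs:
        sa, sb = clause_owner[a], clause_owner[b]
        if sa != sb:  # ignore matches between clauses of the same sentence
            sentence_pairs.add(tuple(sorted((sa, sb))))
    return sentence_pairs
```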
6. The two-stage long text similarity calculation method according to claim 5, wherein, after the detection results of the three types of similar sentence pairs are merged, the count is normalized by the total length of the texts to obtain the basic similarity of the long texts.
7. The two-stage long text similarity calculation method according to claim 6, wherein the basic similarity is calculated as follows:
given two long texts T1 and T2, if n similar sentence pairs are detected between T1 and T2, the basic similarity of the two long texts is obtained by normalizing n by the total length of T1 and T2.
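The exact normalization formula is carried by images omitted from this text; a minimal sketch, assuming the similar-pair count is divided by the combined length of the two texts:

```python
def basic_similarity(n_similar_pairs: int, len_a: int, len_b: int) -> float:
    # Hypothetical normalization for claim 7: the patent's exact
    # formula is in an image not reproduced in this extraction.
    total = len_a + len_b
    return n_similar_pairs / total if total else 0.0
```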
8. The two-stage long text similarity calculation method according to claim 7, further comprising representing the long texts and their basic similarities as a similar-sentence relationship graph; each node in the graph represents one long text; the node feature is a one-hot vector whose dimension equals the total number N of long texts; if similar sentences exist between two long texts, an edge exists between the nodes corresponding to those long texts;
9. The two-stage long text similarity calculation method according to claim 8, further comprising performing the information transmission and aggregation operation twice on the relationship graph to obtain new node feature information and updating the nodes; the two operations use self-defined weights, one for the first operation and one for the second, which adjust the scale of information aggregation on the graph in the two passes; after the second update, the final node feature vectors are obtained, in which the value of the j-th dimension of the feature vector of node i is the text similarity between long text i and long text j.
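The two-pass aggregation of claim 9 can be sketched as follows, under stated assumptions: the patent's exact update rule is in omitted formulas, so this sketch assumes each pass adds a weighted sum of neighbour features to each node's own feature, with edge weights given by the basic similarities and the pass weights named `alpha1`/`alpha2` for illustration:

```python
import numpy as np

def two_pass_aggregation(adj_weights, alpha1=0.5, alpha2=0.5):
    """Two rounds of information transmission/aggregation on the
    similar-sentence relationship graph.  Node features start as
    one-hot vectors (the identity matrix); adj_weights[i][j] is the
    basic similarity between long texts i and j (0 if no edge)."""
    A = np.asarray(adj_weights, dtype=float)
    n = A.shape[0]
    H = np.eye(n)                    # one-hot initial node features
    H = H + alpha1 * (A @ H)         # first transmission/aggregation pass
    H = H + alpha2 * (A @ H)         # second transmission/aggregation pass
    return H                         # H[i, j]: similarity of texts i and j
```

After the two passes, row i of `H` plays the role of the updated feature vector of node i, whose j-th entry is read off as the long-text similarity between texts i and j.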
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210298133.6A CN114398867B (en) | 2022-03-25 | 2022-03-25 | Two-stage long text similarity calculation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114398867A true CN114398867A (en) | 2022-04-26 |
CN114398867B CN114398867B (en) | 2022-06-28 |
Family
ID=81234598
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210298133.6A Active CN114398867B (en) | 2022-03-25 | 2022-03-25 | Two-stage long text similarity calculation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114398867B (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130054612A1 (en) * | 2006-10-10 | 2013-02-28 | Abbyy Software Ltd. | Universal Document Similarity |
CN110196906A (en) * | 2019-01-04 | 2019-09-03 | 华南理工大学 | Towards financial industry based on deep learning text similarity detection method |
CN113486645A (en) * | 2021-06-08 | 2021-10-08 | 浙江华巽科技有限公司 | Text similarity detection method based on deep learning |
Non-Patent Citations (2)
Title |
---|
MIGUEL FERIA et al.: "Constructing a Word Similarity Graph from Vector based Word Representation for Named Entity Recognition", arXiv, 9 July 2018 (2018-07-09), pages 1 - 6 * |
WANG SHUAI et al.: "TP-AS: a two-stage automatic summarization method for long text", Journal of Chinese Information Processing, vol. 32, no. 06, 30 June 2018 (2018-06-30), pages 71 - 79 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117688138A (en) * | 2024-02-02 | 2024-03-12 | 中船凌久高科(武汉)有限公司 | Long text similarity comparison method based on paragraph division |
CN117688138B (en) * | 2024-02-02 | 2024-04-09 | 中船凌久高科(武汉)有限公司 | Long text similarity comparison method based on paragraph division |
Also Published As
Publication number | Publication date |
---|---|
CN114398867B (en) | 2022-06-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
RU2628436C1 (en) | Classification of texts on natural language based on semantic signs | |
Necşulescu et al. | Reading between the lines: Overcoming data sparsity for accurate classification of lexical relationships | |
CN105512277B (en) | A kind of short text clustering method towards Book Market title | |
CN110222172B (en) | Multi-source network public opinion theme mining method based on improved hierarchical clustering | |
CN115393692A (en) | Generation formula pre-training language model-based association text-to-image generation method | |
CN101727462B (en) | Method and device for generating Chinese comparative sentence sorter model and identifying Chinese comparative sentences | |
CN107357895B (en) | Text representation processing method based on bag-of-words model | |
CN103473380A (en) | Computer text sentiment classification method | |
CN114398867B (en) | Two-stage long text similarity calculation method | |
CN114564563A (en) | End-to-end entity relationship joint extraction method and system based on relationship decomposition | |
CN114491062B (en) | Short text classification method integrating knowledge graph and topic model | |
Padmasundari et al. | Intent discovery through unsupervised semantic text clustering | |
CN112417082B (en) | Scientific research achievement data disambiguation filing storage method | |
Ghosh | Sentiment analysis of IMDb movie reviews: a comparative study on performance of hyperparameter-tuned classification algorithms | |
Shounak et al. | Reddit comment toxicity score prediction through bert via transformer based architecture | |
CN110674293B (en) | Text classification method based on semantic migration | |
CN109871429B (en) | Short text retrieval method integrating Wikipedia classification and explicit semantic features | |
Bergelid | Classification of explicit music content using lyrics and music metadata | |
CN108920475B (en) | Short text similarity calculation method | |
Yafooz et al. | Enhancing multi-class web video categorization model using machine and deep learning approaches | |
CN115098690A (en) | Multi-data document classification method and system based on cluster analysis | |
CN111899832B (en) | Medical theme management system and method based on context semantic analysis | |
Garg et al. | Identification of relations from IndoWordNet for indian languages using support vector machine | |
Jamil et al. | Topic identification method for textual document | |
CN113111288A (en) | Web service classification method fusing unstructured and structured information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||