CN114398867A - Two-stage long text similarity calculation method - Google Patents

Two-stage long text similarity calculation method

Info

Publication number
CN114398867A
Authority
CN
China
Prior art keywords
sentence
similarity
text
long
similar
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210298133.6A
Other languages
Chinese (zh)
Other versions
CN114398867B (en)
Inventor
段思宇
苏祺
王军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University
Priority to CN202210298133.6A
Publication of CN114398867A
Application granted
Publication of CN114398867B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/10: Text processing
    • G06F 40/194: Calculation of difference between files
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/30: Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a two-stage long text similarity calculation method. In the first stage, the similar sentence detection stage, a sentence vector extraction model is constructed based on a deep learning model and converts the texts into sentence vectors; several types of similar sentence pairs between the long texts are then detected. In the second stage, the graph structure calculation stage, a basic similarity is calculated, and the long-text similar sentence pairs and the basic similarities are expressed as a similar sentence relation graph in which each node represents a long text. High-level node representations that fuse group information are obtained through operations on the graph, and the node feature information is updated so that the value of each dimension of a node feature vector is the text similarity between the corresponding long texts; the text similarity between the long texts is thus obtained. The method of the invention makes long text similarity strongly interpretable and improves the effectiveness and precision of text processing.

Description

Two-stage long text similarity calculation method
Technical Field
The invention relates to a text similarity calculation method, in particular to a two-stage long text similarity calculation method based on a deep learning model and a graph algorithm.
Background
Text similarity calculation is an important task in natural language processing; the related techniques aim to measure the degree of similarity between texts. Texts of different lengths call for different similarity calculation methods. When calculating the similarity of long texts, a large amount of textual information must be extracted, compressed, and matched, and the task has important applications in news recommendation, article recommendation, citation recommendation, document clustering, and the like.
In the prior art, methods based on keyword extraction are mostly adopted: a few keywords are extracted to represent a long text and then take part in further similarity calculation. Because the calculation result depends on a few keywords, such methods lose a large amount of semantic information and have poor robustness.
Methods based on deep learning models use a deep learning model to encode the full text and then calculate full-text similarity. However, existing deep learning models only achieve good encoding on text sequences up to a few hundred words long, while long texts such as books often run to tens of thousands or even hundreds of thousands of characters, which existing models cannot encode well. Moreover, because the similarity calculation is performed in a hidden space, interpretability is poor.
In addition, both of these technologies consider only the information of the long texts being compared; each calculation is relatively isolated and makes no use of group information.
Disclosure of Invention
The invention provides a two-stage long text similarity calculation method based on a deep learning model and a graph algorithm.
The principle of the invention is as follows: for a group of long texts, the first stage uses several detection methods to detect similar sentence pairs between the long texts; the second stage merges and tallies the similar sentence pairs according to the long texts they come from, abstractly represents each long text as a node on a graph, and performs inference and interaction operations on the graph so that information propagates between nodes, yielding high-level node representations that fuse group information; finally, the text similarity between the long texts is obtained from the node features.
The technical scheme provided by the invention is as follows:
A two-stage long text similarity calculation method comprises the following steps.
In the first stage, the similar sentence detection stage:
constructing a sentence vector extraction model based on a deep learning model, the sentence vector extraction model comprising a semantic similarity detection model and a paraphrase similarity detection model;
converting the texts into sentence vectors through the sentence vector extraction model;
using several detection methods to detect similar sentence pairs of several types between the long texts.
In the second stage, the graph structure calculation stage:
calculating the basic similarity;
based on a graph algorithm, expressing the long-text similar sentence pairs and the basic similarities as a similar sentence relation graph, in which each node represents a long text;
obtaining high-level node representations that fuse group information through inference and interaction operations on the similar sentence relation graph;
updating the node feature information, where the value of each dimension of a node feature vector is the text similarity between the corresponding long texts;
obtaining the text similarity between the long texts from the node features.
Further, before the similar sentence detection stage, the two-stage long text similarity calculation method first splits each long text into sentences; the sentence vector extraction model is obtained by contrastive-learning fine-tuning of a pre-trained language representation model (a BERT model or a RoBERTa model); sentence vectors of the sentences and clauses of the long texts are extracted respectively by the semantic similarity detection model and the paraphrase similarity detection model comprised in the sentence vector extraction model, so that the long texts are converted into sentence vectors.
Further, the sentence vector extraction model is obtained by the following steps:
11) performing sentence semantic similarity contrastive learning training and fine-tuning a BERT model to obtain the semantic similarity detection model, comprising:
applying dropout to the extracted sentence vectors to construct the positive examples for contrastive learning;
taking the other sentence vectors in a training batch as negative examples for contrastive learning;
training with a loss function L_1 computed from the sentence vector and the constructed positive and negative examples;
naming the trained model the semantic similarity detection model;
12) performing sentence paraphrase similarity contrastive learning training and fine-tuning a BERT model to obtain the paraphrase similarity detection model, comprising:
extracting a sentence vector from the sentence text;
splitting each sentence into clauses at commas, and randomly selecting and shuffling clauses in the sentence text to obtain a new sentence text; applying dropout to the sentence vector extracted from the new sentence text to construct the positive example for contrastive learning; taking vectors extracted from the texts of the other sentences in the training batch as negative examples;
the loss function for fine-tuning the BERT model comprises L_1 and L_2; L_1 is the same loss function as that adopted in step 11); L_2 is the loss function computed from the sentence vector and the constructed positive and negative examples; the final loss function L is:
L = L_1 + λ · L_2
wherein λ is a hyper-parameter that must be set, used to adjust the degree of emphasis the model places on clause-structure recombination relative to semantic difference;
the obtained model is named the paraphrase similarity detection model.
Furthermore, the several detection methods of the first stage comprise detection methods for three types of similar sentence pairs: semantically similar sentence pairs, paraphrase-similar sentence pairs, and locally similar sentence pairs.
A. To detect semantically similar sentence pairs, the following operations are performed:
A1. split each long text T_i into sentences at sentence-ending punctuation;
A2. extract the feature vectors of all sentences with the semantic similarity detection model, denoted E;
A3. deduplicate the sentence feature vectors E to obtain E'; for each feature vector, find its top-K similar vectors, and record all resulting vector pairs as P;
A4. compute the t-th percentile of the vector distances in P as the similarity threshold θ;
A5. select from E the sentence pairs whose feature-vector distance is less than θ; these are the semantically similar sentence pairs.
B. To detect paraphrase-similar sentence pairs, the following operations are performed:
B1. split each long text T_i into sentences at sentence-ending punctuation;
B2. extract the feature vectors of all sentences with the paraphrase similarity detection model, denoted E_p;
B3. deduplicate the sentence feature vectors E_p to obtain E_p'; for each feature vector, find its top-K similar vectors, and record all resulting vector pairs as P_p;
B4. compute the t-th percentile of the vector distances in P_p as the similarity threshold θ_p;
B5. select from E_p the sentence pairs whose feature-vector distance is less than θ_p; these are the paraphrase-similar sentence pairs.
C. To detect locally similar sentence pairs, the following operations are performed:
C1. split each long text T_i into sentences at sentence-ending punctuation, and split the sentences into clauses at commas;
C2. extract the feature vectors of all clauses with the semantic similarity detection model, denoted E_c;
C3. deduplicate the clause feature vectors E_c to obtain E_c'; for each feature vector, find its top-K similar vectors, and record all resulting vector pairs as P_c;
C4. compute the t-th percentile of the vector distances in P_c as the similarity threshold θ_c;
C5. select from E_c the clause pairs whose feature-vector distance is less than θ_c;
C6. trace the successfully matched clause pairs back to their corresponding sentence pairs; these are the locally similar sentence pairs.
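A minimal sketch of the detection procedure (steps A3 to A5; the B and C paths follow the same pattern) is given below. It assumes cosine distance over L2-normalised vectors and a brute-force neighbour search; a production system would more likely use an approximate index such as Faiss, and the function name and defaults are illustrative.

    import numpy as np

    def detect_similar_pairs(vectors: np.ndarray, top_k: int = 5, t: float = 5.0):
        # vectors: deduplicated sentence feature matrix E', shape (n, d).
        # top_k:   number of nearest neighbours kept per vector (the top-K step).
        # t:       percentile of candidate distances used as the threshold.
        unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
        dist = 1.0 - unit @ unit.T            # cosine distance matrix
        np.fill_diagonal(dist, np.inf)        # never pair a sentence with itself
        # Candidate pair set P: each vector with its top-K nearest neighbours.
        neighbours = np.argsort(dist, axis=1)[:, :top_k]
        pairs = [(i, int(j)) for i in range(len(vectors)) for j in neighbours[i]]
        pair_dist = np.array([dist[i, j] for i, j in pairs])
        # Similarity threshold: the t-th percentile of candidate distances (A4).
        theta = float(np.percentile(pair_dist, t))
        # Keep the pairs closer than the threshold (A5).
        return [p for p, d in zip(pairs, pair_dist) if d < theta], theta

Because the threshold is taken as a percentile of the observed candidate distances, it adapts to each similarity type rather than being a fixed constant.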
Further, the detection results of the three types of similar sentences are merged and tallied, and the counts are then normalized by the total text length to obtain the basic similarity of the long texts.
Further, the basic similarity is calculated as follows:
given two long texts T_i and T_j, suppose m similar sentence pairs are detected between T_i and T_j; the basic similarity s_ij of the two long texts is then calculated as:
s_ij = 2m / (n_i + n_j)
wherein n_i and n_j are the total numbers of sentences in the two long texts, respectively.
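In code this normalisation is a one-liner; the Dice-style form below follows the formula reconstructed above (the original formula image is not recoverable, so the factor of 2 is an assumption).

    def base_similarity(m: int, n_i: int, n_j: int) -> float:
        # Basic similarity s_ij of two long texts: m detected similar sentence
        # pairs, normalised by the total sentence counts of both texts.
        return 2.0 * m / (n_i + n_j)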
Further, the long texts and their basic similarities are expressed as a similar sentence relation graph G; each node T_i in the similar sentence relation graph represents a long text, i ∈ {1, …, N}; the node feature is a one-hot vector whose dimensionality is the total number N of long texts; if similar sentences exist between two long texts T_i and T_j, an edge exists between the nodes corresponding to the long texts;
for a long text T_i there is the feature vector:
h_i^(0) = onehot(i)
and the weight of the edge between the nodes corresponding to two long texts T_i and T_j is:
w_ij = s_ij
wherein s_ij is the basic similarity.
Further, information propagation and aggregation are performed twice on the relation graph to obtain new node feature information h_i^(2), which is used for updating; the calculation is as follows:
h_i^(1) = α_1 · h_i^(0) + Σ_{j ∈ N(i)} w_ij · h_j^(0)
h_i^(2) = α_2 · h_i^(1) + Σ_{j ∈ N(i)} w_ij · h_j^(1)
wherein h_i^(0) and h_j^(0) are the initial feature vector values of nodes T_i and T_j on the graph G; α_1 and α_2 are self-defined weights of the first and second operations, respectively, used to adjust the proportion of information aggregated on the graph in the two passes; h_i^(1) and h_j^(1) are the feature vector values of nodes T_i and T_j after the first update. Finally the node feature vector h_i^(2) = (σ_i1, σ_i2, …, σ_iN) is obtained, wherein σ_ij represents the text similarity between long text T_i and long text T_j.
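The graph stage can be sketched end to end in a few lines of numpy under the reconstruction above: one-hot node features, edge weights equal to the basic similarities, and two propagation rounds with self-weights α_1 and α_2. The summation form of the aggregation is an assumption where the original formula images are missing.

    import numpy as np

    def graph_similarity(base_sim: np.ndarray, alpha1: float = 1.0,
                         alpha2: float = 1.0) -> np.ndarray:
        # base_sim: (N, N) symmetric matrix of basic similarities s_ij,
        #           with 0.0 where two texts share no similar sentences (no edge).
        # Returns h2, where h2[i, j] is the fused similarity of texts T_i and T_j.
        n = base_sim.shape[0]
        w = base_sim.copy()
        np.fill_diagonal(w, 0.0)          # no self-edges on the relation graph
        h0 = np.eye(n)                    # one-hot initial node features
        h1 = alpha1 * h0 + w @ h0         # first propagation round
        h2 = alpha2 * h1 + w @ h1         # second propagation round
        return h2

With α_1 = α_2 = 1, the diagonal preserves each text's own identity while the off-diagonal entries accumulate one-hop and two-hop similarity evidence, which is what lets the final similarities fuse group information.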
Compared with the prior art, the invention has the beneficial effects that:
With the technical scheme provided by the invention, long text similarity is calculated by splitting the long texts into fine-grained sentences for encoding and comparison, making full use of the semantic information of the compared texts; the long texts are abstracted into nodes on a graph, and information propagation and aggregation on the graph let the node representations fuse group information; meanwhile, the similar sentences can be inspected visually. The long text similarity calculation method provided by the invention therefore makes long text similarity strongly interpretable and improves the effectiveness and precision of text processing.
Drawings
Fig. 1 is a block diagram of a two-stage process for calculating similarity of long texts according to the present invention.
FIG. 2 is a block flow diagram of the similar sentence detection stage of the method of the present invention.
FIG. 3 is a flow diagram of the graph structure calculation phase of the method of the present invention.
Detailed Description
The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.
The invention provides a two-stage long text similarity calculation method based on a deep learning model and a graph algorithm. For a group of long texts, the first stage detects similar sentence pairs between the long texts using several detection paths; the second stage aggregates the matched sentence pairs into a graph according to their sources, abstractly represents each long text as a node on the graph, and performs inference and interaction operations on the graph so that information propagates between nodes, yielding high-level node representations that fuse group information; finally, the text similarity between the long texts is obtained from the node features.
Fig. 1 shows the two-stage process of calculating long text similarity based on a deep learning model and a graph algorithm according to the present invention. The method comprises the following steps.
The first stage is the similar sentence detection stage:
1) Construct a sentence vector extraction model based on a deep learning model (a BERT or RoBERTa model can be used) and extract sentence vectors for the sentences and clauses in the long texts.
2) Detect similar sentence pairs of several types according to the similarity of the sentence vectors.
The second stage is the graph structure calculation stage:
3) Build the similar sentence pairs into a graph structure according to their source long texts.
4) Perform information propagation and aggregation on the graph and update the node feature information.
The value of each dimension of a node feature vector is the text similarity between the corresponding long texts.
The concrete implementation steps are as follows:
1) Split each long text into sentences at sentence-ending punctuation, and split the sentences into clauses at commas. Extract sentence vectors of the sentences and clauses with the contrastively fine-tuned sentence vector extraction models (the semantic similarity detection model and the paraphrase similarity detection model).
2) For the three types of sentence similarity (semantic, paraphrase, and local), detect similar sentence vectors by vector distance to obtain the corresponding similar sentence pairs.
3) Merge and tally the detection results of the three types of similar sentences, then normalize the counts by the sentence counts of the long texts. Aggregate the similar sentence pairs into a graph by their sources: each long text is a node on the graph, and the weight of an edge reflects the normalized number of similar sentence pairs between two long texts.
4) Perform information propagation and aggregation twice on the graph and update the node features with fused group information.
The value of each dimension of a node feature vector is the text similarity between the corresponding long texts.
The invention is further illustrated by the following examples.
Example 1
N electronic books in text format are regarded as N long texts T_1, T_2, …, T_N. The method provided by the invention is used to calculate the text similarity between every two of the long texts. The method comprises two stages, the similar sentence detection stage and the graph structure calculation stage (as shown in fig. 1).
1) Before similar sentence detection, a sentence vector extraction model is constructed for converting texts into sentence vectors. First, all the long texts are split into sentences; then the sentence vector extraction model is obtained by contrastive-learning fine-tuning of a pre-trained language representation model, BERT (Bidirectional Encoder Representations from Transformers) or RoBERTa; the text sentences are converted into sentence vectors through the sentence vector extraction model.
11) Fine-tune a BERT model by sentence semantic similarity contrastive learning training to obtain the semantic similarity detection model.
For each segmented sentence, a sentence vector v is first extracted from the sentence. A dropout perturbation is applied to the extracted sentence vector v to construct the positive example for contrastive learning of the sentence, and vectors extracted from the texts of the other sentences in the training batch serve as negative examples. The training loss function is computed from the sentence vector and the constructed positive and negative examples, with the same design as SimCSE (Simple Contrastive Learning of Sentence Embeddings). The trained model is named the semantic similarity detection model and is denoted M_sem.
12) Fine-tune a BERT model by sentence paraphrase similarity contrastive learning training to obtain the paraphrase similarity detection model.
For each sentence, a sentence vector v is first extracted from the sentence. The loss function for fine-tuning the BERT model comprises two parts, L_1 and L_2. L_1 is the same as the loss function of M_sem. When computing L_2, each sentence is split into clauses at commas, and clauses in the sentence text are randomly selected and shuffled to obtain a new sentence text. A dropout perturbation is applied to the sentence vector v⁺ extracted from the new sentence text to construct the positive example for contrastive learning of the sentence, and vectors v⁻ extracted from the texts of the other sentences in the training batch serve as negative examples. L_2 is the loss function computed from the sentence vector v and the constructed positive example v⁺ and negative examples v⁻, with the same design as SimCSE. The final training loss function is:
L = L_1 + λ · L_2
wherein λ is a hyper-parameter that must be set, which adjusts the degree of emphasis the model places on clause-structure recombination relative to semantic difference. The obtained model is named the paraphrase similarity detection model and is denoted M_par.
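A minimal sketch of the positive-example construction in this step: split the sentence at commas (Chinese or Western), shuffle the clauses, and rejoin. Shuffling every clause is a simplification of the patent's "randomly select and shuffle", and the function name is illustrative.

    import random
    import re
    from typing import Optional

    def shuffle_clauses(sentence: str, rng: Optional[random.Random] = None) -> str:
        # Build the clause-shuffled variant of a sentence used as the positive
        # example when training the paraphrase similarity detection model.
        rng = rng or random.Random()
        clauses = [c.strip() for c in re.split(r"[,，]", sentence) if c.strip()]
        rng.shuffle(clauses)              # simplification: shuffle every clause
        return "，".join(clauses)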
2) Similar sentence pairs are detected among the long texts T (as shown in fig. 2; in the specific implementation, a detection method is designed for each of the three types of similar sentence pairs).
A. To detect semantically similar sentence pairs, the following operations are performed:
A1. split each long text T_i into sentences at sentence-ending punctuation;
A2. extract the feature vectors of all sentences with the semantic similarity detection model, denoted E;
A3. deduplicate the sentence feature vectors E to obtain E'; for each feature vector, find its top-K similar vectors, and record all resulting vector pairs as P;
A4. compute the t-th percentile of the vector distances in P as the similarity threshold θ;
A5. select from E the sentence pairs whose feature-vector distance is less than θ; these are the semantically similar sentence pairs.
B. To detect paraphrase-similar sentence pairs, the following operations are performed:
B1. split each long text T_i into sentences at sentence-ending punctuation;
B2. extract the feature vectors of all sentences with the paraphrase similarity detection model, denoted E_p;
B3. deduplicate the sentence feature vectors E_p to obtain E_p'; for each feature vector, find its top-K similar vectors, and record all resulting vector pairs as P_p;
B4. compute the t-th percentile of the vector distances in P_p as the similarity threshold θ_p;
B5. select from E_p the sentence pairs whose feature-vector distance is less than θ_p; these are the paraphrase-similar sentence pairs.
C. To detect locally similar sentence pairs, the following operations are performed:
C1. split each long text T_i into sentences at sentence-ending punctuation, and split the sentences into clauses at commas;
C2. extract the feature vectors of all clauses with the semantic similarity detection model, denoted E_c;
C3. deduplicate the clause feature vectors E_c to obtain E_c'; for each feature vector, find its top-K similar vectors, and record all resulting vector pairs as P_c;
C4. compute the t-th percentile of the vector distances in P_c as the similarity threshold θ_c;
C5. select from E_c the clause pairs whose feature-vector distance is less than θ_c;
C6. trace the successfully matched clause pairs back to their corresponding sentence pairs; these are the locally similar sentence pairs.
After the detected similar sentence pair result is obtained, the graph structure calculation stage is entered (as shown in fig. 3).
3) Merge and tally the detection results of the three types of similar sentences, and normalize the counts by the total text length to obtain the basic similarity of the long texts. Specifically, suppose there are two long texts T_i and T_j, and m similar sentence pairs (covering the three similarity types) are detected between T_i and T_j, while the total numbers of sentences in the two long texts are n_i and n_j. The basic similarity s_ij of the two long texts is calculated as:
s_ij = 2m / (n_i + n_j)
4) The long texts and their basic similarities are abstractly represented as a similar sentence relation graph G. Each node T_i in the similar sentence relation graph represents a long text, i ∈ {1, …, N}; the node feature is a one-hot vector whose dimensionality is the total number N of long texts. For a long text T_i there is the feature vector:
h_i^(0) = onehot(i)
If similar sentences exist between two long texts T_i and T_j, there is an edge between the nodes corresponding to the long texts, and the weight of the edge is:
w_ij = s_ij
wherein s_ij is the basic similarity calculated in the previous step.
5) Information propagation and aggregation are performed twice on the relation graph to obtain new node feature information h_i^(2), which is used for updating. The calculation is as follows:
h_i^(1) = α_1 · h_i^(0) + Σ_{j ∈ N(i)} w_ij · h_j^(0)
h_i^(2) = α_2 · h_i^(1) + Σ_{j ∈ N(i)} w_ij · h_j^(1)
wherein α_1 and α_2 are self-defined weights of the first and second operations, respectively, used to adjust the proportion of information aggregated on the graph in the two passes. Finally the node feature vector h_i^(2) = (σ_i1, σ_i2, …, σ_iN) is obtained, wherein σ_ij represents the text similarity between long text T_i and long text T_j.
Calculating long text similarity with this method splits the long texts into fine-grained sentences for encoding and comparison, making full use of the semantic information of the compared texts; it abstracts the long texts into nodes on a graph and, by propagating and aggregating information on the graph, lets the node representations fuse group information; and the similar sentences can be inspected visually, so the long text similarity is strongly interpretable and the effectiveness and precision of text processing are improved.
It is noted that the disclosed embodiments are intended to aid in further understanding of the invention, but those skilled in the art will appreciate that: various alternatives and modifications are possible without departing from the invention and scope of the appended claims. Therefore, the invention should not be limited to the embodiments disclosed, but the scope of the invention is defined by the appended claims.

Claims (9)

1. A two-stage long text similarity calculation method, characterized in that,
in a first stage, a similar sentence detection stage, the method comprises:
11) constructing a sentence vector extraction model based on a deep learning model, the sentence vector extraction model comprising a semantic similarity detection model and a paraphrase similarity detection model;
12) converting the texts into sentence vectors through the sentence vector extraction model, and detecting similar sentence pairs of several types between the long texts with several detection methods, comprising: semantically similar sentence pairs, paraphrase-similar sentence pairs, and locally similar sentence pairs;
in a second stage, a graph structure calculation stage, the method comprises:
21) calculating a basic similarity;
22) constructing a similar sentence relation graph structure from the long-text similar sentence pairs and the basic similarity; each node on the similar sentence relation graph represents a long text, and an edge between nodes represents that similar sentences exist between the two corresponding long texts;
23) performing information propagation and aggregation twice on the similar sentence relation graph through operations on the graph to obtain high-level node representations that fuse group information, thereby obtaining and updating new node feature information;
the value of each dimension of a node feature vector is the text similarity between the corresponding long texts; the text similarity between the long texts is obtained from the node features.
2. The two-stage long text similarity calculation method according to claim 1, wherein before the similar sentence detection stage, each long text is first split into sentences; the sentence vector extraction model is obtained by contrastive-learning fine-tuning of a pre-trained language representation model (a BERT model or a RoBERTa model); sentence vectors of the sentences and clauses of the long texts are extracted respectively by the semantic similarity detection model and the paraphrase similarity detection model comprised in the sentence vector extraction model, so that the long texts are converted into sentence vectors.
3. The two-stage long text similarity calculation method according to claim 2, wherein the sentence vector extraction model is obtained by the following steps:
11) performing sentence semantic similarity contrastive learning training and fine-tuning a BERT model to obtain the semantic similarity detection model, comprising:
applying dropout to the extracted sentence vectors to construct the positive examples for contrastive learning;
taking the other sentence vectors in a training batch as negative examples for contrastive learning;
training with a loss function L_1 computed from the sentence vector and the constructed positive and negative examples;
naming the trained model the semantic similarity detection model;
12) performing sentence paraphrase similarity contrastive learning training and fine-tuning a BERT model to obtain the paraphrase similarity detection model, comprising:
extracting a sentence vector from the sentence text;
splitting each sentence into clauses at commas, and randomly selecting and shuffling clauses in the sentence text to obtain a new sentence text; applying dropout to the sentence vector extracted from the new sentence text to construct the positive example for contrastive learning; taking vectors extracted from the texts of the other sentences in the training batch as negative examples;
the loss function for fine-tuning the BERT model comprises L_1 and L_2; L_1 is the same loss function as that adopted in step 11); L_2 is the loss function computed from the sentence vector and the constructed positive and negative examples; the final loss function L is:
L = L_1 + λ · L_2
wherein λ is a hyper-parameter that must be set, used to adjust the degree of emphasis the model places on clause-structure recombination relative to semantic difference;
the obtained model is named the paraphrase similarity detection model.
4. The two-stage long text similarity calculation method according to claim 1, wherein the several detection methods of the first stage comprise three detection methods for similar sentence pairs, namely for semantically similar sentence pairs, paraphrase-similar sentence pairs, and locally similar sentence pairs.
5. The two-stage long text similarity calculation method according to claim 4, further comprising:
A. to detect semantically similar sentence pairs, performing the following operations:
A1. splitting each long text T_i into sentences at sentence-ending punctuation;
A2. extracting the feature vectors of all sentences with the semantic similarity detection model, denoted E;
A3. deduplicating the sentence feature vectors E to obtain E'; for each feature vector, finding its top-K similar vectors, and recording all resulting vector pairs as P;
A4. computing the t-th percentile of the vector distances in P as the similarity threshold θ;
A5. selecting from E the sentence pairs whose feature-vector distance is less than θ; these are the semantically similar sentence pairs;
B. to detect paraphrase-similar sentence pairs, performing the following operations:
B1. splitting each long text T_i into sentences at sentence-ending punctuation;
B2. extracting the feature vectors of all sentences with the paraphrase similarity detection model, denoted E_p;
B3. deduplicating the sentence feature vectors E_p to obtain E_p'; for each feature vector, finding its top-K similar vectors, and recording all resulting vector pairs as P_p;
B4. computing the t-th percentile of the vector distances in P_p as the similarity threshold θ_p;
B5. selecting from E_p the sentence pairs whose feature-vector distance is less than θ_p; these are the paraphrase-similar sentence pairs;
C. to detect locally similar sentence pairs, performing the following operations:
C1. splitting each long text T_i into sentences at sentence-ending punctuation, and splitting the sentences into clauses at commas;
C2. extracting the feature vectors of all clauses with the semantic similarity detection model, denoted E_c;
C3. deduplicating the clause feature vectors E_c to obtain E_c'; for each feature vector, finding its top-K similar vectors, and recording all resulting vector pairs as P_c;
C4. computing the t-th percentile of the vector distances in P_c as the similarity threshold θ_c;
C5. selecting from E_c the clause pairs whose feature-vector distance is less than θ_c;
C6. tracing the successfully matched clause pairs back to their corresponding sentence pairs; these are the locally similar sentence pairs.
6. The two-stage long text similarity calculation method according to claim 5, wherein after the detection results of the three types of similar sentences are merged and tallied, the counts are normalized by the total text length to obtain the basic similarity of the long texts.
7. The two-stage long text similarity calculation method according to claim 6, wherein the basic similarity is calculated as follows:
given two long texts T_i and T_j, suppose m similar sentence pairs are detected between T_i and T_j; the basic similarity s_ij of the two long texts is then calculated as:
s_ij = 2m / (n_i + n_j)
wherein n_i and n_j are the total numbers of sentences in the two long texts, respectively.
8. The two-stage long text similarity calculation method according to claim 7, further comprising representing the long texts and their basic similarities as a similar sentence relation graph G; each node T_i in the similar sentence relation graph represents a long text, i ∈ {1, …, N}; the node feature is a one-hot vector whose dimensionality is the total number N of long texts; if similar sentences exist between two long texts T_i and T_j, an edge exists between the nodes corresponding to the long texts;
for a long text T_i there is the feature vector:
h_i^(0) = onehot(i)
and the weight of the edge between the nodes corresponding to two long texts T_i and T_j is:
w_ij = s_ij
wherein s_ij is the basic similarity.
9. The two-stage long text similarity calculation method according to claim 8, further comprising performing information propagation and aggregation twice on the relation graph to obtain new node feature information h_i^(2) and updating; the calculation is as follows:
h_i^(1) = α_1 · h_i^(0) + Σ_{j ∈ N(i)} w_ij · h_j^(0)
h_i^(2) = α_2 · h_i^(1) + Σ_{j ∈ N(i)} w_ij · h_j^(1)
wherein α_1 and α_2 are self-defined weights of the first and second operations, respectively, used to adjust the proportion of information aggregated on the graph in the two passes; h_i^(1) and h_j^(1) are the feature vector values of nodes T_i and T_j on the graph G after the first update; finally the node feature vector h_i^(2) = (σ_i1, σ_i2, …, σ_iN) is obtained, wherein σ_ij represents the text similarity between long text T_i and long text T_j.
CN202210298133.6A 2022-03-25 2022-03-25 Two-stage long text similarity calculation method Active CN114398867B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210298133.6A CN114398867B (en) 2022-03-25 2022-03-25 Two-stage long text similarity calculation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210298133.6A CN114398867B (en) 2022-03-25 2022-03-25 Two-stage long text similarity calculation method

Publications (2)

Publication Number Publication Date
CN114398867A true CN114398867A (en) 2022-04-26
CN114398867B (en) 2022-06-28

Family

ID=81234598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210298133.6A Active CN114398867B (en) 2022-03-25 2022-03-25 Two-stage long text similarity calculation method

Country Status (1)

Country Link
CN (1) CN114398867B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117688138A (en) * 2024-02-02 2024-03-12 中船凌久高科(武汉)有限公司 Long text similarity comparison method based on paragraph division

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130054612A1 (en) * 2006-10-10 2013-02-28 Abbyy Software Ltd. Universal Document Similarity
CN110196906A (en) * 2019-01-04 2019-09-03 华南理工大学 Towards financial industry based on deep learning text similarity detection method
CN113486645A (en) * 2021-06-08 2021-10-08 浙江华巽科技有限公司 Text similarity detection method based on deep learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130054612A1 (en) * 2006-10-10 2013-02-28 Abbyy Software Ltd. Universal Document Similarity
CN110196906A (en) * 2019-01-04 2019-09-03 华南理工大学 Towards financial industry based on deep learning text similarity detection method
CN113486645A (en) * 2021-06-08 2021-10-08 浙江华巽科技有限公司 Text similarity detection method based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MIGUEL FERIA et al.: "Constructing a Word Similarity Graph from Vector based Word Representation for Named Entity Recognition", arXiv, 9 July 2018 (2018-07-09), pages 1-6 *
WANG Shuai et al.: "TP-AS: A Two-Phase Automatic Summarization Method for Long Texts", Journal of Chinese Information Processing (中文信息学报), vol. 32, no. 06, 30 June 2018 (2018-06-30), pages 71-79 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117688138A (en) * 2024-02-02 2024-03-12 中船凌久高科(武汉)有限公司 Long text similarity comparison method based on paragraph division
CN117688138B (en) * 2024-02-02 2024-04-09 中船凌久高科(武汉)有限公司 Long text similarity comparison method based on paragraph division

Also Published As

Publication number Publication date
CN114398867B (en) 2022-06-28

Similar Documents

Publication Publication Date Title
RU2628436C1 (en) Classification of texts on natural language based on semantic signs
Necşulescu et al. Reading between the lines: Overcoming data sparsity for accurate classification of lexical relationships
CN105512277B (en) A kind of short text clustering method towards Book Market title
CN110222172B (en) Multi-source network public opinion theme mining method based on improved hierarchical clustering
CN115393692A (en) Generation formula pre-training language model-based association text-to-image generation method
CN101727462B (en) Method and device for generating Chinese comparative sentence sorter model and identifying Chinese comparative sentences
CN107357895B (en) Text representation processing method based on bag-of-words model
CN103473380A (en) Computer text sentiment classification method
CN114398867B (en) Two-stage long text similarity calculation method
CN114564563A (en) End-to-end entity relationship joint extraction method and system based on relationship decomposition
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
Padmasundari et al. Intent discovery through unsupervised semantic text clustering
CN112417082B (en) Scientific research achievement data disambiguation filing storage method
Ghosh Sentiment analysis of IMDb movie reviews: a comparative study on performance of hyperparameter-tuned classification algorithms
Shounak et al. Reddit comment toxicity score prediction through bert via transformer based architecture
CN110674293B (en) Text classification method based on semantic migration
CN109871429B (en) Short text retrieval method integrating Wikipedia classification and explicit semantic features
Bergelid Classification of explicit music content using lyrics and music metadata
CN108920475B (en) Short text similarity calculation method
Yafooz et al. Enhancing multi-class web video categorization model using machine and deep learning approaches
CN115098690A (en) Multi-data document classification method and system based on cluster analysis
CN111899832B (en) Medical theme management system and method based on context semantic analysis
Garg et al. Identification of relations from IndoWordNet for indian languages using support vector machine
Jamil et al. Topic identification method for textual document
CN113111288A (en) Web service classification method fusing unstructured and structured information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant