CN114398867B

CN114398867B - Two-stage long text similarity calculation method

Info

Publication number: CN114398867B
Application number: CN202210298133.6A
Authority: CN
Inventors: 段思宇; 苏祺; 王军
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2022-03-25
Filing date: 2022-03-25
Publication date: 2022-06-28
Anticipated expiration: 2042-03-25
Also published as: CN114398867A

Abstract

The invention discloses a two-stage long text similarity calculation method, which comprises the steps of constructing a sentence vector extraction model based on a deep learning model in a first stage similar sentence detection stage, and converting a text into a sentence vector; detecting to obtain multiple similar sentence pairs of similar types between each long text; in the second stage of graph structure calculation, the basic similarity is calculated; expressing the long text similar sentence pair and the basic similarity into a similar sentence relation graph; each node on the graph represents a long text; obtaining high-level node representation of the fusion group information through operation; updating node feature information, wherein the value of each dimension on a node feature vector is the text similarity between corresponding long texts; i.e. to obtain text similarity between long texts. The method of the invention can make the similarity of the long text have stronger interpretability and improve the effectiveness and the precision of text processing.

Description

Two-stage long text similarity calculation method

Technical Field

The invention relates to a text similarity calculation method, in particular to a two-stage long text similarity calculation method based on a deep learning model and a graph algorithm.

Background

Text similarity calculation is an important task of natural language processing, and related technologies aim to measure the similarity between texts by using technical means. For texts with different lengths, different text similarity calculation methods need to be adapted. When the similarity of long texts is calculated, a large amount of text information needs to be extracted, compressed and matched for calculation, and the method has important application in the aspects of news recommendation, article recommendation, quotation recommendation, document clustering and the like.

In the prior art, a method based on keyword extraction is mostly adopted, a few keywords are extracted to serve as representatives of long texts, and then further similarity calculation is carried out. Since the calculation result depends on a few keywords, a great deal of semantic information is lost by the method, and the robustness is poor.

The deep learning model-based method uses a deep learning model to encode the full text and then calculates the similarity of the full text. However, the existing deep learning model can only achieve a good coding effect on text sequences with the length within hundreds of words. Long texts like books often have tens of thousands of characters or even hundreds of thousands of characters, and the existing models cannot code well. Moreover, since the similarity calculation is performed in a hidden space, the interpretability is poor.

In addition, the two technologies only consider information between compared long texts, and the calculation process is relatively isolated and lacks of utilization of group information.

Disclosure of Invention

The invention provides a two-stage long text similarity calculation method based on a deep learning model and a graph algorithm.

The principle of the invention is as follows: for a group of long texts, in the first stage, detecting by using a plurality of detection methods to obtain similar sentence pairs among each long text; in the second stage, similar sentences are combined and summarized according to the long texts where the similar sentences are located, each long text is abstractly represented as a node on the graph, inference interactive operation on the graph is carried out, information is transmitted among the nodes, and high-level node representation fused with group information is obtained; and finally, obtaining the text similarity between the long texts according to the node characteristics.

The technical scheme provided by the invention is as follows:

a two-stage long text similarity calculation method comprises the following steps:

in a first stage, a similar sentence detection stage, comprising:

constructing a sentence vector extraction model based on the deep learning model, wherein the sentence vector extraction model comprises a semantic similarity detection model and a transcription similarity detection model;

Converting the text into sentence vectors through a sentence vector extraction model;

detecting to obtain similar sentence pairs of multiple similar types between each long text by using multiple detection methods;

in the second stage, the graph structure calculation stage includes:

calculating to obtain basic similarity;

based on a graph algorithm, representing the long text similar sentence pairs and the basic similarity into a similar sentence relation graph; each node on the similar sentence relation graph represents a long text;

obtaining high-level node representation of fusion group information through inference interactive operation of a similar sentence relation graph;

updating node feature information, wherein the value of each dimension on the node feature vector is the text similarity between corresponding long texts;

and according to the node characteristics, obtaining the text similarity between the long texts.

Further, before the similar sentence detection stage, the two-stage long text similarity calculation method firstly divides each long text into sentences; obtaining a sentence vector extraction model by comparing a learning fine-tuning pre-trained language representation model BERT model or a RoBERTA model; and respectively extracting sentence vectors of the long text sentences and the clauses through a semantic similarity detection model and a transfer similarity detection model which are included in the sentence vector extraction model, so that the long texts are converted into the sentence vectors.

Further, a sentence vector extraction model is obtained through the following steps:

11) performing sentence semantic similarity comparison learning training, and fine-tuning a BERT model to obtain a semantic similarity detection model; the method comprises the following steps:

processing the extracted sentence vectors by adopting a discarding method, and constructing a positive example of comparative learning;

taking other sentence vectors in a training batch as negative examples of comparative learning;

the loss function used for training adopts a loss function calculated on the basis of a positive example and a negative example of a sentence vector sum structure;

naming the trained model as a semantic similarity detection model;

12) performing sentence transferring similarity comparison learning training, and fine-tuning a BERT model to obtain a transferring similarity detection model; the method comprises the following steps:

extracting sentence vectors from the sentence text;

dividing each sentence into clauses according to commas, and randomly selecting and disordering clauses in the sentence text to obtain a new sentence text; adopting a discarding method to process sentence vectors extracted from the new sentence texts to construct a positive example of comparative learning; taking vectors extracted from texts of other sentences in a training batch as negative examples of comparative learning;

the loss function for the BERT model fine tuning comprises

And

；

the same loss function as that adopted in step 11); the calculation is based on positive and negative examples of sentence vector sum structure to obtain loss function

；

Final loss function

Comprises the following steps:

(ii) a Wherein,

the hyper-parameters are needed to be set and are used for adjusting the emphasis degree between sentence structure recombination and semantic difference of the model;

the obtained model is named as a similar detection transfer model.

Furthermore, the multiple detection methods in the first stage comprise three detection methods of similar sentence pairs; the three similar sentence pairs are: semantic similar sentence pairs, rephrase similar sentence pairs and local similar sentence pairs.

A. When detecting the semantically similar sentence pair, the following operations are executed:

A1. each long text is combined

Dividing the sentence into sentences according to punctuations representing sentence division;

A2. extracting the feature vectors of all sentences by using a semantic similarity detection model, and marking as

；

A3. Feature vector to sentence

De-duplication to obtain

(ii) a For each feature vector, finding its TOPK similar vectors; and all the obtained vector pairs are recorded as

；

A4. Computing

The t percentile of the medium vector distance is used as a similarity threshold value

；

A5. Filtering out

Distance of medium feature vector is less than

The sentence pairs are semantic similarity sentence pairs;

B. detecting the sentence pairs which are converted into similar types, and executing the following operations:

B1. each long text is combined

Punctuation divided by sentencesCutting into sentences;

B2. extracting feature vectors of all sentences by using a similar detection model of the transcription, and recording the feature vectors as

；

B3. Feature vector to sentence

De-repeating to obtain

(ii) a For each feature vector, finding TOPK similar vectors; all vectors obtained are pair-counted as

；

B4. Calculating out

；

B5. Filtering out of

Distance of medium feature vector is less than

The sentence pair is the sentence pair with similar type;

C. detecting sentence pairs of local similarity type, executing the following operations:

C1. each long text is combined

Dividing the sentence into sentences according to punctuations representing sentence division, and dividing the sentence into clauses according to commas;

C2. using semantic similarityExtracting feature vectors of all clauses by the detection model, and recording the feature vectors as

；

C3. Feature vector to sentence

De-duplication to obtain

(ii) a For each feature vector, finding its TOPK similar vectors; all vectors obtained are paired into

；

C4. Computing

；

C5. Filtering out

Distance of medium feature vector is less than

The clause pair of (1);

C6. and tracing the successfully matched clause pairs to corresponding sentence pairs, namely the sentence pairs with local similarity.

Furthermore, after the detection results are merged and summarized by the three types of similar sentences, the numerical values are subjected to standardization processing according to the total length of the text, and the basic similarity of the long text is obtained.

Further, the calculation of the basic similarity is:

is provided with two long texts

，

Detecting to detect

And

in

If the sentences are similar, the basic similarity of the two long texts

Calculated as follows:

wherein,

and

the total number of sentences in the two long texts respectively.

Further, the long text and the basic similarity thereof are expressed into a similar sentence relation graph

(ii) a Each node in the similar sentence relationship graph

Represents a long piece of text that is displayed in a long format,

(ii) a The node characteristic is a single hot vector, and the dimensionality of the vector is the total number N of the long texts; if two long texts

，

If similar sentences exist between the nodes, an edge exists between the nodes corresponding to the long text;

for long text

There are feature vectors:

；

two long texts

，

Weights of edges between corresponding nodes

；

Is the base similarity.

Further, information transmission and aggregation operation are carried out twice on the relational graph to obtain new node characteristic information

And updating; the calculation method is as follows:

wherein,

、

is shown in the figure

Upper node

、

An initial feature vector value of;

The weights are self-defined by first and second operations respectively and are used for adjusting the proportion of information aggregation on the two images;

、

is shown as a drawing

Upper node

、

The feature vector value after the first updating; finally, the node feature vector is obtained

Wherein, in the process,

represents a long text

And long text

The text similarity of (3).

Compared with the prior art, the invention has the following beneficial effects:

by utilizing the technical scheme provided by the invention, when the similarity of the long text is calculated, the long text is divided into fine-grained sentences for coding and comparing, and the semantic information of the compared text is fully utilized; abstracting the long text to nodes on the graph, and enabling the nodes to represent and fuse group information through information propagation and aggregation on the graph; meanwhile, similar sentences can be visually checked, the similarity calculation method of the long text provided by the invention can enable the similarity of the long text to have stronger interpretability, and the effectiveness and the precision of text processing are improved.

Drawings

Fig. 1 is a block diagram of a two-stage process for calculating similarity of long texts according to the present invention.

FIG. 2 is a block flow diagram of the similar sentence detection stage of the method of the present invention.

FIG. 3 is a flow diagram of the graph structure calculation phase of the method of the present invention.

Detailed Description

The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.

The invention provides a two-stage long text similarity calculation method based on a deep learning model and a graph algorithm. For a group of long texts, in a first stage, detecting similar sentence pairs between each long text by using a plurality of detection paths; in the second stage, the matched sentence pairs are aggregated into a graph according to the source of the sentence pairs, each long text is abstractly represented as a node on the graph, inference interactive operation on the graph is carried out, information is transmitted among the nodes, and high-level node representation fused with group information is obtained; and finally, obtaining the text similarity between the long texts according to the node characteristics.

Fig. 1 shows a process of calculating similarity of a long text based on two stages of a deep learning model and a graph algorithm according to the present invention. The method comprises the following steps:

the first stage is a similar sentence detection stage:

1) and constructing a sentence vector extraction model based on a deep learning model (a BERT model or a RoBERTA model can be used), and extracting a sentence vector for sentences and clauses in the long text.

2) And detecting to obtain various types of similar sentence pairs according to the similarity of the sentence vectors.

The second phase is the graph structure calculation phase:

3) the similar sentence pairs are constructed into a graph structure according to the source (in long text).

4) And carrying out information transmission and aggregation operation on the graph, and updating the node characteristic information.

The value of each dimension on the node feature vector corresponds to the text similarity between the corresponding long texts.

The method comprises the following concrete implementation steps:

1) the long text is divided into sentences according to punctuations representing the sentence division, and is divided into clauses according to commas inside the sentences. Sentence vectors of sentences and sub-sentences are respectively extracted by using sentence vector extraction models (including a semantic similarity detection model and a transcription similarity detection model) which are adjusted by contrast learning.

2) And detecting similar sentence vectors according to the distance of the sentence vectors aiming at three types of sentence text similar modes of semantic similar type, rephrase similar type and local similar type to obtain corresponding similar sentence pairs.

3) And after the detection results of the three types of similar sentences are merged and counted, the counting results are standardized by using the sentence number of the long text. Similar sentence pairs are aggregated into a graph by their source. Each long text represents a node on the graph, and the weight of an edge is the number of similar sentence pairs between two long texts.

4) And performing information transmission and aggregation operation twice on the graph, and updating the node characteristics after fusing group information.

The invention is further illustrated by the following examples.

Example 1

Regarding the electronic book with N text formats as N long texts

The method provided by the invention is used for calculating the text similarity between every two long texts. The method comprises two stages, a similar sentence detection stage and a graph structure calculation stage (as shown in fig. 1).

1) Before detecting similar sentences, a sentence vector extraction model is constructed and used for converting texts into sentence vectors. Firstly, all long texts are segmented into sentences; then, a sentence vector extraction model is obtained through a comparison learning fine-tuning pre-trained language Representation model BERT Bidirectional Encoder reproduction from transformations model (or RoBERTA model); converting the text sentence into a sentence vector through a sentence vector extraction model;

11) and finely adjusting the BERT model by performing sentence semantic similarity comparison learning training to obtain a semantic similarity detection model.

For each segmented sentence, first, a sentence vector is extracted from the sentence

. For sentence vector extracted from sentence

A positive example of the comparative learning of the sentence is constructed by performing a drop method (Dropout) process, and vectors extracted from the text of other sentences in a training batch are taken as negative examples of the comparative learning of the sentence. The training loss function adopts a loss function calculated based on positive and negative examples of Sentence vectors and constructions, and the design is the same as that of SimCSE (Simple contrast Learning of sequence Embeddings). Naming the trained model as a semantic similarity detection model, and recording as

。

12) And (4) finely adjusting the BERT model by carrying out sentence transfer similarity comparison learning training to obtain a transfer similarity detection model.

For each sentence, first, a sentence vector is extracted from the sentence

. The penalty function for the BERT model fine tuning comprises

And with

Two parts.

And

the loss function of (2) is the same. In the calculation of

And then, for each sentence, dividing the sentence into clauses according to commas, and randomly selecting and disordering the clauses in the sentence text to obtain a new sentence text. Sentence vector extracted from new sentence text

Performing Dropout processing to construct a positive example of contrast learning of the sentence, extracting vectors from text of other sentences in a training batch

As a negative example of comparative learning of this sentence.

Is based on sentence vectors

And positive examples of construction

And negative example

The design of the calculated loss function is the same as that of sentence embedding simple contrast learning SimCSE. The loss function for the final training is:

wherein,

is a hyper-parameter that needs to be set and which adjusts the degree of emphasis between model versus sentence structural reorganization and semantic differences. The obtained model is named as a similar detection model and is recorded as

。

2) Similar sentence pairs are detected among the long texts T (as shown in fig. 2, three similar sentence pairs are detected by designing a detection method for the three similar sentence pairs in specific implementation).

A1. each long text is combined

；

A3. Feature vector to sentence

De-duplication to obtain

；

A4. Computing

；

A5. Filtering out

Distance of medium feature vector is less than

The sentence pair is a semantic similarity sentence pair;

B. detecting sentence pairs which are converted into similar sentences, and executing the following operations:

B1. each long text is combined

B2. extracting the feature vectors of all sentences by using a transcription similarity detection model and recording the feature vectors as

；

B3. Feature vector to sentence

De-repeating to obtain

(ii) a For each feature vector, finding its TOPK similar vectors; all vectors obtained are pair-counted as

；

B4. Calculating out

；

B5. Filtering out

Distance of medium feature vector is less than

The sentence pair is the sentence pair with similar type;

C1. each long text is combined

C2. extracting the feature vectors of all the clauses by using a semantic similarity detection model and marking as

；

C3. Feature vector to sentence

De-duplication to obtain

；

C4. Computing

；

C5. Filtering out of

Distance of medium feature vector is less than

The clause pair of (1);

C6. and tracing the successfully matched clause pairs to corresponding sentence pairs, namely the sentence pairs are locally similar.

After the detected similar sentence pair result is obtained, the graph structure calculation stage is entered (as shown in fig. 3).

3) And merging and summarizing the detection results of the three types of similar sentences, and then carrying out standardization processing on numerical values according to the total length of the text to obtain the basic similarity of the long text. In particular, assume that there are two long texts

，

Detecting to detect

And

in

Each sentence is similar (including three types of similarity), and the total number of sentences in the two long texts is

And

the basic similarity of the two long texts

Calculated as follows:

4) abstract representation of long text and basic similarity thereof into similar sentence relation graph

. Each node in the similar sentence relationship graph

Represents a long piece of text that is displayed in a long format,

(ii) a The node characteristic is a single hot vector, and the dimensionality of the vector is the total number N of the long texts; for long text

There are feature vectors:

if two long texts

，

There is a similar sentence between them, there is an edge between the nodes corresponding to the long text, and the weight of the edge has:

is the basic similarity of the previous step of calculation;

5) Carrying out information transmission and aggregation operation twice on the relational graph to obtain new node characteristic information

And updated. The calculation method is as follows:

wherein,

the weights are self-defined by the first and second operations respectively and are used for adjusting the proportion of information aggregation on the two images. Finally, the node feature vector is obtained

Wherein, in the process,

represents a long text

And long text

The text similarity of (2).

The method is adopted to calculate the similarity of the long text, the long text is divided into fine-grained sentences for coding and comparison, and the semantic information of the compared text is fully utilized; abstracting the long text to nodes on the graph, and propagating and aggregating the information on the graph, wherein the nodes represent the fused group information; similar sentences can be visually checked, so that long texts have strong interpretability, and the effectiveness and the precision of text processing are improved.

It is noted that the disclosed embodiments are intended to aid in further understanding of the invention, but those skilled in the art will appreciate that: various alternatives and modifications are possible without departing from the invention and scope of the appended claims. Therefore, the invention should not be limited to the embodiments disclosed, but the scope of the invention is defined by the appended claims.

Claims

1. A two-stage method for calculating similarity of long text is characterized in that,

in a first stage, a similar sentence detection stage, comprising:

11) constructing a sentence vector extraction model based on a deep learning model, wherein the sentence vector extraction model comprises a semantic similarity detection model and a transcription similarity detection model;

12) converting the text into sentence vectors through the sentence vector extraction model, and detecting by adopting a plurality of detection methods to obtain a plurality of similar sentence pairs of similar types between each long text, wherein the steps of: semantic similar sentence pairs, rephrased similar sentence pairs and local similar sentence pairs;

in the second stage, the graph structure calculation stage includes:

21) calculating to obtain basic similarity;

22) constructing a similar sentence relation graph structure according to the long text similar sentence pair and the basic similarity; each node on the similar sentence relation graph represents a long text; edges between the nodes represent that similar sentences exist between two long texts corresponding to the nodes;

23) performing information transmission and aggregation operation twice on the similar sentence relation graph through the operation of the similar sentence relation graph to obtain high-level node representation of the fusion group information, thereby obtaining and updating new node characteristic information;

The value of each dimension on the node feature vector is the text similarity between corresponding long texts; and according to the node characteristics, obtaining the text similarity between the long texts.

2. The two-stage method for calculating similarity of long texts according to claim 1, wherein before the similar sentence detection stage, each long text is first divided into sentences; obtaining a sentence vector extraction model by comparing, learning, fine-tuning and pre-training a language representation model BERT model or RoBERTA model; and respectively extracting sentence vectors of long text sentences and clauses through a semantic similarity detection model and a transfer similarity detection model included by the sentence vector extraction model, so that the long texts are converted into the sentence vectors.

3. The two-stage long text similarity calculation method according to claim 2, further comprising obtaining a sentence vector extraction model by:

processing the sentence vectors obtained by extraction by adopting a discarding method, and constructing to obtain a positive example of comparative learning;

naming the trained model as a semantic similarity detection model;

extracting sentence vectors from the sentence text;

dividing each sentence into clauses according to commas, and randomly selecting and disordering the clauses in the sentence text to obtain a new sentence text; adopting a discarding method to process sentence vectors extracted from the new sentence text to construct a positive example of comparative learning; taking vectors extracted from texts of other sentences in a training batch as negative examples of comparative learning;

the penalty function for the BERT model fine tuning comprises

And

；

；

Final loss function

Comprises the following steps:

(ii) a Wherein,

the hyper-parameters are needed to be set and are used for adjusting the degree of emphasis between sentence structure recombination and semantic difference of the model;

the obtained model is named as a transfer similar detection model.

4. The two-stage method for calculating similarity of long texts according to claim 1, wherein the plurality of detection methods in the first stage comprise three methods for detecting similar sentence pairs, such as semantic similar sentence pair, rephrase similar sentence pair and local similar sentence pair.

5. The two-stage long text similarity calculation method according to claim 4, further comprising, in the step of,

A1. each long text is combined

A2. extracting the feature vectors of all sentences by using a semantic similarity detection model, and recording the feature vectors as

；

A3. Feature vector to sentence

De-repeating to obtain

(ii) a For each feature vector, finding TOPK similar vectors; and all the obtained vector pairs are recorded as

；

A4. Calculating out

；

A5. Filtering out

Distance of medium feature vector is less than

The sentence pairs are semantic similarity sentence pairs;

B1. each long text is combined

；

B3. Feature vector to sentence

De-duplication to obtain

；

B4. Computing

；

B5. Filtering out of

Distance of medium feature vector is less than

The sentence pair is a similar sentence pair;

C. and detecting sentence pairs of local similarity, and executing the following operations:

C1. each long text is combined

After the punctuation marks representing sentence division are divided into sentences, the sentences are divided into clauses according to commas;

C2. extracting the feature vectors of all the clauses by using a semantic similarity detection model, and recording the feature vectors as

；

C3. Feature vector to sentence

De-duplication to obtain

；

C4. Computing

；

C5. Filtering out

Distance of medium feature vector is less than

The clause pair of (1);

6. The two-stage method for calculating similarity of long texts as claimed in claim 5, wherein after the three types of similar sentences are merged and collected to the detection result, the value is standardized according to the total length of the text to obtain the basic similarity of the long text.

7. The two-stage method for calculating similarity of long texts as claimed in claim 6, further wherein the calculating of the basic similarity is:

Is provided with two long texts

，

Detecting to detect

And

in

If each sentence is similar, the basic similarity of two long texts

Calculated as follows:

wherein,

and

the total number of sentences in the two long texts.

8. The two-stage method for calculating similarity of long texts as claimed in claim 7, further comprising representing the long texts and their basic similarities as a relation graph of similar sentences

(ii) a Each node in the similar sentence relationship graph

Represents a long piece of text that is displayed in a long format,

，

for long text

There are feature vectors:

；

two long texts

，

Weights of edges between corresponding nodes

；

Is the base similarity.

9. The two-stage long text similarity calculation method according to claim 8, further comprising performing information transmission and aggregation operation twice on the relationship graph to obtain new node feature information

And updating; the calculation method is as follows:

wherein,

、

Is shown as a drawing

Upper node

、

Wherein, in the process,

represents a long text

And long text

The text similarity of (2).