CN114398867A - Two-stage long text similarity calculation method - Google Patents

Two-stage long text similarity calculation method

Info

Publication number
CN114398867A
Authority
CN
China
Prior art keywords
sentence
similarity
text
long
similar
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210298133.6A
Other languages
Chinese (zh)
Other versions
CN114398867B (en)
Inventor
段思宇
苏祺
王军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University
Priority to CN202210298133.6A
Publication of CN114398867A
Application granted
Publication of CN114398867B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/10: Text processing
    • G06F 40/194: Calculation of difference between files
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/30: Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a two-stage long text similarity calculation method. In the first stage, the similar sentence detection stage, a sentence vector extraction model is constructed based on a deep learning model and converts the texts into sentence vectors; several types of similar sentence pairs between the long texts are then detected. In the second stage, the graph structure calculation stage, a basic similarity is calculated, and the long-text similar sentence pairs and the basic similarities are expressed as a similar sentence relation graph in which each node represents a long text. High-level node representations that fuse group information are obtained through operations on the graph, and the node feature information is updated so that the value of each dimension of a node feature vector is the text similarity between the corresponding long texts; the text similarity between the long texts is thus obtained. The method of the invention makes long text similarity strongly interpretable and improves the effectiveness and precision of text processing.

Description

Two-stage long text similarity calculation method
Technical Field
The invention relates to a text similarity calculation method, in particular to a two-stage long text similarity calculation method based on a deep learning model and a graph algorithm.
Background
Text similarity calculation is an important task in natural language processing; the related techniques aim to measure the degree of similarity between texts. Texts of different lengths call for different similarity calculation methods. When calculating the similarity of long texts, a large amount of textual information must be extracted, compressed, and matched, and the task has important applications in news recommendation, article recommendation, citation recommendation, document clustering, and the like.
In the prior art, methods based on keyword extraction are mostly adopted: a few keywords are extracted to represent a long text and then take part in further similarity calculation. Because the calculation result depends on a few keywords, such methods lose a large amount of semantic information and have poor robustness.
Methods based on deep learning models use a deep learning model to encode the full text and then calculate full-text similarity. However, existing deep learning models only achieve good encoding on text sequences up to a few hundred words long, while long texts such as books often run to tens of thousands or even hundreds of thousands of characters, which existing models cannot encode well. Moreover, because the similarity calculation is performed in a hidden space, interpretability is poor.
In addition, both of these technologies consider only the information of the long texts being compared; each calculation is relatively isolated and makes no use of group information.
Disclosure of Invention
The invention provides a two-stage long text similarity calculation method based on a deep learning model and a graph algorithm.
The principle of the invention is as follows: for a group of long texts, the first stage uses several detection methods to detect similar sentence pairs between the long texts; the second stage merges and tallies the similar sentence pairs according to the long texts they come from, abstractly represents each long text as a node on a graph, and performs inference and interaction operations on the graph so that information propagates between nodes, yielding high-level node representations that fuse group information; finally, the text similarity between the long texts is obtained from the node features.
The technical scheme provided by the invention is as follows:
A two-stage long text similarity calculation method comprises the following steps.
In the first stage, the similar sentence detection stage:
constructing a sentence vector extraction model based on a deep learning model, the sentence vector extraction model comprising a semantic similarity detection model and a paraphrase similarity detection model;
converting the texts into sentence vectors through the sentence vector extraction model;
using several detection methods to detect similar sentence pairs of several types between the long texts.
In the second stage, the graph structure calculation stage:
calculating the basic similarity;
based on a graph algorithm, expressing the long-text similar sentence pairs and the basic similarities as a similar sentence relation graph, in which each node represents a long text;
obtaining high-level node representations that fuse group information through inference and interaction operations on the similar sentence relation graph;
updating the node feature information, where the value of each dimension of a node feature vector is the text similarity between the corresponding long texts;
obtaining the text similarity between the long texts from the node features.
Further, before the similar sentence detection stage, the two-stage long text similarity calculation method first splits each long text into sentences; the sentence vector extraction model is obtained by contrastive-learning fine-tuning of a pre-trained language representation model (a BERT model or a RoBERTa model); sentence vectors of the sentences and clauses of the long texts are extracted respectively by the semantic similarity detection model and the paraphrase similarity detection model comprised in the sentence vector extraction model, so that the long texts are converted into sentence vectors.
Further, the sentence vector extraction model is obtained by the following steps:
11) performing sentence semantic similarity contrastive learning training and fine-tuning a BERT model to obtain the semantic similarity detection model, comprising:
applying dropout to the extracted sentence vectors to construct the positive examples for contrastive learning;
taking the other sentence vectors in a training batch as negative examples for contrastive learning;
training with a loss function L_1 computed from the sentence vector and the constructed positive and negative examples;
naming the trained model the semantic similarity detection model;
12) performing sentence paraphrase similarity contrastive learning training and fine-tuning a BERT model to obtain the paraphrase similarity detection model, comprising:
extracting a sentence vector from the sentence text;
splitting each sentence into clauses at commas, and randomly selecting and shuffling clauses in the sentence text to obtain a new sentence text; applying dropout to the sentence vector extracted from the new sentence text to construct the positive example for contrastive learning; taking vectors extracted from the texts of the other sentences in the training batch as negative examples;
the loss function for fine-tuning the BERT model comprises L_1 and L_2; L_1 is the same loss function as that adopted in step 11); L_2 is the loss function computed from the sentence vector and the constructed positive and negative examples; the final loss function L is:
L = L_1 + λ · L_2
wherein λ is a hyper-parameter that must be set, used to adjust the degree of emphasis the model places on clause-structure recombination relative to semantic difference;
the obtained model is named the paraphrase similarity detection model.
Furthermore, the several detection methods of the first stage comprise detection methods for three types of similar sentence pairs: semantically similar sentence pairs, paraphrase-similar sentence pairs, and locally similar sentence pairs.
A. To detect semantically similar sentence pairs, the following operations are performed:
A1. split each long text T_i into sentences at sentence-ending punctuation;
A2. extract the feature vectors of all sentences with the semantic similarity detection model, denoted E;
A3. deduplicate the sentence feature vectors E to obtain E'; for each feature vector, find its top-K similar vectors, and record all resulting vector pairs as P;
A4. compute the t-th percentile of the vector distances in P as the similarity threshold θ;
A5. select from E the sentence pairs whose feature-vector distance is less than θ; these are the semantically similar sentence pairs.
B. To detect paraphrase-similar sentence pairs, the following operations are performed:
B1. split each long text T_i into sentences at sentence-ending punctuation;
B2. extract the feature vectors of all sentences with the paraphrase similarity detection model, denoted E_p;
B3. deduplicate the sentence feature vectors E_p to obtain E_p'; for each feature vector, find its top-K similar vectors, and record all resulting vector pairs as P_p;
B4. compute the t-th percentile of the vector distances in P_p as the similarity threshold θ_p;
B5. select from E_p the sentence pairs whose feature-vector distance is less than θ_p; these are the paraphrase-similar sentence pairs.
C. To detect locally similar sentence pairs, the following operations are performed:
C1. split each long text T_i into sentences at sentence-ending punctuation, and split the sentences into clauses at commas;
C2. extract the feature vectors of all clauses with the semantic similarity detection model, denoted E_c;
C3. deduplicate the clause feature vectors E_c to obtain E_c'; for each feature vector, find its top-K similar vectors, and record all resulting vector pairs as P_c;
C4. compute the t-th percentile of the vector distances in P_c as the similarity threshold θ_c;
C5. select from E_c the clause pairs whose feature-vector distance is less than θ_c;
C6. trace the successfully matched clause pairs back to their corresponding sentence pairs; these are the locally similar sentence pairs.
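A minimal sketch of the detection procedure (steps A3 to A5; the B and C paths follow the same pattern) is given below. It assumes cosine distance over L2-normalised vectors and a brute-force neighbour search; a production system would more likely use an approximate index such as Faiss, and the function name and defaults are illustrative.

    import numpy as np

    def detect_similar_pairs(vectors: np.ndarray, top_k: int = 5, t: float = 5.0):
        # vectors: deduplicated sentence feature matrix E', shape (n, d).
        # top_k:   number of nearest neighbours kept per vector (the top-K step).
        # t:       percentile of candidate distances used as the threshold.
        unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
        dist = 1.0 - unit @ unit.T            # cosine distance matrix
        np.fill_diagonal(dist, np.inf)        # never pair a sentence with itself
        # Candidate pair set P: each vector with its top-K nearest neighbours.
        neighbours = np.argsort(dist, axis=1)[:, :top_k]
        pairs = [(i, int(j)) for i in range(len(vectors)) for j in neighbours[i]]
        pair_dist = np.array([dist[i, j] for i, j in pairs])
        # Similarity threshold: the t-th percentile of candidate distances (A4).
        theta = float(np.percentile(pair_dist, t))
        # Keep the pairs closer than the threshold (A5).
        return [p for p, d in zip(pairs, pair_dist) if d < theta], theta

Because the threshold is taken as a percentile of the observed candidate distances, it adapts to each similarity type rather than being a fixed constant.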
Further, the detection results of the three types of similar sentences are merged and tallied, and the counts are then normalized by the total text length to obtain the basic similarity of the long texts.
Further, the basic similarity is calculated as follows:
given two long texts T_i and T_j, suppose m similar sentence pairs are detected between T_i and T_j; the basic similarity s_ij of the two long texts is then calculated as:
s_ij = 2m / (n_i + n_j)
wherein n_i and n_j are the total numbers of sentences in the two long texts, respectively.
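In code this normalisation is a one-liner; the Dice-style form below follows the formula reconstructed above (the original formula image is not recoverable, so the factor of 2 is an assumption).

    def base_similarity(m: int, n_i: int, n_j: int) -> float:
        # Basic similarity s_ij of two long texts: m detected similar sentence
        # pairs, normalised by the total sentence counts of both texts.
        return 2.0 * m / (n_i + n_j)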
Further, the long texts and their basic similarities are expressed as a similar sentence relation graph G; each node T_i in the similar sentence relation graph represents a long text, i ∈ {1, …, N}; the node feature is a one-hot vector whose dimensionality is the total number N of long texts; if similar sentences exist between two long texts T_i and T_j, an edge exists between the nodes corresponding to the long texts;
for a long text T_i there is the feature vector:
h_i^(0) = onehot(i)
and the weight of the edge between the nodes corresponding to two long texts T_i and T_j is:
w_ij = s_ij
wherein s_ij is the basic similarity.
Further, information propagation and aggregation are performed twice on the relation graph to obtain new node feature information h_i^(2), which is used for updating; the calculation is as follows:
h_i^(1) = α_1 · h_i^(0) + Σ_{j ∈ N(i)} w_ij · h_j^(0)
h_i^(2) = α_2 · h_i^(1) + Σ_{j ∈ N(i)} w_ij · h_j^(1)
wherein h_i^(0) and h_j^(0) are the initial feature vector values of nodes T_i and T_j on the graph G; α_1 and α_2 are self-defined weights of the first and second operations, respectively, used to adjust the proportion of information aggregated on the graph in the two passes; h_i^(1) and h_j^(1) are the feature vector values of nodes T_i and T_j after the first update. Finally the node feature vector h_i^(2) = (σ_i1, σ_i2, …, σ_iN) is obtained, wherein σ_ij represents the text similarity between long text T_i and long text T_j.
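The graph stage can be sketched end to end in a few lines of numpy under the reconstruction above: one-hot node features, edge weights equal to the basic similarities, and two propagation rounds with self-weights α_1 and α_2. The summation form of the aggregation is an assumption where the original formula images are missing.

    import numpy as np

    def graph_similarity(base_sim: np.ndarray, alpha1: float = 1.0,
                         alpha2: float = 1.0) -> np.ndarray:
        # base_sim: (N, N) symmetric matrix of basic similarities s_ij,
        #           with 0.0 where two texts share no similar sentences (no edge).
        # Returns h2, where h2[i, j] is the fused similarity of texts T_i and T_j.
        n = base_sim.shape[0]
        w = base_sim.copy()
        np.fill_diagonal(w, 0.0)          # no self-edges on the relation graph
        h0 = np.eye(n)                    # one-hot initial node features
        h1 = alpha1 * h0 + w @ h0         # first propagation round
        h2 = alpha2 * h1 + w @ h1         # second propagation round
        return h2

With α_1 = α_2 = 1, the diagonal preserves each text's own identity while the off-diagonal entries accumulate one-hop and two-hop similarity evidence, which is what lets the final similarities fuse group information.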
Compared with the prior art, the invention has the beneficial effects that:
With the technical scheme provided by the invention, long text similarity is calculated by splitting the long texts into fine-grained sentences for encoding and comparison, making full use of the semantic information of the compared texts; the long texts are abstracted into nodes on a graph, and information propagation and aggregation on the graph let the node representations fuse group information; meanwhile, the similar sentences can be inspected visually. The long text similarity calculation method provided by the invention therefore makes long text similarity strongly interpretable and improves the effectiveness and precision of text processing.
Drawings
Fig. 1 is a block diagram of a two-stage process for calculating similarity of long texts according to the present invention.
FIG. 2 is a block flow diagram of the similar sentence detection stage of the method of the present invention.
FIG. 3 is a flow diagram of the graph structure calculation phase of the method of the present invention.
Detailed Description
The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.
The invention provides a two-stage long text similarity calculation method based on a deep learning model and a graph algorithm. For a group of long texts, the first stage detects similar sentence pairs between the long texts using several detection paths; the second stage aggregates the matched sentence pairs into a graph according to their sources, abstractly represents each long text as a node on the graph, and performs inference and interaction operations on the graph so that information propagates between nodes, yielding high-level node representations that fuse group information; finally, the text similarity between the long texts is obtained from the node features.
Fig. 1 shows the two-stage process of calculating long text similarity based on a deep learning model and a graph algorithm according to the present invention. The method comprises the following steps.
The first stage is the similar sentence detection stage:
1) Construct a sentence vector extraction model based on a deep learning model (a BERT or RoBERTa model can be used) and extract sentence vectors for the sentences and clauses in the long texts.
2) Detect similar sentence pairs of several types according to the similarity of the sentence vectors.
The second stage is the graph structure calculation stage:
3) Build the similar sentence pairs into a graph structure according to their source long texts.
4) Perform information propagation and aggregation on the graph and update the node feature information.
The value of each dimension of a node feature vector is the text similarity between the corresponding long texts.
The concrete implementation steps are as follows:
1) Split each long text into sentences at sentence-ending punctuation, and split the sentences into clauses at commas. Extract sentence vectors of the sentences and clauses with the contrastively fine-tuned sentence vector extraction models (the semantic similarity detection model and the paraphrase similarity detection model).
2) For the three types of sentence similarity (semantic, paraphrase, and local), detect similar sentence vectors by vector distance to obtain the corresponding similar sentence pairs.
3) Merge and tally the detection results of the three types of similar sentences, then normalize the counts by the sentence counts of the long texts. Aggregate the similar sentence pairs into a graph by their sources: each long text is a node on the graph, and the weight of an edge reflects the normalized number of similar sentence pairs between two long texts.
4) Perform information propagation and aggregation twice on the graph and update the node features with fused group information.
The value of each dimension of a node feature vector is the text similarity between the corresponding long texts.
The invention is further illustrated by the following examples.
Example 1
N electronic books in text format are regarded as N long texts T_1, T_2, …, T_N. The method provided by the invention is used to calculate the text similarity between every two of the long texts. The method comprises two stages, the similar sentence detection stage and the graph structure calculation stage (as shown in fig. 1).
1) Before similar sentence detection, a sentence vector extraction model is constructed for converting texts into sentence vectors. First, all the long texts are split into sentences; then the sentence vector extraction model is obtained by contrastive-learning fine-tuning of a pre-trained language representation model, BERT (Bidirectional Encoder Representations from Transformers) or RoBERTa; the text sentences are converted into sentence vectors through the sentence vector extraction model.
11) Fine-tune a BERT model by sentence semantic similarity contrastive learning training to obtain the semantic similarity detection model.
For each segmented sentence, a sentence vector v is first extracted from the sentence. A dropout perturbation is applied to the extracted sentence vector v to construct the positive example for contrastive learning of the sentence, and vectors extracted from the texts of the other sentences in the training batch serve as negative examples. The training loss function is computed from the sentence vector and the constructed positive and negative examples, with the same design as SimCSE (Simple Contrastive Learning of Sentence Embeddings). The trained model is named the semantic similarity detection model and is denoted M_sem.
12) Fine-tune a BERT model by sentence paraphrase similarity contrastive learning training to obtain the paraphrase similarity detection model.
For each sentence, a sentence vector v is first extracted from the sentence. The loss function for fine-tuning the BERT model comprises two parts, L_1 and L_2. L_1 is the same as the loss function of M_sem. When computing L_2, each sentence is split into clauses at commas, and clauses in the sentence text are randomly selected and shuffled to obtain a new sentence text. A dropout perturbation is applied to the sentence vector v⁺ extracted from the new sentence text to construct the positive example for contrastive learning of the sentence, and vectors v⁻ extracted from the texts of the other sentences in the training batch serve as negative examples. L_2 is the loss function computed from the sentence vector v and the constructed positive example v⁺ and negative examples v⁻, with the same design as SimCSE. The final training loss function is:
L = L_1 + λ · L_2
wherein λ is a hyper-parameter that must be set, which adjusts the degree of emphasis the model places on clause-structure recombination relative to semantic difference. The obtained model is named the paraphrase similarity detection model and is denoted M_par.
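A minimal sketch of the positive-example construction in this step: split the sentence at commas (Chinese or Western), shuffle the clauses, and rejoin. Shuffling every clause is a simplification of the patent's "randomly select and shuffle", and the function name is illustrative.

    import random
    import re
    from typing import Optional

    def shuffle_clauses(sentence: str, rng: Optional[random.Random] = None) -> str:
        # Build the clause-shuffled variant of a sentence used as the positive
        # example when training the paraphrase similarity detection model.
        rng = rng or random.Random()
        clauses = [c.strip() for c in re.split(r"[,，]", sentence) if c.strip()]
        rng.shuffle(clauses)              # simplification: shuffle every clause
        return "，".join(clauses)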
2) Similar sentence pairs are detected among the long texts T (as shown in fig. 2; in the specific implementation, a detection method is designed for each of the three types of similar sentence pairs).
A. To detect semantically similar sentence pairs, the following operations are performed:
A1. split each long text T_i into sentences at sentence-ending punctuation;
A2. extract the feature vectors of all sentences with the semantic similarity detection model, denoted E;
A3. deduplicate the sentence feature vectors E to obtain E'; for each feature vector, find its top-K similar vectors, and record all resulting vector pairs as P;
A4. compute the t-th percentile of the vector distances in P as the similarity threshold θ;
A5. select from E the sentence pairs whose feature-vector distance is less than θ; these are the semantically similar sentence pairs.
B. To detect paraphrase-similar sentence pairs, the following operations are performed:
B1. split each long text T_i into sentences at sentence-ending punctuation;
B2. extract the feature vectors of all sentences with the paraphrase similarity detection model, denoted E_p;
B3. deduplicate the sentence feature vectors E_p to obtain E_p'; for each feature vector, find its top-K similar vectors, and record all resulting vector pairs as P_p;
B4. compute the t-th percentile of the vector distances in P_p as the similarity threshold θ_p;
B5. select from E_p the sentence pairs whose feature-vector distance is less than θ_p; these are the paraphrase-similar sentence pairs.
C. To detect locally similar sentence pairs, the following operations are performed:
C1. split each long text T_i into sentences at sentence-ending punctuation, and split the sentences into clauses at commas;
C2. extract the feature vectors of all clauses with the semantic similarity detection model, denoted E_c;
C3. deduplicate the clause feature vectors E_c to obtain E_c'; for each feature vector, find its top-K similar vectors, and record all resulting vector pairs as P_c;
C4. compute the t-th percentile of the vector distances in P_c as the similarity threshold θ_c;
C5. select from E_c the clause pairs whose feature-vector distance is less than θ_c;
C6. trace the successfully matched clause pairs back to their corresponding sentence pairs; these are the locally similar sentence pairs.
After the detected similar sentence pair result is obtained, the graph structure calculation stage is entered (as shown in fig. 3).
3) Merge and tally the detection results of the three types of similar sentences, and normalize the counts by the total text length to obtain the basic similarity of the long texts. Specifically, suppose there are two long texts T_i and T_j, and m similar sentence pairs (covering the three similarity types) are detected between T_i and T_j, while the total numbers of sentences in the two long texts are n_i and n_j. The basic similarity s_ij of the two long texts is calculated as:
s_ij = 2m / (n_i + n_j)
4) The long texts and their basic similarities are abstractly represented as a similar sentence relation graph G. Each node T_i in the similar sentence relation graph represents a long text, i ∈ {1, …, N}; the node feature is a one-hot vector whose dimensionality is the total number N of long texts. For a long text T_i there is the feature vector:
h_i^(0) = onehot(i)
If similar sentences exist between two long texts T_i and T_j, there is an edge between the nodes corresponding to the long texts, and the weight of the edge is:
w_ij = s_ij
wherein s_ij is the basic similarity calculated in the previous step.
5) Information propagation and aggregation are performed twice on the relation graph to obtain new node feature information h_i^(2), which is used for updating. The calculation is as follows:
h_i^(1) = α_1 · h_i^(0) + Σ_{j ∈ N(i)} w_ij · h_j^(0)
h_i^(2) = α_2 · h_i^(1) + Σ_{j ∈ N(i)} w_ij · h_j^(1)
wherein α_1 and α_2 are self-defined weights of the first and second operations, respectively, used to adjust the proportion of information aggregated on the graph in the two passes. Finally the node feature vector h_i^(2) = (σ_i1, σ_i2, …, σ_iN) is obtained, wherein σ_ij represents the text similarity between long text T_i and long text T_j.
Calculating long text similarity with this method splits the long texts into fine-grained sentences for encoding and comparison, making full use of the semantic information of the compared texts; it abstracts the long texts into nodes on a graph and, by propagating and aggregating information on the graph, lets the node representations fuse group information; and the similar sentences can be inspected visually, so the long text similarity is strongly interpretable and the effectiveness and precision of text processing are improved.
It is noted that the disclosed embodiments are intended to aid in further understanding of the invention, but those skilled in the art will appreciate that: various alternatives and modifications are possible without departing from the invention and scope of the appended claims. Therefore, the invention should not be limited to the embodiments disclosed, but the scope of the invention is defined by the appended claims.

Claims (9)

1. A two-stage long text similarity calculation method, characterized in that,
in a first stage, a similar sentence detection stage, the method comprises:
11) constructing a sentence vector extraction model based on a deep learning model, the sentence vector extraction model comprising a semantic similarity detection model and a paraphrase similarity detection model;
12) converting the texts into sentence vectors through the sentence vector extraction model, and detecting similar sentence pairs of several types between the long texts with several detection methods, comprising: semantically similar sentence pairs, paraphrase-similar sentence pairs, and locally similar sentence pairs;
in a second stage, a graph structure calculation stage, the method comprises:
21) calculating a basic similarity;
22) constructing a similar sentence relation graph structure from the long-text similar sentence pairs and the basic similarity; each node on the similar sentence relation graph represents a long text, and an edge between nodes represents that similar sentences exist between the two corresponding long texts;
23) performing information propagation and aggregation twice on the similar sentence relation graph through operations on the graph to obtain high-level node representations that fuse group information, thereby obtaining and updating new node feature information;
the value of each dimension of a node feature vector is the text similarity between the corresponding long texts; the text similarity between the long texts is obtained from the node features.
2. The two-stage long text similarity calculation method according to claim 1, wherein before the similar sentence detection stage, each long text is first split into sentences; the sentence vector extraction model is obtained by contrastive-learning fine-tuning of a pre-trained language representation model (a BERT model or a RoBERTa model); sentence vectors of the sentences and clauses of the long texts are extracted respectively by the semantic similarity detection model and the paraphrase similarity detection model comprised in the sentence vector extraction model, so that the long texts are converted into sentence vectors.
3. The two-stage long text similarity calculation method according to claim 2, wherein the sentence vector extraction model is obtained by the following steps:
11) performing sentence semantic similarity contrastive learning training and fine-tuning a BERT model to obtain the semantic similarity detection model, comprising:
applying dropout to the extracted sentence vectors to construct the positive examples for contrastive learning;
taking the other sentence vectors in a training batch as negative examples for contrastive learning;
training with a loss function L_1 computed from the sentence vector and the constructed positive and negative examples;
naming the trained model the semantic similarity detection model;
12) performing sentence paraphrase similarity contrastive learning training and fine-tuning a BERT model to obtain the paraphrase similarity detection model, comprising:
extracting a sentence vector from the sentence text;
splitting each sentence into clauses at commas, and randomly selecting and shuffling clauses in the sentence text to obtain a new sentence text; applying dropout to the sentence vector extracted from the new sentence text to construct the positive example for contrastive learning; taking vectors extracted from the texts of the other sentences in the training batch as negative examples;
the loss function for fine-tuning the BERT model comprises L_1 and L_2; L_1 is the same loss function as that adopted in step 11); L_2 is the loss function computed from the sentence vector and the constructed positive and negative examples; the final loss function L is:
L = L_1 + λ · L_2
wherein λ is a hyper-parameter that must be set, used to adjust the degree of emphasis the model places on clause-structure recombination relative to semantic difference;
the obtained model is named the paraphrase similarity detection model.
4. The two-stage long text similarity calculation method according to claim 1, wherein the several detection methods of the first stage comprise three detection methods for similar sentence pairs, namely for semantically similar sentence pairs, paraphrase-similar sentence pairs, and locally similar sentence pairs.
5. The two-stage long text similarity calculation method according to claim 4, further comprising:
A. to detect semantically similar sentence pairs, performing the following operations:
A1. splitting each long text T_i into sentences at sentence-ending punctuation;
A2. extracting the feature vectors of all sentences with the semantic similarity detection model, denoted E;
A3. deduplicating the sentence feature vectors E to obtain E'; for each feature vector, finding its top-K similar vectors, and recording all resulting vector pairs as P;
A4. computing the t-th percentile of the vector distances in P as the similarity threshold θ;
A5. selecting from E the sentence pairs whose feature-vector distance is less than θ; these are the semantically similar sentence pairs;
B. to detect paraphrase-similar sentence pairs, performing the following operations:
B1. splitting each long text T_i into sentences at sentence-ending punctuation;
B2. extracting the feature vectors of all sentences with the paraphrase similarity detection model, denoted E_p;
B3. deduplicating the sentence feature vectors E_p to obtain E_p'; for each feature vector, finding its top-K similar vectors, and recording all resulting vector pairs as P_p;
B4. computing the t-th percentile of the vector distances in P_p as the similarity threshold θ_p;
B5. selecting from E_p the sentence pairs whose feature-vector distance is less than θ_p; these are the paraphrase-similar sentence pairs;
C. to detect locally similar sentence pairs, performing the following operations:
C1. splitting each long text T_i into sentences at sentence-ending punctuation, and splitting the sentences into clauses at commas;
C2. extracting the feature vectors of all clauses with the semantic similarity detection model, denoted E_c;
C3. deduplicating the clause feature vectors E_c to obtain E_c'; for each feature vector, finding its top-K similar vectors, and recording all resulting vector pairs as P_c;
C4. computing the t-th percentile of the vector distances in P_c as the similarity threshold θ_c;
C5. selecting from E_c the clause pairs whose feature-vector distance is less than θ_c;
C6. tracing the successfully matched clause pairs back to their corresponding sentence pairs; these are the locally similar sentence pairs.
6. The two-stage long text similarity calculation method according to claim 5, wherein after the detection results of the three types of similar sentences are merged and tallied, the counts are normalized by the total text length to obtain the basic similarity of the long texts.
7. The two-stage long text similarity calculation method according to claim 6, wherein the basic similarity is calculated as follows:
given two long texts T_i and T_j, suppose m similar sentence pairs are detected between T_i and T_j; the basic similarity s_ij of the two long texts is then calculated as:
s_ij = 2m / (n_i + n_j)
wherein n_i and n_j are the total numbers of sentences in the two long texts, respectively.
8. The two-stage long text similarity calculation method according to claim 7, further comprising representing the long texts and their basic similarities as a similar sentence relation graph G; each node T_i in the similar sentence relation graph represents a long text, i ∈ {1, …, N}; the node feature is a one-hot vector whose dimensionality is the total number N of long texts; if similar sentences exist between two long texts T_i and T_j, an edge exists between the nodes corresponding to the long texts;
for a long text T_i there is the feature vector:
h_i^(0) = onehot(i)
and the weight of the edge between the nodes corresponding to two long texts T_i and T_j is:
w_ij = s_ij
wherein s_ij is the basic similarity.
9. The two-stage long text similarity calculation method according to claim 8, further comprising performing information propagation and aggregation twice on the relation graph to obtain new node feature information h_i^(2) and updating; the calculation is as follows:
h_i^(1) = α_1 · h_i^(0) + Σ_{j ∈ N(i)} w_ij · h_j^(0)
h_i^(2) = α_2 · h_i^(1) + Σ_{j ∈ N(i)} w_ij · h_j^(1)
wherein α_1 and α_2 are self-defined weights of the first and second operations, respectively, used to adjust the proportion of information aggregated on the graph in the two passes; h_i^(1) and h_j^(1) are the feature vector values of nodes T_i and T_j on the graph G after the first update; finally the node feature vector h_i^(2) = (σ_i1, σ_i2, …, σ_iN) is obtained, wherein σ_ij represents the text similarity between long text T_i and long text T_j.
CN202210298133.6A 2022-03-25 2022-03-25 Two-stage long text similarity calculation method Active CN114398867B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210298133.6A CN114398867B (en) 2022-03-25 2022-03-25 Two-stage long text similarity calculation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210298133.6A CN114398867B (en) 2022-03-25 2022-03-25 Two-stage long text similarity calculation method

Publications (2)

Publication Number Publication Date
CN114398867A true CN114398867A (en) 2022-04-26
CN114398867B (en) 2022-06-28

Family

ID=81234598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210298133.6A Active CN114398867B (en) 2022-03-25 2022-03-25 Two-stage long text similarity calculation method

Country Status (1)

Country Link
CN (1) CN114398867B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117688138A (en) * 2024-02-02 2024-03-12 中船凌久高科(武汉)有限公司 Long text similarity comparison method based on paragraph division

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130054612A1 (en) * 2006-10-10 2013-02-28 Abbyy Software Ltd. Universal Document Similarity
CN110196906A (en) * 2019-01-04 2019-09-03 华南理工大学 Towards financial industry based on deep learning text similarity detection method
CN113486645A (en) * 2021-06-08 2021-10-08 浙江华巽科技有限公司 Text similarity detection method based on deep learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130054612A1 (en) * 2006-10-10 2013-02-28 Abbyy Software Ltd. Universal Document Similarity
CN110196906A (en) * 2019-01-04 2019-09-03 华南理工大学 Towards financial industry based on deep learning text similarity detection method
CN113486645A (en) * 2021-06-08 2021-10-08 浙江华巽科技有限公司 Text similarity detection method based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MIGUEL FERIA et al.: "Constructing a Word Similarity Graph from Vector based Word Representation for Named Entity Recognition", arXiv, 9 July 2018 (2018-07-09), pages 1-6 *
WANG Shuai et al.: "TP-AS: A Two-Phase Automatic Summarization Method for Long Texts", Journal of Chinese Information Processing (中文信息学报), vol. 32, no. 06, 30 June 2018 (2018-06-30), pages 71-79 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117688138A (en) * 2024-02-02 2024-03-12 中船凌久高科(武汉)有限公司 Long text similarity comparison method based on paragraph division
CN117688138B (en) * 2024-02-02 2024-04-09 中船凌久高科(武汉)有限公司 Long text similarity comparison method based on paragraph division

Also Published As

Publication number Publication date
CN114398867B (en) 2022-06-28

Similar Documents

Publication Publication Date Title
RU2628436C1 (en) Classification of texts on natural language based on semantic signs
Necşulescu et al. Reading between the lines: Overcoming data sparsity for accurate classification of lexical relationships
CN105512277B (en) A kind of short text clustering method towards Book Market title
CN110222172B (en) Multi-source network public opinion theme mining method based on improved hierarchical clustering
CN115393692A (en) Generation formula pre-training language model-based association text-to-image generation method
CN101727462B (en) Method and device for generating Chinese comparative sentence sorter model and identifying Chinese comparative sentences
CN107357895B (en) Text representation processing method based on bag-of-words model
CN103473380A (en) Computer text sentiment classification method
CN114398867B (en) Two-stage long text similarity calculation method
CN114564563A (en) End-to-end entity relationship joint extraction method and system based on relationship decomposition
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
Padmasundari et al. Intent discovery through unsupervised semantic text clustering
CN112417082B (en) Scientific research achievement data disambiguation filing storage method
Ghosh Sentiment analysis of IMDb movie reviews: a comparative study on performance of hyperparameter-tuned classification algorithms
Shounak et al. Reddit comment toxicity score prediction through bert via transformer based architecture
CN110674293B (en) Text classification method based on semantic migration
CN109871429B (en) Short text retrieval method integrating Wikipedia classification and explicit semantic features
Bergelid Classification of explicit music content using lyrics and music metadata
CN108920475B (en) Short text similarity calculation method
Yafooz et al. Enhancing multi-class web video categorization model using machine and deep learning approaches
CN115098690A (en) Multi-data document classification method and system based on cluster analysis
CN111899832B (en) Medical theme management system and method based on context semantic analysis
Garg et al. Identification of relations from IndoWordNet for indian languages using support vector machine
Jamil et al. Topic identification method for textual document
CN113111288A (en) Web service classification method fusing unstructured and structured information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant