CN116306667A - Text matching method and system for long text - Google Patents

Text matching method and system for long text

Info

Publication number
CN116306667A
CN116306667A (Application CN202310131234.9A)
Authority
CN
China
Prior art keywords
text
sentence
sentences
similarity
texts
Prior art date
Legal status
Pending
Application number
CN202310131234.9A
Other languages
Chinese (zh)
Inventor
彭程
王佳睿
谢季
刘峰荣
余鸿
任思远
何智毅
陈科
Current Assignee
Chengdu Zhongke Information Technology Co ltd
Chengdu Information Technology Co Ltd of CAS
Original Assignee
Chengdu Zhongke Information Technology Co ltd
Chengdu Information Technology Co Ltd of CAS
Priority date
Filing date
Publication date
Application filed by Chengdu Zhongke Information Technology Co ltd, Chengdu Information Technology Co Ltd of CAS filed Critical Chengdu Zhongke Information Technology Co ltd
Priority to CN202310131234.9A priority Critical patent/CN116306667A/en
Publication of CN116306667A publication Critical patent/CN116306667A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text matching method and system for long texts. According to the invention, the texts to be matched are first input into a sentence-level filter, which removes noise sentences and extracts key sentences; the key sentences are then input into a word-level filter, where a BERT model fused with the PageRank algorithm mines deep interaction features between the texts and performs word-level noise filtering and fine-grained matching operations on the key sentences. Finally, the relation of the text pair is predicted from the spliced vector representations taken at different positions of the BERT output. The invention has the following positive effects: (1) compared with inputting the entire content of a long text into the model for training without any pruning, deleting noise sentences effectively shortens the text and removes useless information; (2) deleting noise words inside BERT lets the model focus on beneficial fine-grained matching signals, so the matching precision is higher; (3) combining the vector representations at different positions of the BERT output makes full use of the encoded semantic information of the two texts for the prediction task, so the matching accuracy is higher.

Description

Text matching method and system for long text
Technical Field
The invention relates to the technical field of computers, in particular to a text matching method and system for long texts.
Background
Text matching is a key task in natural language processing applications such as community question answering, information retrieval and dialogue systems; it aims to analyze and judge the semantic association between a source text and a target text. Long text matching is an important sub-direction of the text matching field: it can rapidly judge the relationship between two documents and identify whether their topics are expressed similarly, and it therefore has great research and application value.
Text matching models follow two lines of approach: traditional models and deep models. The traditional approach measures the degree of text matching by manually defining and extracting features. It suffers from the high cost of manual feature extraction, incomplete feature coverage and similar problems. Moreover, it is essentially a surface-level matching method and cannot accomplish deeper semantic matching tasks.
The deep-model approach encodes the texts using the strong language representation capability of deep neural networks, mines deep semantic information of the texts, and performs the matching operation in semantic space. This approach achieves higher accuracy without manually designed features. However, most current deep models are designed for short texts (i.e., short-text deep matching models), while fine-grained matching signals between long texts are typically sparse. When a short-text deep matching model is used to match long texts, it is difficult to identify the matching signals among a large number of noise signals, so the matching effect is unsatisfactory.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention provides a text matching method and a system for long texts.
The technical scheme adopted for solving the technical problems is as follows: a text matching method for long text, the text matching method comprising the steps of:
inputting two texts to be matched into sentence-level filters, wherein the sentence-level filters respectively extract corresponding text key sentences for each text to be matched;
inputting the text key sentence into a word level filter, performing word level noise filtering and fine granularity matching operation by the word level filter, and outputting a text after filtering and matching;
outputting the position vectors corresponding to the two filtered and matched texts by using a BERT model, and inputting the position vectors into a 1-dimensional convolutional neural network for text semantic feature extraction, obtaining feature expression vectors that integrate the contexts of the two texts;

and splicing the feature expression vectors of the two texts with the feature expression vector of the CLS identifier in the BERT model, and inputting the result into a fully connected neural network to predict a similarity score, the similarity score serving as the basis for judging whether the texts match.
Further, the step of extracting text key sentences by the sentence-level filter specifically includes:
constructing a graph model through a TextRank algorithm;
the graph model captures the inter-sentence similarity inside the text and the inter-sentence similarity between the texts;
and extracting text key sentences according to the inter-sentence similarity in the text and the inter-sentence similarity between the texts.
Further, the step of constructing the graph model through the TextRank algorithm specifically includes:
text of will source
Figure BDA0004083944960000021
And target text->
Figure BDA0004083944960000022
All sentences input, the L 1 ,L 2 Total number of sentences in the source text and the target text, respectively, said +.>
Figure BDA0004083944960000023
Is each sentence in the source text, said +.>
Figure BDA0004083944960000024
Is each sentence in the target text, said d s Is a set of source text sentences, said d t Is a target text sentence collection;
combination d s D t Obtaining all sentence sets
Figure BDA0004083944960000031
And (3) taking sentences in the S as vertexes, taking the similarity among the sentences as the weight of the edge, and constructing a graph model.
Further, the inter-sentence similarity is calculated as the proportion of the number of co-occurring words to the total number of words in the two sentences; the similarity $\mathrm{sim}(s_i, s_j)$ between sentence $s_i$ and sentence $s_j$ is given by formula (1):

$$\mathrm{sim}(s_i, s_j) = \frac{\left|\{w_k \mid w_k \in s_i \wedge w_k \in s_j\}\right|}{\log(|s_i|) + \log(|s_j|)} \qquad (1)$$

wherein $s_i$ and $s_j$ are the $i$-th and $j$-th sentences respectively, $\mathrm{sim}(s_i, s_j)$ is the similarity between them, and $w_k$ is a word occurring in both $s_i$ and $s_j$.
Further, after the step of constructing the graph model by the TextRank algorithm, the method further comprises the steps of:
scoring the sentences by the TextRank algorithm and extracting the text key sentences according to the scores, wherein the score $W(s_i)$ of sentence $s_i$ is obtained by iterating formula (2):

$$W(s_i) = (1-d) + d \sum_{s_j \in \mathrm{In}(s_i)} \frac{\mathrm{sim}(s_j, s_i)}{\sum_{s_k \in \mathrm{Out}(s_j)} \mathrm{sim}(s_j, s_k)} \, W(s_j) \qquad (2)$$

wherein $W(s_i)$ is the weight value of sentence $s_i$, $W(s_j)$ is the weight value of sentence $s_j$, $d$ is a damping coefficient representing the probability of pointing from a given node to any other node in the graph, typically set to 0.85, $s_i$, $s_j$ and $s_k$ are all sentences in the sentence set, $\mathrm{sim}(s_i, s_j)$ is the similarity between $s_i$ and $s_j$, and $\mathrm{sim}(s_j, s_k)$ is the similarity between $s_j$ and $s_k$.
Further, the step of filtering the word-level noise by the word-level filter specifically includes:
the word level filter is based on a BERT model, performs word deletion strategies by fusing a PageRank algorithm and an Attention matrix, and screens and deletes word level noise information of a hidden layer.
It is another object of the present invention to provide a text matching system for long text, the system comprising:
the sentence-level filter is used for receiving two text inputs to be matched and respectively extracting corresponding text key sentences for each text to be matched;
the word level filter is used for receiving text key sentence input, carrying out word level noise filtering and fine granularity matching operation on the text key sentence, and outputting a text after filtering and matching;
the vector acquisition module is used for outputting the position vectors corresponding to the two filtered and matched texts by using the BERT model and inputting them into a 1-dimensional convolutional neural network for text semantic feature extraction, obtaining feature expression vectors that integrate the contexts of the two texts; and

the similarity analysis module is used for splicing the feature expression vectors of the two texts with the feature expression vector of the CLS identifier in the BERT model and inputting the result into the fully connected neural network to predict a similarity score, the similarity score serving as the basis for judging whether the texts match.
Further, the sentence-level filter includes:
the diagram model construction module is used for constructing a diagram model through a TextRank algorithm;
the similarity capturing module is used for capturing the inter-sentence similarity in the texts and the inter-sentence similarity between the texts by the graph model; and
and the key sentence extraction module is used for extracting text key sentences according to the inter-sentence similarity in the text and the inter-sentence similarity between the texts.
Further, the graph model construction module specifically includes:
a sentence input module for inputting all sentences of the source text $d_s = \{s_1^s, s_2^s, \ldots, s_{L_1}^s\}$ and the target text $d_t = \{s_1^t, s_2^t, \ldots, s_{L_2}^t\}$, wherein $L_1$ and $L_2$ are the total numbers of sentences in the source text and the target text respectively, $s_i^s$ is each sentence in the source text, $s_j^t$ is each sentence in the target text, $d_s$ is the set of source text sentences, and $d_t$ is the set of target text sentences;

a sentence combination module for combining $d_s$ and $d_t$ to obtain the set of all sentences $S = d_s \cup d_t = \{s_1, s_2, \ldots, s_{L_1+L_2}\}$; and

a graph model generation module for constructing the graph model by taking the sentences in $S$ as vertices and the inter-sentence similarity as the weights of the edges.
According to the invention, the texts to be matched are input into a sentence-level filter to remove noise sentences and extract key sentences; the key sentences are then input into a word-level filter, where a BERT model fused with the PageRank algorithm mines deep interaction features between the texts and performs word-level noise filtering and fine-grained matching operations on the key sentences. Finally, the relation of the text pair is predicted from the spliced vector representations taken at different positions of the BERT output. The text matching method provided by the invention deletes the noise sentences and noise words in long texts and performs matching with the simplified information. Compared with the prior art, the invention has the following positive effects: (1) compared with inputting the entire content of a long text into the model for training without any pruning, deleting noise sentences effectively shortens the text and removes useless information; (2) deleting noise words inside BERT lets the model focus on beneficial fine-grained matching signals, so the matching precision is higher; (3) combining the vector representations at different positions of the BERT output makes full use of the encoded semantic information of the two texts for the prediction task, so the matching accuracy is higher.
Drawings
The invention will now be described by way of example and with reference to the accompanying drawings in which:
FIG. 1 is a flowchart of a text matching method for long text provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of a text matching method according to an embodiment of the present invention;
fig. 3 is a block diagram of a text matching system for long text according to an embodiment of the present invention.
Detailed Description
The invention will now be described in further detail with reference to specific examples thereof in connection with the accompanying drawings.
Referring to fig. 1 and fig. 2, fig. 1 shows a flow of a text matching method for long text provided in an embodiment of the present invention, and details are as follows:
in step S101, two texts to be matched are input into sentence-level filters, which extract corresponding text key sentences for each text to be matched, respectively.
As an embodiment of the present invention, the step of extracting text key sentences by the sentence-level filter specifically includes:
constructing a graph model through a TextRank algorithm;
the graph model captures the inter-sentence similarity inside the text and the inter-sentence similarity between the texts;
and extracting text key sentences according to the inter-sentence similarity in the text and the inter-sentence similarity between the texts.
As an embodiment of the present invention, the step of constructing the graph model by the TextRank algorithm specifically includes:
two texts are combined
Figure BDA0004083944960000061
And->
Figure BDA0004083944960000062
All sentences are input; the L is 1 ,L 2 Total number of sentences in the source text and the target text, respectively, said +.>
Figure BDA0004083944960000063
Is each sentence in the source text, said
Figure BDA0004083944960000064
Is each sentence in the target text, said d s Is a set of source text sentences, said d t Is a target text sentence collection;
combination d s D t Obtaining the obtainedWith collection of sentences
Figure BDA0004083944960000065
And (3) taking sentences in the S as vertexes, taking the similarity among the sentences as the weight of the edge, and constructing a graph model.
The inter-sentence similarity (including the sentence similarity inside a text and the sentence similarity between the texts) is calculated as the proportion of the number of co-occurring words to the total number of words in the two sentences. The similarity $\mathrm{sim}(s_i, s_j)$ between sentences $s_i$ and $s_j$ is given by formula (1):

$$\mathrm{sim}(s_i, s_j) = \frac{\left|\{w_k \mid w_k \in s_i \wedge w_k \in s_j\}\right|}{\log(|s_i|) + \log(|s_j|)} \qquad (1)$$

wherein $s_i$ and $s_j$ are the $i$-th and $j$-th sentences respectively, $\mathrm{sim}(s_i, s_j)$ is the similarity between them, and $w_k$ is a word occurring in both $s_i$ and $s_j$.
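For illustration only (this sketch is not part of the original disclosure), the following Python code computes formula (1) for every sentence pair and assembles the edge-weight matrix of the graph model over S; pre-tokenized sentences and the example data are assumptions made for brevity, and an embodiment for Chinese text would rely on a word segmenter.

```python
import math
from itertools import combinations

def sim(s_i: list[str], s_j: list[str]) -> float:
    """Formula (1): count of co-occurring words over log sentence lengths."""
    co_occur = len(set(s_i) & set(s_j))
    denom = math.log(len(s_i)) + math.log(len(s_j))
    return co_occur / denom if denom > 0 else 0.0

def build_graph(sentences: list[list[str]]) -> list[list[float]]:
    """Vertices are the sentences of S = d_s ∪ d_t; edge weights are sim()."""
    n = len(sentences)
    w = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        w[i][j] = w[j][i] = sim(sentences[i], sentences[j])
    return w

# Toy S built from one source sentence and two target sentences.
S = [["subway", "enters", "combat", "state"],
     ["typhoon", "affects", "subway", "operation"],
     ["subway", "suspends", "operation", "for", "typhoon"]]
weights = build_graph(S)
```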
After the graph model is constructed, the method further comprises the following step: the sentences are scored with the TextRank algorithm and the text key sentences are extracted according to the scores, wherein the score $W(s_i)$ of sentence $s_i$ is obtained by iterating formula (2):

$$W(s_i) = (1-d) + d \sum_{s_j \in \mathrm{In}(s_i)} \frac{\mathrm{sim}(s_j, s_i)}{\sum_{s_k \in \mathrm{Out}(s_j)} \mathrm{sim}(s_j, s_k)} \, W(s_j) \qquad (2)$$

wherein $W(s_i)$ is the weight value of sentence $s_i$, $W(s_j)$ is the weight value of sentence $s_j$, $d$ is a damping coefficient representing the probability of pointing from a given node to any other node in the graph, typically set to 0.85, $s_i$, $s_j$ and $s_k$ are all sentences in the sentence set, $\mathrm{sim}(s_i, s_j)$ is the similarity between $s_i$ and $s_j$, and $\mathrm{sim}(s_j, s_k)$ is the similarity between $s_j$ and $s_k$.
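Continuing the sketch, formula (2) can be iterated to convergence as below; the convergence threshold, the iteration cap and the uniform initial scores are illustrative choices rather than values specified by the patent. Since the similarity graph is undirected, In(s_i) and Out(s_j) both reduce to a vertex's neighbours.

```python
def textrank_scores(w, d=0.85, eps=1e-6, max_iter=100):
    """Iterate formula (2) over the sentence graph until the scores converge."""
    n = len(w)
    out_sum = [sum(row) for row in w]   # sum_k sim(s_j, s_k) for each vertex j
    scores = [1.0] * n
    for _ in range(max_iter):
        new = [(1 - d) + d * sum(w[j][i] / out_sum[j] * scores[j]
                                 for j in range(n)
                                 if j != i and out_sum[j] > 0)
               for i in range(n)]
        converged = max(abs(a - b) for a, b in zip(new, scores)) < eps
        scores = new
        if converged:
            break
    return scores

# Usage (with build_graph from the previous sketch):
#   scores = textrank_scores(build_graph(S))
# The key sentences are the top-ranked members of S coming from each text.
```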
In step S102, the text key sentence is input into a word-level filter, and the word-level filter performs word-level noise filtering and fine granularity matching operations, and outputs a text after filtering and matching.
The step of filtering the word level noise by the word level filter specifically comprises the following steps:
the word level filter is based on a BERT model, a PageRank algorithm and an Attention matrix are fused to execute a word deletion strategy, and word level noise information of a hidden layer is screened and deleted, so that a text after filtering and matching is obtained.
As an embodiment of the present invention, the word-level filter works as follows: the text key sentences are input into the BERT model to mine fine-grained interaction semantic information between the key sentences of the two texts, and the hidden-layer word nodes of BERT are scored using an attention matrix $A$ and the PageRank algorithm. The score consists of two parts. For the first part, a graph model is first built over the BERT hidden-layer nodes and the attention matrix $A$ is regarded as the adjacency matrix in PageRank; the PageRank algorithm is then iterated, and the node importance value $u$ is obtained after convergence, the $t$-th iteration being given by formula (3):

$$u^{t+1} = d\,(A^{l})^{\top} u^{t} + \frac{1-d}{N}\cdot\mathbf{1} \qquad (3)$$

wherein $u^{t}$ is the node weight after the $t$-th iteration, $A^{l}$ is the attention matrix of the $l$-th layer, $d$ is a damping coefficient, $N$ is the number of nodes in the graph, and $\mathbf{1}$ is the all-ones vector.

The attention matrix is then taken as the weight matrix $A$ and multiplied by $u$ to obtain the first-part score $R = Au$. For the second part, the vector $P$ obtained by summing the matrix $A$ over its columns is regarded directly as the initial importance score of the word nodes; with the attention matrix again taken as the weight matrix, multiplication gives the score under this initialization, $R^{*} = AP$. The two scores are linearly combined to obtain the final score $R_{final} = \alpha R^{*} + (1-\alpha)R$, and the hidden nodes with lower $R_{final}$ scores are deleted on this basis.
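The two-part word-node scoring can be sketched with NumPy as follows, under the assumption that A is a single layer's token-to-token attention matrix already averaged over heads; the values of d, alpha and the iteration count are illustrative, and the function name is hypothetical.

```python
import numpy as np

def word_node_scores(A: np.ndarray, d: float = 0.85, alpha: float = 0.5,
                     n_iter: int = 30) -> np.ndarray:
    """Score BERT hidden-layer word nodes via formula (3) plus R* = A P."""
    N = A.shape[0]
    # Part 1: PageRank iteration u_{t+1} = d (A^l)^T u_t + (1 - d)/N * 1.
    u = np.full(N, 1.0 / N)
    for _ in range(n_iter):
        u = d * A.T @ u + (1 - d) / N
    R = A @ u                      # first-part score R = A u
    # Part 2: column sums of A as the initial node importance P, then R* = A P.
    P = A.sum(axis=0)
    R_star = A @ P
    return alpha * R_star + (1 - alpha) * R    # R_final

# The lowest-scoring hidden nodes are treated as word-level noise and deleted,
# e.g. keep = np.argsort(word_node_scores(A))[-k:] for a retention budget k.
```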
In step S103, a BERT model is used to output two position vectors corresponding to the text after filtering and matching, the position vectors are input into a 1-dimensional convolutional neural network to perform text semantic feature extraction, and feature expression vectors of comprehensive contexts of the two texts are obtained.
In the output layer of the BERT model, the position vectors of each text are obtained by taking the position of the SEP identifier as the separation point: $H_s = [h_1, h_2, \ldots, h_{SEP-1}]$ and $H_t = [h_{SEP+1}, h_{SEP+2}, \ldots, h_{N}]$. The two are respectively input into a 1-dimensional convolutional neural network to model the text context, obtaining the feature expression vectors of the two texts, $H_s = \mathrm{Conv}(H_s)$ and $H_t = \mathrm{Conv}(H_t)$.
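A PyTorch sketch of this split-and-convolve step is given below; the kernel size, padding and max-pooling are assumptions, since the patent specifies only a 1-dimensional convolution over each text's position vectors, and sharing one convolution for both texts is one possible design choice.

```python
import torch
import torch.nn as nn

class ContextConv(nn.Module):
    """1-D convolution over one text's token vectors, pooled to one vector."""
    def __init__(self, hidden: int = 768, kernel: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, hidden); Conv1d expects (batch, hidden, seq_len).
        return self.conv(h.transpose(1, 2)).amax(dim=2)

# Splitting the BERT output at the [SEP] index `sep` (position 0 is [CLS]):
hidden = torch.randn(1, 32, 768)        # stand-in for BERT's output layer
sep = 16
conv = ContextConv()
H_s = conv(hidden[:, 1:sep, :])         # H_s = Conv(h_1 .. h_{SEP-1})
H_t = conv(hidden[:, sep + 1:, :])      # H_t = Conv(h_{SEP+1} .. h_N)
```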
In step S104, the feature expression vectors of the two texts and the feature expression vector of the CLS identifier in the BERT model are spliced and input into the fully connected neural network to predict a similarity score, and the similarity score is used as the basis for judging whether the texts match.
The feature representation of the CLS identifier in the BERT model is denoted $H_{CLS}$; it is spliced with the feature representations $H_s$ and $H_t$ of the two texts and input into the two-layer fully connected neural network to predict the similarity score:

$$score = \mathrm{sigmoid}\big(FC_2\big(FC_1([H_{CLS}; H_s; H_t])\big)\big) \qquad (4)$$

wherein $FC_1$ and $FC_2$ are the two fully connected layers and sigmoid is the activation function.
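Formula (4) then reduces to a small feed-forward head over the spliced vectors, sketched below; the hidden width and the intermediate ReLU are assumptions (the patent names only the two fully connected layers and the sigmoid), the ReLU being added because two stacked linear layers without a nonlinearity would collapse into a single linear map.

```python
import torch
import torch.nn as nn

class MatchHead(nn.Module):
    """Formula (4): score = sigmoid(FC2(FC1([H_CLS; H_s; H_t])))."""
    def __init__(self, hidden: int = 768, mid: int = 256):
        super().__init__()
        self.fc1 = nn.Linear(3 * hidden, mid)   # FC_1 over the spliced vector
        self.fc2 = nn.Linear(mid, 1)            # FC_2 down to a scalar score

    def forward(self, h_cls, h_s, h_t):
        x = torch.cat([h_cls, h_s, h_t], dim=-1)        # [H_CLS; H_s; H_t]
        return torch.sigmoid(self.fc2(torch.relu(self.fc1(x))))

# Stand-in vectors for H_CLS, H_s and H_t; score > 0.5 reads as "matched".
score = MatchHead()(torch.randn(1, 768), torch.randn(1, 768), torch.randn(1, 768))
```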
With the above method and system, the following source text and target text are input, and through the text matching method, the output similarity is 1 (i.e., the source text is similar to the target text).
Source text:
for 9 months and 14 days, as the plum blossom typhoons move in the northwest direction, the main body of the plum blossom typhoons begins to influence Shanghai. The typhoon plum blossom type subway comprehensive entry combat state is adopted, early warning is carried out in advance, and quick response is carried out. And the station workers are separated into several paths by the stations of the Shanghai subway 3 and No. 4 line Shanghai railway station at noon of 12, and the conditions of flood control plate inspection, flood control station box material checking inspection, water leakage point position confirmation and the like are respectively carried out. In addition, since the stations of the No. 3 and No. 4 offshore railway stations are open-air stations, station staff can move the stations upwards to carry out certain fixing measures, such as garbage cans and the like. The offshore subway indicates that if the wind reaches level 9, a shutdown scheme of the ground overhead line will be adopted. The reporter just gets from the Shanghai subway, in order to cope with typhoon 'plum blossom', measures such as line shrinkage or shutdown are taken on the ground of the Shanghai subway and on an overhead line in the tonight 21, so that the travel of citizens passengers is guaranteed, and the passengers are requested to travel in advance. If typhoon paths or wind power change, overhead ground lines can also limit speed in advance, shrink lines, stop operation and end operation.
Target text:
the typhoon plum blossom type subway comprehensive disaster prevention system is capable of comprehensively entering a temporary combat state, early warning and quick response are performed in advance, train shutdown and passenger evacuation work are performed in time once dangerous situations occur, and the influence and loss of secondary disasters are reduced as much as possible. According to the current typhoon trend, when the current typhoon trend is 21, the subway ground in the Shanghai and the overhead line take measures such as line shrinkage or shutdown, so that the travel of citizens is ensured, and passengers are requested to travel in advance. If typhoon paths or wind power change, overhead ground lines can also limit speed in advance, shrink lines, stop operation and end operation. The passengers of the citizen are reminded that the lines such as the line 3, the line 5, the line 16, the line 17, the line of Pujiang, the magnetic levitation line and the like are stopped at the moment, and the line parts such as the line 1, the line 2, the line 4, the line 6, the line 7, the line 8, the line 9, the line 10 and the line 11 are stopped. In addition, the Shanghai subway adjusts the tomorrow operation plan in real time according to the influence degree of wind power, the operation time of the first class vehicles of each line in the tomorrow is possibly delayed, the train operation period is possibly provided with speed limiting measures, and the operation interval is prolonged. Specifically, passengers pay attention to real-time information pushed by official microblogs of Shanghai subway shmetro, app of Metro general, and the like, and travel paths are timely adjusted.
Whether similar (1 similar/0 dissimilar): 1
Referring to fig. 3, a structure of a text matching system for long text provided by an embodiment of the present invention is shown, where the system includes: sentence-level filter 31, word-level filter 32, vector acquisition module 33, and similarity analysis module 34.
The sentence-level filter 31 receives the two texts to be matched as input and extracts the corresponding text key sentences for each text to be matched; the word-level filter 32 receives the text key sentences as input, performs word-level noise filtering and fine-grained matching operations on them, and outputs the filtered and matched text; the vector acquisition module 33 outputs the position vectors corresponding to the two filtered and matched texts using the BERT model and inputs them into a 1-dimensional convolutional neural network for text semantic feature extraction, obtaining feature expression vectors that integrate the contexts of the two texts; and the similarity analysis module 34 splices the feature expression vectors of the two texts with the feature expression vector of the CLS identifier in the BERT model and inputs the result into the fully connected neural network to predict a similarity score, which serves as the basis for judging whether the texts match.
As an embodiment of the invention, the sentence-level filter 31 includes: the graph model building module 311, the similarity capturing module 312 and the key sentence extracting module 313.
The graph model construction module 311 constructs a graph model through a TextRank algorithm; the similarity capture module 312 captures inter-sentence similarity inside text and inter-sentence similarity between text from the graph model; and the key sentence extraction module 313 extracts text key sentences according to the inter-sentence similarity inside the text and the inter-sentence similarity between the texts.
The graph model construction module 311 specifically includes: sentence input module 3111, sentence combination module 3112, and graphic model generation module 3113.
The sentence input module 3111 inputs all sentences of the source text $d_s = \{s_1^s, s_2^s, \ldots, s_{L_1}^s\}$ and the target text $d_t = \{s_1^t, s_2^t, \ldots, s_{L_2}^t\}$, wherein $L_1$ and $L_2$ are the total numbers of sentences in the source text and the target text respectively, $s_i^s$ is each sentence in the source text, $s_j^t$ is each sentence in the target text, $d_s$ is the set of source text sentences, and $d_t$ is the set of target text sentences.

The sentence combination module 3112 combines $d_s$ and $d_t$ to obtain the set of all sentences $S = d_s \cup d_t = \{s_1, s_2, \ldots, s_{L_1+L_2}\}$.

The graph model generation module 3113 constructs the graph model with the sentences in $S$ as vertices and the inter-sentence similarity as the weights of the edges.
The word-level noise filtering performed by the word-level filter 32 is specifically as follows: based on the BERT model, the word-level filter 32 performs a word deletion strategy by fusing the PageRank algorithm with the attention matrix, screening and deleting the word-level noise information of the hidden layers.
In summary, the invention inputs the texts to be matched into a sentence-level filter to remove noise sentences and extract key sentences, then inputs the key sentences into a word-level filter, where a BERT model fused with the PageRank algorithm mines deep interaction features between the texts and performs word-level noise filtering and fine-grained matching operations on the key sentences. Finally, the relation of the text pair is predicted from the spliced vector representations taken at different positions of the BERT output. The text matching method provided by the invention deletes the noise sentences and noise words in long texts and performs matching with the simplified information.

Compared with the prior art, the invention has the following positive effects: (1) compared with inputting the entire content of a long text into the model for training without any pruning, deleting noise sentences effectively shortens the text and removes useless information; (2) deleting noise words inside BERT lets the model focus on beneficial fine-grained matching signals, so the matching precision is higher; (3) combining the vector representations at different positions of the BERT output makes full use of the encoded semantic information of the two texts for the prediction task, so the matching accuracy is higher.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. A text matching method for long text, characterized by comprising the following steps:
inputting two texts to be matched into sentence-level filters, wherein the sentence-level filters respectively extract corresponding text key sentences for each text to be matched;
inputting the text key sentence into a word level filter, performing word level noise filtering and fine granularity matching operation by the word level filter, and outputting a text after filtering and matching;
outputting the position vectors corresponding to the two filtered and matched texts by using a BERT model, and inputting the position vectors into a 1-dimensional convolutional neural network for text semantic feature extraction, obtaining feature expression vectors that integrate the contexts of the two texts;

and splicing the feature expression vectors of the two texts with the feature expression vector of the CLS identifier in the BERT model, and inputting the result into a fully connected neural network to predict a similarity score, the similarity score serving as the basis for judging whether the texts match.
2. The long text-oriented text matching method of claim 1, wherein the step of extracting text key sentences by the sentence-level filter specifically comprises:
constructing a graph model through a TextRank algorithm;
the graph model captures the inter-sentence similarity inside the text and the inter-sentence similarity between the texts;
and extracting text key sentences according to the inter-sentence similarity in the text and the inter-sentence similarity between the texts.
3. The long text-oriented text matching method according to claim 2, wherein the step of constructing the graph model by TextRank algorithm specifically comprises:
text of will source
Figure FDA0004083944950000011
And target text->
Figure FDA0004083944950000012
All sentences input, the L 1 ,L 2 Total number of sentences in the source text and the target text, respectively, said +.>
Figure FDA0004083944950000013
Is each sentence in the source text, said +.>
Figure FDA0004083944950000014
Is each sentence in the target text, said d s Is a set of source text sentences, said d t Is a target text sentence collection;
combination d s D t Obtaining all sentence sets
Figure FDA0004083944950000021
And (3) taking sentences in the S as vertexes, taking the similarity among the sentences as the weight of the edge, and constructing a graph model.
4. The long text-oriented text matching method of claim 2,
the inter-sentence similarity is calculated as the proportion of the number of co-occurring words to the total number of words in the two sentences, the similarity $\mathrm{sim}(s_i, s_j)$ between sentence $s_i$ and sentence $s_j$ being given by formula (1):

$$\mathrm{sim}(s_i, s_j) = \frac{\left|\{w_k \mid w_k \in s_i \wedge w_k \in s_j\}\right|}{\log(|s_i|) + \log(|s_j|)} \qquad (1)$$

wherein $s_i$ and $s_j$ are the $i$-th and $j$-th sentences respectively, $\mathrm{sim}(s_i, s_j)$ is the similarity between them, and $w_k$ is a word occurring in both $s_i$ and $s_j$.
5. The long text-oriented text matching method according to claim 2, further comprising the step of, after the step of constructing the graph model by the TextRank algorithm:
scoring the sentences by the TextRank algorithm and extracting the text key sentences according to the scores, wherein the score $W(s_i)$ of sentence $s_i$ is obtained by iterating formula (2):

$$W(s_i) = (1-d) + d \sum_{s_j \in \mathrm{In}(s_i)} \frac{\mathrm{sim}(s_j, s_i)}{\sum_{s_k \in \mathrm{Out}(s_j)} \mathrm{sim}(s_j, s_k)} \, W(s_j) \qquad (2)$$

wherein $W(s_i)$ is the weight value of sentence $s_i$, $W(s_j)$ is the weight value of sentence $s_j$, $d$ is a damping coefficient representing the probability of pointing from a given node to any other node in the graph, typically set to 0.85, $s_i$, $s_j$ and $s_k$ are all sentences in the sentence set, $\mathrm{sim}(s_i, s_j)$ is the similarity between $s_i$ and $s_j$, and $\mathrm{sim}(s_j, s_k)$ is the similarity between $s_j$ and $s_k$.
6. The long text-oriented text matching method of claim 1, wherein the step of performing word-level noise filtering by the word-level filter specifically comprises:
the word level filter is based on a BERT model, performs word deletion strategies by fusing a PageRank algorithm and an Attention matrix, and screens and deletes word level noise information of a hidden layer.
7. A long text-oriented text matching system, the system comprising:
the sentence-level filter is used for receiving two text inputs to be matched and respectively extracting corresponding text key sentences for each text to be matched;
the word level filter is used for receiving text key sentence input, carrying out word level noise filtering and fine granularity matching operation on the text key sentence, and outputting a text after filtering and matching;
the vector acquisition module is used for outputting the position vectors corresponding to the two filtered and matched texts by using the BERT model and inputting them into a 1-dimensional convolutional neural network for text semantic feature extraction, obtaining feature expression vectors that integrate the contexts of the two texts; and

the similarity analysis module is used for splicing the feature expression vectors of the two texts with the feature expression vector of the CLS identifier in the BERT model and inputting the result into the fully connected neural network to predict a similarity score, the similarity score serving as the basis for judging whether the texts match.
8. The long text oriented text matching system of claim 7, wherein said sentence level filter comprises:
the diagram model construction module is used for constructing a diagram model through a TextRank algorithm;
the similarity capturing module is used for capturing the inter-sentence similarity in the texts and the inter-sentence similarity between the texts by the graph model; and
and the key sentence extraction module is used for extracting text key sentences according to the inter-sentence similarity in the text and the inter-sentence similarity between the texts.
9. The long text oriented text matching system of claim 8, wherein said graph model building module specifically comprises:
sentence transmissionAn in module for inputting source text
Figure FDA0004083944950000031
And target text->
Figure FDA0004083944950000032
All sentences are input; the L is 1 ,L 2 Total number of sentences in the source text and the target text, respectively, said +.>
Figure FDA0004083944950000033
Is each sentence in the source text, said +.>
Figure FDA0004083944950000034
Is each sentence in the target text, said d s Is a set of source text sentences, said d t Is a target text sentence collection;
sentence combination module for combining d s D t Obtaining all sentence sets
Figure FDA0004083944950000041
and
And the graph model generation module is used for constructing a graph model by taking sentences in the S as vertexes and the similarity among the sentences as the weight of the edge.
CN202310131234.9A 2023-02-17 2023-02-17 Text matching method and system for long text Pending CN116306667A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310131234.9A 2023-02-17 2023-02-17 Text matching method and system for long text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310131234.9A 2023-02-17 2023-02-17 Text matching method and system for long text

Publications (1)

Publication Number Publication Date
CN116306667A true CN116306667A (en) 2023-06-23

Family

ID=86782621

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310131234.9A Pending CN116306667A (en) 2023-02-17 2023-02-17 Text matching method and system for long text

Country Status (1)

Country Link
CN (1) CN116306667A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117194614A (en) * 2023-11-02 2023-12-08 北京中电普华信息技术有限公司 Text difference recognition method, device and computer readable medium
CN117194614B (en) * 2023-11-02 2024-01-30 北京中电普华信息技术有限公司 Text difference recognition method, device and computer readable medium

Similar Documents

Publication Publication Date Title
CN109710769A A kind of internet "water army" comment detection system and method based on capsule network
CN109726293A (en) A kind of causal event map construction method, system, device and storage medium
CN113392986B (en) Highway bridge information extraction method based on big data and management maintenance system
CN105427869A (en) Session emotion autoanalysis method based on depth learning
CN108932278B (en) Man-machine conversation method and system based on semantic framework
CN109543764B (en) Early warning information validity detection method and detection system based on intelligent semantic perception
CN112434532B (en) Power grid environment model supporting man-machine bidirectional understanding and modeling method
WO2023108991A1 (en) Model training method and apparatus, knowledge classification method and apparatus, and device and medium
CN113312530B (en) Multi-mode emotion classification method taking text as core
CN103942191A (en) Horrific text recognizing method based on content
CN104572614A (en) Training method and system for language model
CN116306667A (en) Text matching method and system for long text
CN112883286A (en) BERT-based method, equipment and medium for analyzing microblog emotion of new coronary pneumonia epidemic situation
CN109871449A (en) A kind of zero sample learning method end to end based on semantic description
CN109614612A (en) A kind of Chinese text error correction method based on seq2seq+attention
CN110888989A (en) Intelligent learning platform and construction method thereof
CN113656564A (en) Power grid service dialogue data emotion detection method based on graph neural network
CN112528642A (en) Implicit discourse relation automatic identification method and system
CN106021413A (en) Theme model based self-extendable type feature selecting method and system
Qiu et al. NeuroSPE: A neuro‐net spatial relation extractor for natural language text fusing gazetteers and pretrained models
Zhao et al. Tibetan multi-dialect speech recognition using latent regression Bayesian network and end-to-end mode
Zhang et al. Research on spectrum sensing system based on composite neural network
CN117252255A (en) Disaster emergency knowledge graph construction method oriented to auxiliary decision
CN115983383A (en) Entity relationship extraction method and related device for power equipment
CN106844448A (en) A kind of recognition methods of Chinese event fact and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination