CN116306667A - Text matching method and system for long text - Google Patents

Text matching method and system for long text

Info

Publication number
CN116306667A
CN116306667A (Application CN202310131234.9A)
Authority
CN
China
Prior art keywords
text
sentence
sentences
similarity
texts
Prior art date
Legal status
Pending
Application number
CN202310131234.9A
Other languages
Chinese (zh)
Inventor
彭程
王佳睿
谢季
刘峰荣
余鸿
任思远
何智毅
陈科
Current Assignee
Chengdu Zhongke Information Technology Co ltd
Chengdu Information Technology Co Ltd of CAS
Original Assignee
Chengdu Zhongke Information Technology Co ltd
Chengdu Information Technology Co Ltd of CAS
Priority date
Filing date
Publication date
Application filed by Chengdu Zhongke Information Technology Co ltd, Chengdu Information Technology Co Ltd of CAS filed Critical Chengdu Zhongke Information Technology Co ltd
Priority to CN202310131234.9A priority Critical patent/CN116306667A/en
Publication of CN116306667A publication Critical patent/CN116306667A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text matching method and system for long texts. According to the invention, the texts to be matched are first input into a sentence-level filter, which removes noise sentences and extracts key sentences; the key sentences are then input into a word-level filter, where a BERT model fused with the PageRank algorithm mines deep interaction features between the texts and performs word-level noise filtering and fine-grained matching operations on the key sentences. Finally, the relation of the text pair is predicted from the spliced vector representations taken at different positions of the BERT output. The invention has the following positive effects: (1) compared with inputting the entire content of a long text into the model for training without any pruning, deleting noise sentences effectively shortens the text and removes useless information; (2) deleting noise words inside BERT lets the model focus on beneficial fine-grained matching signals, so the matching precision is higher; (3) combining the vector representations at different positions of the BERT output makes full use of the encoded semantic information of the two texts for the prediction task, so the matching accuracy is higher.

Description

Text matching method and system for long text
Technical Field
The invention relates to the technical field of computers, in particular to a text matching method and system for long texts.
Background
Text matching is a key task in natural language processing applications such as community question answering, information retrieval and dialogue systems; it aims to analyze and judge the semantic association between a source text and a target text. Long text matching is an important sub-direction of the text matching field: it can rapidly judge the relationship between two documents and identify whether their topics are expressed similarly, and it therefore has great research and application value.
Text matching models follow two lines of approach: traditional models and deep models. The traditional approach measures the degree of text matching by manually defining and extracting features. It suffers from the high cost of manual feature extraction, incomplete feature coverage and similar problems. Moreover, it is essentially a surface-level matching method and cannot accomplish deeper semantic matching tasks.
The deep-model approach encodes the texts using the strong language representation capability of deep neural networks, mines deep semantic information of the texts, and performs the matching operation in semantic space. This approach achieves higher accuracy without manually designed features. However, most current deep models are designed for short texts (i.e., short-text deep matching models), while fine-grained matching signals between long texts are typically sparse. When a short-text deep matching model is used to match long texts, it is difficult to identify the matching signals among a large number of noise signals, so the matching effect is unsatisfactory.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention provides a text matching method and a system for long texts.
The technical scheme adopted for solving the technical problems is as follows: a text matching method for long text, the text matching method comprising the steps of:
inputting two texts to be matched into sentence-level filters, wherein the sentence-level filters respectively extract corresponding text key sentences for each text to be matched;
inputting the text key sentence into a word level filter, performing word level noise filtering and fine granularity matching operation by the word level filter, and outputting a text after filtering and matching;
outputting the position vectors corresponding to the two filtered and matched texts by using a BERT model, and inputting the position vectors into a 1-dimensional convolutional neural network for text semantic feature extraction, obtaining feature expression vectors that integrate the contexts of the two texts;

and splicing the feature expression vectors of the two texts with the feature expression vector of the CLS identifier in the BERT model, and inputting the result into a fully connected neural network to predict a similarity score, the similarity score serving as the basis for judging whether the texts match.
Further, the step of extracting text key sentences by the sentence-level filter specifically includes:
constructing a graph model through a TextRank algorithm;
the graph model captures the inter-sentence similarity inside the text and the inter-sentence similarity between the texts;
and extracting text key sentences according to the inter-sentence similarity in the text and the inter-sentence similarity between the texts.
Further, the step of constructing the graph model through the TextRank algorithm specifically includes:
text of will source
Figure BDA0004083944960000021
And target text->
Figure BDA0004083944960000022
All sentences input, the L 1 ,L 2 Total number of sentences in the source text and the target text, respectively, said +.>
Figure BDA0004083944960000023
Is each sentence in the source text, said +.>
Figure BDA0004083944960000024
Is each sentence in the target text, said d s Is a set of source text sentences, said d t Is a target text sentence collection;
combination d s D t Obtaining all sentence sets
Figure BDA0004083944960000031
And (3) taking sentences in the S as vertexes, taking the similarity among the sentences as the weight of the edge, and constructing a graph model.
Further, the inter-sentence similarity is calculated as the proportion of the number of co-occurring words to the total number of words in the two sentences; the similarity $\mathrm{sim}(s_i, s_j)$ between sentence $s_i$ and sentence $s_j$ is given by formula (1):

$$\mathrm{sim}(s_i, s_j) = \frac{\left|\{w_k \mid w_k \in s_i \wedge w_k \in s_j\}\right|}{\log(|s_i|) + \log(|s_j|)} \qquad (1)$$

wherein $s_i$ and $s_j$ are the $i$-th and $j$-th sentences respectively, $\mathrm{sim}(s_i, s_j)$ is the similarity between them, and $w_k$ is a word occurring in both $s_i$ and $s_j$.
Further, after the step of constructing the graph model by the TextRank algorithm, the method further comprises the steps of:
scoring the sentences by the TextRank algorithm and extracting the text key sentences according to the scores, wherein the score $W(s_i)$ of sentence $s_i$ is obtained by iterating formula (2):

$$W(s_i) = (1-d) + d \sum_{s_j \in \mathrm{In}(s_i)} \frac{\mathrm{sim}(s_j, s_i)}{\sum_{s_k \in \mathrm{Out}(s_j)} \mathrm{sim}(s_j, s_k)} \, W(s_j) \qquad (2)$$

wherein $W(s_i)$ is the weight value of sentence $s_i$, $W(s_j)$ is the weight value of sentence $s_j$, $d$ is a damping coefficient representing the probability of pointing from a given node to any other node in the graph, typically set to 0.85, $s_i$, $s_j$ and $s_k$ are all sentences in the sentence set, $\mathrm{sim}(s_i, s_j)$ is the similarity between $s_i$ and $s_j$, and $\mathrm{sim}(s_j, s_k)$ is the similarity between $s_j$ and $s_k$.
Further, the step of filtering the word-level noise by the word-level filter specifically includes:
the word level filter is based on a BERT model, performs word deletion strategies by fusing a PageRank algorithm and an Attention matrix, and screens and deletes word level noise information of a hidden layer.
It is another object of the present invention to provide a text matching system for long text, the system comprising:
the sentence-level filter is used for receiving two text inputs to be matched and respectively extracting corresponding text key sentences for each text to be matched;
the word level filter is used for receiving text key sentence input, carrying out word level noise filtering and fine granularity matching operation on the text key sentence, and outputting a text after filtering and matching;
the vector acquisition module is used for outputting the position vectors corresponding to the two filtered and matched texts by using the BERT model and inputting them into a 1-dimensional convolutional neural network for text semantic feature extraction, obtaining feature expression vectors that integrate the contexts of the two texts; and

the similarity analysis module is used for splicing the feature expression vectors of the two texts with the feature expression vector of the CLS identifier in the BERT model and inputting the result into the fully connected neural network to predict a similarity score, the similarity score serving as the basis for judging whether the texts match.
Further, the sentence-level filter includes:
the diagram model construction module is used for constructing a diagram model through a TextRank algorithm;
the similarity capturing module is used for capturing the inter-sentence similarity in the texts and the inter-sentence similarity between the texts by the graph model; and
and the key sentence extraction module is used for extracting text key sentences according to the inter-sentence similarity in the text and the inter-sentence similarity between the texts.
Further, the graph model construction module specifically includes:
a sentence input module for inputting all sentences of the source text $d_s = \{s_1^s, s_2^s, \ldots, s_{L_1}^s\}$ and the target text $d_t = \{s_1^t, s_2^t, \ldots, s_{L_2}^t\}$, wherein $L_1$ and $L_2$ are the total numbers of sentences in the source text and the target text respectively, $s_i^s$ is each sentence in the source text, $s_j^t$ is each sentence in the target text, $d_s$ is the set of source text sentences, and $d_t$ is the set of target text sentences;

a sentence combination module for combining $d_s$ and $d_t$ to obtain the set of all sentences $S = d_s \cup d_t = \{s_1, s_2, \ldots, s_{L_1+L_2}\}$; and

a graph model generation module for constructing the graph model by taking the sentences in $S$ as vertices and the inter-sentence similarity as the weights of the edges.
According to the invention, the texts to be matched are input into a sentence-level filter to remove noise sentences and extract key sentences; the key sentences are then input into a word-level filter, where a BERT model fused with the PageRank algorithm mines deep interaction features between the texts and performs word-level noise filtering and fine-grained matching operations on the key sentences. Finally, the relation of the text pair is predicted from the spliced vector representations taken at different positions of the BERT output. The text matching method provided by the invention deletes the noise sentences and noise words in long texts and performs matching with the simplified information. Compared with the prior art, the invention has the following positive effects: (1) compared with inputting the entire content of a long text into the model for training without any pruning, deleting noise sentences effectively shortens the text and removes useless information; (2) deleting noise words inside BERT lets the model focus on beneficial fine-grained matching signals, so the matching precision is higher; (3) combining the vector representations at different positions of the BERT output makes full use of the encoded semantic information of the two texts for the prediction task, so the matching accuracy is higher.
Drawings
The invention will now be described by way of example and with reference to the accompanying drawings in which:
FIG. 1 is a flowchart of a text matching method for long text provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of a text matching method according to an embodiment of the present invention;
fig. 3 is a block diagram of a text matching system for long text according to an embodiment of the present invention.
Detailed Description
The invention will now be described in further detail with reference to specific examples thereof in connection with the accompanying drawings.
Referring to fig. 1 and fig. 2, fig. 1 shows a flow of a text matching method for long text provided in an embodiment of the present invention, and details are as follows:
in step S101, two texts to be matched are input into sentence-level filters, which extract corresponding text key sentences for each text to be matched, respectively.
As an embodiment of the present invention, the step of extracting text key sentences by the sentence-level filter specifically includes:
constructing a graph model through a TextRank algorithm;
the graph model captures the inter-sentence similarity inside the text and the inter-sentence similarity between the texts;
and extracting text key sentences according to the inter-sentence similarity in the text and the inter-sentence similarity between the texts.
As an embodiment of the present invention, the step of constructing the graph model by the TextRank algorithm specifically includes:
two texts are combined
Figure BDA0004083944960000061
And->
Figure BDA0004083944960000062
All sentences are input; the L is 1 ,L 2 Total number of sentences in the source text and the target text, respectively, said +.>
Figure BDA0004083944960000063
Is each sentence in the source text, said
Figure BDA0004083944960000064
Is each sentence in the target text, said d s Is a set of source text sentences, said d t Is a target text sentence collection;
combination d s D t Obtaining the obtainedWith collection of sentences
Figure BDA0004083944960000065
And (3) taking sentences in the S as vertexes, taking the similarity among the sentences as the weight of the edge, and constructing a graph model.
The inter-sentence similarity (including the sentence similarity inside a text and the sentence similarity between the texts) is calculated as the proportion of the number of co-occurring words to the total number of words in the two sentences. The similarity $\mathrm{sim}(s_i, s_j)$ between sentences $s_i$ and $s_j$ is given by formula (1):

$$\mathrm{sim}(s_i, s_j) = \frac{\left|\{w_k \mid w_k \in s_i \wedge w_k \in s_j\}\right|}{\log(|s_i|) + \log(|s_j|)} \qquad (1)$$

wherein $s_i$ and $s_j$ are the $i$-th and $j$-th sentences respectively, $\mathrm{sim}(s_i, s_j)$ is the similarity between them, and $w_k$ is a word occurring in both $s_i$ and $s_j$.
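For illustration only (this sketch is not part of the original disclosure), the following Python code computes formula (1) for every sentence pair and assembles the edge-weight matrix of the graph model over S; pre-tokenized sentences and the example data are assumptions made for brevity, and an embodiment for Chinese text would rely on a word segmenter.

```python
import math
from itertools import combinations

def sim(s_i: list[str], s_j: list[str]) -> float:
    """Formula (1): count of co-occurring words over log sentence lengths."""
    co_occur = len(set(s_i) & set(s_j))
    denom = math.log(len(s_i)) + math.log(len(s_j))
    return co_occur / denom if denom > 0 else 0.0

def build_graph(sentences: list[list[str]]) -> list[list[float]]:
    """Vertices are the sentences of S = d_s ∪ d_t; edge weights are sim()."""
    n = len(sentences)
    w = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        w[i][j] = w[j][i] = sim(sentences[i], sentences[j])
    return w

# Toy S built from one source sentence and two target sentences.
S = [["subway", "enters", "combat", "state"],
     ["typhoon", "affects", "subway", "operation"],
     ["subway", "suspends", "operation", "for", "typhoon"]]
weights = build_graph(S)
```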
After the graph model is constructed, the method further comprises the following step: the sentences are scored with the TextRank algorithm and the text key sentences are extracted according to the scores, wherein the score $W(s_i)$ of sentence $s_i$ is obtained by iterating formula (2):

$$W(s_i) = (1-d) + d \sum_{s_j \in \mathrm{In}(s_i)} \frac{\mathrm{sim}(s_j, s_i)}{\sum_{s_k \in \mathrm{Out}(s_j)} \mathrm{sim}(s_j, s_k)} \, W(s_j) \qquad (2)$$

wherein $W(s_i)$ is the weight value of sentence $s_i$, $W(s_j)$ is the weight value of sentence $s_j$, $d$ is a damping coefficient representing the probability of pointing from a given node to any other node in the graph, typically set to 0.85, $s_i$, $s_j$ and $s_k$ are all sentences in the sentence set, $\mathrm{sim}(s_i, s_j)$ is the similarity between $s_i$ and $s_j$, and $\mathrm{sim}(s_j, s_k)$ is the similarity between $s_j$ and $s_k$.
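Continuing the sketch, formula (2) can be iterated to convergence as below; the convergence threshold, the iteration cap and the uniform initial scores are illustrative choices rather than values specified by the patent. Since the similarity graph is undirected, In(s_i) and Out(s_j) both reduce to a vertex's neighbours.

```python
def textrank_scores(w, d=0.85, eps=1e-6, max_iter=100):
    """Iterate formula (2) over the sentence graph until the scores converge."""
    n = len(w)
    out_sum = [sum(row) for row in w]   # sum_k sim(s_j, s_k) for each vertex j
    scores = [1.0] * n
    for _ in range(max_iter):
        new = [(1 - d) + d * sum(w[j][i] / out_sum[j] * scores[j]
                                 for j in range(n)
                                 if j != i and out_sum[j] > 0)
               for i in range(n)]
        converged = max(abs(a - b) for a, b in zip(new, scores)) < eps
        scores = new
        if converged:
            break
    return scores

# Usage (with build_graph from the previous sketch):
#   scores = textrank_scores(build_graph(S))
# The key sentences are the top-ranked members of S coming from each text.
```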
In step S102, the text key sentence is input into a word-level filter, and the word-level filter performs word-level noise filtering and fine granularity matching operations, and outputs a text after filtering and matching.
The step of filtering the word level noise by the word level filter specifically comprises the following steps:
the word level filter is based on a BERT model, a PageRank algorithm and an Attention matrix are fused to execute a word deletion strategy, and word level noise information of a hidden layer is screened and deleted, so that a text after filtering and matching is obtained.
As an embodiment of the present invention, the word-level filter works as follows: the text key sentences are input into the BERT model to mine fine-grained interaction semantic information between the key sentences of the two texts, and the hidden-layer word nodes of BERT are scored using an attention matrix $A$ and the PageRank algorithm. The score consists of two parts. For the first part, a graph model is first built over the BERT hidden-layer nodes and the attention matrix $A$ is regarded as the adjacency matrix in PageRank; the PageRank algorithm is then iterated, and the node importance value $u$ is obtained after convergence, the $t$-th iteration being given by formula (3):

$$u^{t+1} = d\,(A^{l})^{\top} u^{t} + \frac{1-d}{N}\cdot\mathbf{1} \qquad (3)$$

wherein $u^{t}$ is the node weight after the $t$-th iteration, $A^{l}$ is the attention matrix of the $l$-th layer, $d$ is a damping coefficient, $N$ is the number of nodes in the graph, and $\mathbf{1}$ is the all-ones vector.

The attention matrix is then taken as the weight matrix $A$ and multiplied by $u$ to obtain the first-part score $R = Au$. For the second part, the vector $P$ obtained by summing the matrix $A$ over its columns is regarded directly as the initial importance score of the word nodes; with the attention matrix again taken as the weight matrix, multiplication gives the score under this initialization, $R^{*} = AP$. The two scores are linearly combined to obtain the final score $R_{final} = \alpha R^{*} + (1-\alpha)R$, and the hidden nodes with lower $R_{final}$ scores are deleted on this basis.
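The two-part word-node scoring can be sketched with NumPy as follows, under the assumption that A is a single layer's token-to-token attention matrix already averaged over heads; the values of d, alpha and the iteration count are illustrative, and the function name is hypothetical.

```python
import numpy as np

def word_node_scores(A: np.ndarray, d: float = 0.85, alpha: float = 0.5,
                     n_iter: int = 30) -> np.ndarray:
    """Score BERT hidden-layer word nodes via formula (3) plus R* = A P."""
    N = A.shape[0]
    # Part 1: PageRank iteration u_{t+1} = d (A^l)^T u_t + (1 - d)/N * 1.
    u = np.full(N, 1.0 / N)
    for _ in range(n_iter):
        u = d * A.T @ u + (1 - d) / N
    R = A @ u                      # first-part score R = A u
    # Part 2: column sums of A as the initial node importance P, then R* = A P.
    P = A.sum(axis=0)
    R_star = A @ P
    return alpha * R_star + (1 - alpha) * R    # R_final

# The lowest-scoring hidden nodes are treated as word-level noise and deleted,
# e.g. keep = np.argsort(word_node_scores(A))[-k:] for a retention budget k.
```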
In step S103, a BERT model is used to output two position vectors corresponding to the text after filtering and matching, the position vectors are input into a 1-dimensional convolutional neural network to perform text semantic feature extraction, and feature expression vectors of comprehensive contexts of the two texts are obtained.
In the output layer of the BERT model, the position vectors of each text are obtained by taking the position of the SEP identifier as the separation point: $H_s = [h_1, h_2, \ldots, h_{SEP-1}]$ and $H_t = [h_{SEP+1}, h_{SEP+2}, \ldots, h_{N}]$. The two are respectively input into a 1-dimensional convolutional neural network to model the text context, obtaining the feature expression vectors of the two texts, $H_s = \mathrm{Conv}(H_s)$ and $H_t = \mathrm{Conv}(H_t)$.
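A PyTorch sketch of this split-and-convolve step is given below; the kernel size, padding and max-pooling are assumptions, since the patent specifies only a 1-dimensional convolution over each text's position vectors, and sharing one convolution for both texts is one possible design choice.

```python
import torch
import torch.nn as nn

class ContextConv(nn.Module):
    """1-D convolution over one text's token vectors, pooled to one vector."""
    def __init__(self, hidden: int = 768, kernel: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, hidden); Conv1d expects (batch, hidden, seq_len).
        return self.conv(h.transpose(1, 2)).amax(dim=2)

# Splitting the BERT output at the [SEP] index `sep` (position 0 is [CLS]):
hidden = torch.randn(1, 32, 768)        # stand-in for BERT's output layer
sep = 16
conv = ContextConv()
H_s = conv(hidden[:, 1:sep, :])         # H_s = Conv(h_1 .. h_{SEP-1})
H_t = conv(hidden[:, sep + 1:, :])      # H_t = Conv(h_{SEP+1} .. h_N)
```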
In step S104, the feature expression vectors of the two texts and the feature expression vector of the CLS identifier in the BERT model are spliced and input into the fully connected neural network to predict a similarity score, and the similarity score is used as the basis for judging whether the texts match.
The feature representation of the CLS identifier in the BERT model is denoted $H_{CLS}$; it is spliced with the feature representations $H_s$ and $H_t$ of the two texts and input into the two-layer fully connected neural network to predict the similarity score:

$$score = \mathrm{sigmoid}\big(FC_2\big(FC_1([H_{CLS}; H_s; H_t])\big)\big) \qquad (4)$$

wherein $FC_1$ and $FC_2$ are the two fully connected layers and sigmoid is the activation function.
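Formula (4) then reduces to a small feed-forward head over the spliced vectors, sketched below; the hidden width and the intermediate ReLU are assumptions (the patent names only the two fully connected layers and the sigmoid), the ReLU being added because two stacked linear layers without a nonlinearity would collapse into a single linear map.

```python
import torch
import torch.nn as nn

class MatchHead(nn.Module):
    """Formula (4): score = sigmoid(FC2(FC1([H_CLS; H_s; H_t])))."""
    def __init__(self, hidden: int = 768, mid: int = 256):
        super().__init__()
        self.fc1 = nn.Linear(3 * hidden, mid)   # FC_1 over the spliced vector
        self.fc2 = nn.Linear(mid, 1)            # FC_2 down to a scalar score

    def forward(self, h_cls, h_s, h_t):
        x = torch.cat([h_cls, h_s, h_t], dim=-1)        # [H_CLS; H_s; H_t]
        return torch.sigmoid(self.fc2(torch.relu(self.fc1(x))))

# Stand-in vectors for H_CLS, H_s and H_t; score > 0.5 reads as "matched".
score = MatchHead()(torch.randn(1, 768), torch.randn(1, 768), torch.randn(1, 768))
```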
With the above method and system, the following source text and target text are input, and through the text matching method, the output similarity is 1 (i.e., the source text is similar to the target text).
Source text:
for 9 months and 14 days, as the plum blossom typhoons move in the northwest direction, the main body of the plum blossom typhoons begins to influence Shanghai. The typhoon plum blossom type subway comprehensive entry combat state is adopted, early warning is carried out in advance, and quick response is carried out. And the station workers are separated into several paths by the stations of the Shanghai subway 3 and No. 4 line Shanghai railway station at noon of 12, and the conditions of flood control plate inspection, flood control station box material checking inspection, water leakage point position confirmation and the like are respectively carried out. In addition, since the stations of the No. 3 and No. 4 offshore railway stations are open-air stations, station staff can move the stations upwards to carry out certain fixing measures, such as garbage cans and the like. The offshore subway indicates that if the wind reaches level 9, a shutdown scheme of the ground overhead line will be adopted. The reporter just gets from the Shanghai subway, in order to cope with typhoon 'plum blossom', measures such as line shrinkage or shutdown are taken on the ground of the Shanghai subway and on an overhead line in the tonight 21, so that the travel of citizens passengers is guaranteed, and the passengers are requested to travel in advance. If typhoon paths or wind power change, overhead ground lines can also limit speed in advance, shrink lines, stop operation and end operation.
Target text:
the typhoon plum blossom type subway comprehensive disaster prevention system is capable of comprehensively entering a temporary combat state, early warning and quick response are performed in advance, train shutdown and passenger evacuation work are performed in time once dangerous situations occur, and the influence and loss of secondary disasters are reduced as much as possible. According to the current typhoon trend, when the current typhoon trend is 21, the subway ground in the Shanghai and the overhead line take measures such as line shrinkage or shutdown, so that the travel of citizens is ensured, and passengers are requested to travel in advance. If typhoon paths or wind power change, overhead ground lines can also limit speed in advance, shrink lines, stop operation and end operation. The passengers of the citizen are reminded that the lines such as the line 3, the line 5, the line 16, the line 17, the line of Pujiang, the magnetic levitation line and the like are stopped at the moment, and the line parts such as the line 1, the line 2, the line 4, the line 6, the line 7, the line 8, the line 9, the line 10 and the line 11 are stopped. In addition, the Shanghai subway adjusts the tomorrow operation plan in real time according to the influence degree of wind power, the operation time of the first class vehicles of each line in the tomorrow is possibly delayed, the train operation period is possibly provided with speed limiting measures, and the operation interval is prolonged. Specifically, passengers pay attention to real-time information pushed by official microblogs of Shanghai subway shmetro, app of Metro general, and the like, and travel paths are timely adjusted.
Whether similar (1 similar/0 dissimilar): 1
Referring to fig. 3, a structure of a text matching system for long text provided by an embodiment of the present invention is shown, where the system includes: sentence-level filter 31, word-level filter 32, vector acquisition module 33, and similarity analysis module 34.
The sentence-level filter 31 receives the two texts to be matched as input and extracts the corresponding text key sentences for each text to be matched; the word-level filter 32 receives the text key sentences as input, performs word-level noise filtering and fine-grained matching operations on them, and outputs the filtered and matched text; the vector acquisition module 33 outputs the position vectors corresponding to the two filtered and matched texts using the BERT model and inputs them into a 1-dimensional convolutional neural network for text semantic feature extraction, obtaining feature expression vectors that integrate the contexts of the two texts; and the similarity analysis module 34 splices the feature expression vectors of the two texts with the feature expression vector of the CLS identifier in the BERT model and inputs the result into the fully connected neural network to predict a similarity score, which serves as the basis for judging whether the texts match.
As an embodiment of the invention, the sentence-level filter 31 includes: the graph model building module 311, the similarity capturing module 312 and the key sentence extracting module 313.
The graph model construction module 311 constructs a graph model through a TextRank algorithm; the similarity capture module 312 captures inter-sentence similarity inside text and inter-sentence similarity between text from the graph model; and the key sentence extraction module 313 extracts text key sentences according to the inter-sentence similarity inside the text and the inter-sentence similarity between the texts.
The graph model construction module 311 specifically includes: sentence input module 3111, sentence combination module 3112, and graphic model generation module 3113.
The sentence input module 3111 inputs all sentences of the source text $d_s = \{s_1^s, s_2^s, \ldots, s_{L_1}^s\}$ and the target text $d_t = \{s_1^t, s_2^t, \ldots, s_{L_2}^t\}$, wherein $L_1$ and $L_2$ are the total numbers of sentences in the source text and the target text respectively, $s_i^s$ is each sentence in the source text, $s_j^t$ is each sentence in the target text, $d_s$ is the set of source text sentences, and $d_t$ is the set of target text sentences.

The sentence combination module 3112 combines $d_s$ and $d_t$ to obtain the set of all sentences $S = d_s \cup d_t = \{s_1, s_2, \ldots, s_{L_1+L_2}\}$.

The graph model generation module 3113 constructs the graph model with the sentences in $S$ as vertices and the inter-sentence similarity as the weights of the edges.
The word-level noise filtering performed by the word-level filter 32 is specifically as follows: based on the BERT model, the word-level filter 32 performs a word deletion strategy by fusing the PageRank algorithm with the attention matrix, screening and deleting the word-level noise information of the hidden layers.
In summary, the invention inputs the texts to be matched into a sentence-level filter to remove noise sentences and extract key sentences, then inputs the key sentences into a word-level filter, where a BERT model fused with the PageRank algorithm mines deep interaction features between the texts and performs word-level noise filtering and fine-grained matching operations on the key sentences. Finally, the relation of the text pair is predicted from the spliced vector representations taken at different positions of the BERT output. The text matching method provided by the invention deletes the noise sentences and noise words in long texts and performs matching with the simplified information.

Compared with the prior art, the invention has the following positive effects: (1) compared with inputting the entire content of a long text into the model for training without any pruning, deleting noise sentences effectively shortens the text and removes useless information; (2) deleting noise words inside BERT lets the model focus on beneficial fine-grained matching signals, so the matching precision is higher; (3) combining the vector representations at different positions of the BERT output makes full use of the encoded semantic information of the two texts for the prediction task, so the matching accuracy is higher.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. A text matching method for long text, characterized by comprising the following steps:
inputting two texts to be matched into sentence-level filters, wherein the sentence-level filters respectively extract corresponding text key sentences for each text to be matched;
inputting the text key sentence into a word level filter, performing word level noise filtering and fine granularity matching operation by the word level filter, and outputting a text after filtering and matching;
outputting the position vectors corresponding to the two filtered and matched texts by using a BERT model, and inputting the position vectors into a 1-dimensional convolutional neural network for text semantic feature extraction, obtaining feature expression vectors that integrate the contexts of the two texts;

and splicing the feature expression vectors of the two texts with the feature expression vector of the CLS identifier in the BERT model, and inputting the result into a fully connected neural network to predict a similarity score, the similarity score serving as the basis for judging whether the texts match.
2. The long text-oriented text matching method of claim 1, wherein the step of extracting text key sentences by the sentence-level filter specifically comprises:
constructing a graph model through a TextRank algorithm;
the graph model captures the inter-sentence similarity inside the text and the inter-sentence similarity between the texts;
and extracting text key sentences according to the inter-sentence similarity in the text and the inter-sentence similarity between the texts.
3. The long text-oriented text matching method according to claim 2, wherein the step of constructing the graph model by TextRank algorithm specifically comprises:
text of will source
Figure FDA0004083944950000011
And target text->
Figure FDA0004083944950000012
All sentences input, the L 1 ,L 2 Total number of sentences in the source text and the target text, respectively, said +.>
Figure FDA0004083944950000013
Is each sentence in the source text, said +.>
Figure FDA0004083944950000014
Is each sentence in the target text, said d s Is a set of source text sentences, said d t Is a target text sentence collection;
combination d s D t Obtaining all sentence sets
Figure FDA0004083944950000021
And (3) taking sentences in the S as vertexes, taking the similarity among the sentences as the weight of the edge, and constructing a graph model.
4. The long text-oriented text matching method of claim 2,
the inter-sentence similarity is calculated as the proportion of the number of co-occurring words to the total number of words in the two sentences, the similarity $\mathrm{sim}(s_i, s_j)$ between sentence $s_i$ and sentence $s_j$ being given by formula (1):

$$\mathrm{sim}(s_i, s_j) = \frac{\left|\{w_k \mid w_k \in s_i \wedge w_k \in s_j\}\right|}{\log(|s_i|) + \log(|s_j|)} \qquad (1)$$

wherein $s_i$ and $s_j$ are the $i$-th and $j$-th sentences respectively, $\mathrm{sim}(s_i, s_j)$ is the similarity between them, and $w_k$ is a word occurring in both $s_i$ and $s_j$.
5. The long text-oriented text matching method according to claim 2, further comprising the step of, after the step of constructing the graph model by the TextRank algorithm:
scoring the sentences by the TextRank algorithm and extracting the text key sentences according to the scores, wherein the score $W(s_i)$ of sentence $s_i$ is obtained by iterating formula (2):

$$W(s_i) = (1-d) + d \sum_{s_j \in \mathrm{In}(s_i)} \frac{\mathrm{sim}(s_j, s_i)}{\sum_{s_k \in \mathrm{Out}(s_j)} \mathrm{sim}(s_j, s_k)} \, W(s_j) \qquad (2)$$

wherein $W(s_i)$ is the weight value of sentence $s_i$, $W(s_j)$ is the weight value of sentence $s_j$, $d$ is a damping coefficient representing the probability of pointing from a given node to any other node in the graph, typically set to 0.85, $s_i$, $s_j$ and $s_k$ are all sentences in the sentence set, $\mathrm{sim}(s_i, s_j)$ is the similarity between $s_i$ and $s_j$, and $\mathrm{sim}(s_j, s_k)$ is the similarity between $s_j$ and $s_k$.
6. The long text-oriented text matching method of claim 1, wherein the step of performing word-level noise filtering by the word-level filter specifically comprises:
the word level filter is based on a BERT model, performs word deletion strategies by fusing a PageRank algorithm and an Attention matrix, and screens and deletes word level noise information of a hidden layer.
7. A long text-oriented text matching system, the system comprising:
the sentence-level filter is used for receiving two text inputs to be matched and respectively extracting corresponding text key sentences for each text to be matched;
the word level filter is used for receiving text key sentence input, carrying out word level noise filtering and fine granularity matching operation on the text key sentence, and outputting a text after filtering and matching;
the vector acquisition module is used for outputting the position vectors corresponding to the two filtered and matched texts by using the BERT model and inputting them into a 1-dimensional convolutional neural network for text semantic feature extraction, obtaining feature expression vectors that integrate the contexts of the two texts; and

the similarity analysis module is used for splicing the feature expression vectors of the two texts with the feature expression vector of the CLS identifier in the BERT model and inputting the result into the fully connected neural network to predict a similarity score, the similarity score serving as the basis for judging whether the texts match.
8. The long text oriented text matching system of claim 7, wherein said sentence level filter comprises:
the diagram model construction module is used for constructing a diagram model through a TextRank algorithm;
the similarity capturing module is used for capturing the inter-sentence similarity in the texts and the inter-sentence similarity between the texts by the graph model; and
and the key sentence extraction module is used for extracting text key sentences according to the inter-sentence similarity in the text and the inter-sentence similarity between the texts.
9. The long text oriented text matching system of claim 8, wherein said graph model building module specifically comprises:
sentence transmissionAn in module for inputting source text
Figure FDA0004083944950000031
And target text->
Figure FDA0004083944950000032
All sentences are input; the L is 1 ,L 2 Total number of sentences in the source text and the target text, respectively, said +.>
Figure FDA0004083944950000033
Is each sentence in the source text, said +.>
Figure FDA0004083944950000034
Is each sentence in the target text, said d s Is a set of source text sentences, said d t Is a target text sentence collection;
sentence combination module for combining d s D t Obtaining all sentence sets
Figure FDA0004083944950000041
and
And the graph model generation module is used for constructing a graph model by taking sentences in the S as vertexes and the similarity among the sentences as the weight of the edge.
CN202310131234.9A 2023-02-17 2023-02-17 Text matching method and system for long text Pending CN116306667A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310131234.9A 2023-02-17 2023-02-17 Text matching method and system for long text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310131234.9A 2023-02-17 2023-02-17 Text matching method and system for long text

Publications (1)

Publication Number Publication Date
CN116306667A true CN116306667A (en) 2023-06-23

Family

ID=86782621

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310131234.9A Pending CN116306667A (en) 2023-02-17 2023-02-17 Text matching method and system for long text

Country Status (1)

Country Link
CN (1) CN116306667A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117194614A (en) * 2023-11-02 2023-12-08 北京中电普华信息技术有限公司 Text difference recognition method, device and computer readable medium
CN117194614B (en) * 2023-11-02 2024-01-30 北京中电普华信息技术有限公司 Text difference recognition method, device and computer readable medium

Similar Documents

Publication Publication Date Title
CN109710769A A kind of internet "water army" comment detection system and method based on capsule network
CN109726293A (en) A kind of causal event map construction method, system, device and storage medium
CN113392986B (en) Highway bridge information extraction method based on big data and management maintenance system
CN105427869A (en) Session emotion autoanalysis method based on depth learning
CN108932278B (en) Man-machine conversation method and system based on semantic framework
CN109543764B (en) Early warning information validity detection method and detection system based on intelligent semantic perception
CN112434532B (en) Power grid environment model supporting man-machine bidirectional understanding and modeling method
WO2023108991A1 (en) Model training method and apparatus, knowledge classification method and apparatus, and device and medium
CN113312530B (en) Multi-mode emotion classification method taking text as core
CN103942191A (en) Horrific text recognizing method based on content
CN104572614A (en) Training method and system for language model
CN116306667A (en) Text matching method and system for long text
CN112883286A (en) BERT-based method, equipment and medium for analyzing microblog emotion of new coronary pneumonia epidemic situation
CN109871449A (en) A kind of zero sample learning method end to end based on semantic description
CN109614612A (en) A kind of Chinese text error correction method based on seq2seq+attention
CN110888989A (en) Intelligent learning platform and construction method thereof
CN113656564A (en) Power grid service dialogue data emotion detection method based on graph neural network
CN112528642A (en) Implicit discourse relation automatic identification method and system
CN106021413A (en) Theme model based self-extendable type feature selecting method and system
Qiu et al. NeuroSPE: A neuro‐net spatial relation extractor for natural language text fusing gazetteers and pretrained models
Zhao et al. Tibetan multi-dialect speech recognition using latent regression Bayesian network and end-to-end mode
Zhang et al. Research on spectrum sensing system based on composite neural network
CN117252255A (en) Disaster emergency knowledge graph construction method oriented to auxiliary decision
CN115983383A (en) Entity relationship extraction method and related device for power equipment
CN106844448A (en) A kind of recognition methods of Chinese event fact and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination