CN117131153A - Text matching method, device, system and storage medium - Google Patents

Text matching method, device, system and storage medium

Info

Publication number
CN117131153A
Authority
CN
China
Prior art keywords
text
sentence group
text sentence
segmented
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310928224.8A
Other languages
Chinese (zh)
Inventor
林乐平
石玉博
蔡晓东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology filed Critical Guilin University of Electronic Technology
Priority to CN202310928224.8A priority Critical patent/CN117131153A/en
Publication of CN117131153A publication Critical patent/CN117131153A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a text matching method, device, system and storage medium, belonging to the field of text matching. The method comprises the following steps: importing an original text data set and performing word segmentation on it to obtain a plurality of segmented text sentences; grouping the segmented text sentences in pairs to obtain a plurality of segmented text sentence groups; updating each segmented text sentence group to obtain a target text sentence group; and predicting each target text sentence group to obtain a text matching result. The method improves feature extraction, enhances the data efficiency and generalization capability of the model, reduces the amount of computation, saves training cost, better captures the global information of sentences, and alleviates problems such as long-range dependence in long text matching.

Description

Text matching method, device, system and storage medium
Technical Field
The application mainly relates to the technical field of text matching, in particular to a text matching method, a device, a system and a storage medium.
Background
Existing text matching methods generally struggle to model long texts well: when handling a long-text matching task, the model must consider more semantic information and context, which increases computational complexity and slows training and inference. Moreover, because long texts typically have long sequence lengths, the model may be limited by sequence truncation, causing the loss of important semantic information.
Disclosure of Invention
The application aims to solve the technical problem of providing a text matching method, a device, a system and a storage medium aiming at the defects of the prior art.
The technical scheme for solving the technical problems is as follows: a text matching method comprising the steps of:
importing an original text data set, and performing word segmentation on the original text data set to obtain a plurality of segmented text sentences;
grouping all the segmented text sentences in pairs to obtain a plurality of segmented text sentence groups;
updating each word-segmented text sentence group respectively to obtain a target text sentence group corresponding to each word-segmented text sentence group;
and respectively predicting each target text sentence group to obtain prediction scores corresponding to each target text sentence group, and taking all the prediction scores as text matching results.
The other technical scheme for solving the technical problems is as follows: a text matching device, comprising:
an importing module for importing an original text data set;
the word segmentation processing module is used for carrying out word segmentation processing on the original text data set to obtain a plurality of segmented text sentences;
the grouping module is used for grouping all the segmented text sentences in pairs to obtain a plurality of segmented text sentence groups;
the updating module is used for updating each word-segmented text sentence group respectively to obtain a target text sentence group corresponding to each word-segmented text sentence group;
and the text matching result obtaining module is used for respectively predicting each target text sentence group to obtain prediction scores corresponding to each target text sentence group, and taking all the prediction scores as text matching results.
Based on the text matching method, the application further provides a text matching system.
The other technical scheme for solving the technical problems is as follows: a text matching system comprising a memory, a processor and a computer program stored in the memory and executable on the processor, which when executed by the processor, implements a text matching method as described above.
Based on the text matching method, the application further provides a computer readable storage medium.
The other technical scheme for solving the technical problems is as follows: a computer readable storage medium storing a computer program which, when executed by a processor, implements a text matching method as described above.
The beneficial effects of the application are as follows: segmented text sentences are obtained by word segmentation of the original text data set; the segmented text sentences are grouped in pairs to obtain segmented text sentence groups; the segmented text sentence groups are updated to obtain target text sentence groups; and text matching results are obtained by predicting the target text sentence groups. This improves feature extraction, enhances the data efficiency and generalization capability of the model, reduces the amount of computation, saves training cost, better captures the global information of sentences, and alleviates problems such as long-range dependence in long text matching.
Drawings
Fig. 1 is a schematic flow chart of a text matching method according to an embodiment of the present application;
fig. 2 is a block diagram of a text matching device according to an embodiment of the present application.
Detailed Description
The principles and features of the present application are described below with reference to the drawings, the examples are illustrated for the purpose of illustrating the application and are not to be construed as limiting the scope of the application.
Fig. 1 is a schematic flow chart of a text matching method according to an embodiment of the present application.
As shown in fig. 1, a text matching method includes the following steps:
importing an original text data set, and performing word segmentation on the original text data set to obtain a plurality of segmented text sentences;
grouping all the segmented text sentences in pairs to obtain a plurality of segmented text sentence groups;
updating each word-segmented text sentence group respectively to obtain a target text sentence group corresponding to each word-segmented text sentence group;
and respectively predicting each target text sentence group to obtain prediction scores corresponding to each target text sentence group, and taking all the prediction scores as text matching results.
In the above embodiment, word segmentation is performed on the original text data set to obtain segmented text sentences; the segmented text sentences are grouped in pairs to obtain segmented text sentence groups; each segmented text sentence group is updated to obtain a target text sentence group; and each target text sentence group is predicted to obtain the text matching result. This improves feature extraction, enhances the data efficiency and generalization capability of the model, reduces the amount of computation, saves training cost, better captures the global information of sentences, and alleviates problems such as long-range dependence in long text matching.
Optionally, as an embodiment of the present application, the process of performing word segmentation on the original text data set to obtain a plurality of segmented text sentences includes:
and performing word segmentation processing on the original text data set by utilizing a jieba word segmentation library to obtain a plurality of segmented text sentences.
It should be appreciated that the original data set is first pre-processed: the data set is segmented using the Python toolkit jieba (i.e., the jieba word segmentation library), the words in each sentence are separated by spaces, and a dictionary file is created.
Specifically, jieba is a popular Chinese word segmentation library for splitting Chinese text into individual words. It is an open-source project, easy to use and high-performing, and is widely applied in Chinese natural language processing tasks. As a powerful Chinese word segmentation library, it provides a simple, easy-to-use interface and several segmentation modes, and plays an important role in Chinese text processing. Whether for information retrieval, text classification, or sentiment analysis, jieba offers a reliable solution for the word segmentation needs of Chinese text.
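As an illustration of this preprocessing step, the following is a minimal Python sketch that segments sentences with jieba and writes a simple dictionary file; the file name and the toy sentences are illustrative assumptions, not the application's actual data.

```python
import jieba

def preprocess(sentences):
    """Segment each raw sentence with jieba; join the tokens with spaces."""
    segmented = [" ".join(jieba.cut(s)) for s in sentences]
    # Build a simple dictionary file: one unique token per line.
    vocab = sorted({tok for s in segmented for tok in s.split()})
    with open("dict.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(vocab))
    return segmented

pairs = preprocess(["今天天气很好", "今天天气不错"])  # toy sentence pair
```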
In this embodiment, the jieba word segmentation library is used to segment the original text data set into a plurality of segmented text sentences, and a dictionary file can be established, laying a foundation for subsequent data processing, enhancing the data efficiency and generalization capability of the model, reducing the amount of computation, and saving training cost.
Optionally, as an embodiment of the present application, the process of updating each of the word-segmented text sentence groups to obtain a target text sentence group corresponding to each of the word-segmented text sentence groups includes:
respectively carrying out vectorization processing on each segmented text sentence group through a pre-training model SimBERT to obtain hidden layer text vectors corresponding to each segmented text sentence group;
vector updating is carried out on each hidden layer text vector respectively to obtain updated text vectors corresponding to each word-segmented text sentence group;
and respectively splicing each hidden layer text vector and the updated text vector corresponding to each word-segmented text sentence group to obtain a target text sentence group corresponding to each word-segmented text sentence group.
It should be appreciated that the pre-trained SimBERT model is a BERT-based model built on the UniLM idea (proposed by Microsoft), combining the retrieval and generation tasks to further fine-tune the model. It has similar-question generation and similar-sentence retrieval capabilities, and is used in various applications such as synonym generation and text similarity retrieval.
Specifically, sentence 1 and sentence 2 (i.e., the segmented text sentence group) are fed into the pre-trained SimBERT model; each sentence comprises three parts: a position vector, a segment vector, and a word vector. The marks [CLS] and [SEP] distinguish different sentences, where [CLS] is a special symbol marking the classification output and [SEP] a special symbol separating discontinuous token sequences; the position information of each sentence is preserved. The word vector is the vector corresponding to each token in the input sentence. The weight-sharing property of the twin network is exploited. The results are fed into a Transformer encoder, and each token is represented by the bidirectional encoding result. The Transformer encoder comprises a self-attention layer, a residual layer, a normalization layer, and a feed-forward neural network layer. The encoder takes the superimposed character-level vectors as input and finally yields hidden layer vectors carrying semantic information, namely the last-layer output of the pre-trained model, which contains [CLS] and [SEP]. The hidden layer vectors (P, Q) (i.e., the hidden layer text vectors) are thus obtained.
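The following is a minimal sketch of the twin encoding step described above, using the Hugging Face transformers API; the checkpoint name is an assumption (any BERT-style SimBERT weights would serve), and this is a sketch of the idea rather than the application's exact implementation.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Checkpoint name is a placeholder; any BERT-style SimBERT weights would do.
NAME = "WangZeJun/simbert-base-chinese"
tokenizer = AutoTokenizer.from_pretrained(NAME)
encoder = AutoModel.from_pretrained(NAME)

def encode(sentence: str) -> torch.Tensor:
    """Last hidden layer of the encoder, including [CLS] and [SEP] positions."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        return encoder(**inputs).last_hidden_state  # (1, seq_len, hidden)

# Twin (Siamese) setup: the SAME encoder, hence shared weights, for both sentences.
P = encode("今天天气很好")
Q = encode("今天天气不错")
```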
In the above embodiment, each segmented text sentence group is updated to obtain the target text sentence groups, so that the global information and contextual relations of sentences are better extracted and weight sharing between sentences is realized, addressing the difficulty of capturing the global information and contextual relations of sentences during matching.
Optionally, as an embodiment of the present application, the process of updating the vector of each hidden layer text vector to obtain updated text vectors corresponding to each segmented text sentence group includes:
extracting global word sense from each hidden layer text vector through a Bi-LSTM model to obtain global word sense vectors corresponding to each segmented text sentence group, wherein each global word sense vector comprises a plurality of global word sense nodes;
node updating is carried out on a plurality of global word sense nodes corresponding to each segmented text sentence group respectively, so that a plurality of updated global word sense nodes corresponding to each segmented text sentence group are obtained;
performing maximum pooling processing on a plurality of updated global word sense nodes corresponding to each segmented text sentence group through a first formula to obtain updated text vectors corresponding to each segmented text sentence group, wherein the first formula is as follows:
$$\mathrm{conv}_x = \mathop{\mathrm{maxpool}}_{i}\big(\mathrm{ReLU}(U\,\tilde{h}_i^{(x)} + b)\big),$$

wherein $\mathrm{conv}_x$ is the updated text vector corresponding to the $x$-th segmented text sentence group, maxpool is the maximum pooling function, ReLU is the activation function, $\tilde{h}_i^{(x)}$ is the $i$-th updated global word sense node corresponding to the $x$-th segmented text sentence group, $U$ is a weight matrix, and $b$ is a bias.
It should be appreciated that the output (P, Q) of the upper network (i.e., the hidden layer text vectors) is fed into the Bi-LSTM layer (i.e., the Bi-LSTM model) to obtain global word sense information, where the forward and backward hidden states are concatenated at each time step:

$$h_t = \big[\overrightarrow{h_t} \,\|\, \overleftarrow{h_t}\big].$$
It should be appreciated that the Bi-LSTM model, i.e., the bidirectional long short-term memory network (Bidirectional LSTM, BiLSTM for short), is a model based on the recurrent neural network (RNN). Compared with the traditional unidirectional LSTM, BiLSTM can consider historical and future information simultaneously, improving the model's ability to model sequence data. BiLSTM obtains its final output by feeding the input sequence into two LSTM layers, one in time order and one in reverse time order, and splicing their outputs along the time axis. In this way the model extracts features from both past and future context and better captures long-term dependencies in sequence data. BiLSTM is widely used in natural language processing, audio signal processing, handwriting recognition, and other fields; for tasks such as classifying, labeling, or generating sequence data, it has become a common model.
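As an illustration of the Bi-LSTM layer just described, a minimal PyTorch sketch follows; the input dimension of 768 (matching a BERT-style hidden size), the hidden size of 128, and the sequence length are assumptions.

```python
import torch
import torch.nn as nn

# 768 matches a BERT-style hidden size; 128 is an arbitrary illustrative choice.
bilstm = nn.LSTM(input_size=768, hidden_size=128,
                 batch_first=True, bidirectional=True)

P = torch.randn(1, 20, 768)   # hidden layer text vectors, (batch, seq, dim)
out, _ = bilstm(P)            # (1, 20, 256): forward and backward states spliced
```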
Specifically, the DPCNN mainly consists of a text region embedding layer, two convolution blocks (each block composed of two convolutions with a fixed kernel size of 3; the layers built from the two blocks can be connected directly through pre-activation), and a repeat structure; a max-pooling layer is added before each convolution and after pre-activation.
Specifically, the two re-computed node information vectors (i.e., the updated global word sense nodes) are calculated as follows:

$$X = f(U\,\tilde{h} + b), \qquad X_1 = f(U\,X + b),$$

where the function $f$ is the activation function ReLU, $U$ is a weight matrix, and $b$ is a bias. The maximum pooling over the two sentences (i.e., the updated text vector) is then calculated as follows:

$$\mathrm{conv} = \mathrm{maxpool}(X,\, X_1).$$
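A minimal PyTorch sketch of the DPCNN-style computation above follows; the channel count, sequence length, and the reading of maxpool(X, X1) as pooling over the stacked pair are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DPCNNBlock(nn.Module):
    """Two kernel-3 pre-activation convolutions, as in a DPCNN block."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, h):
        x = self.conv1(F.relu(h))    # X  = f(U·h + b), f = ReLU
        x1 = self.conv2(F.relu(x))   # X1 = f(U·X + b)
        return x, x1

h = torch.randn(1, 256, 40)          # (batch, features F, nodes n)
x, x1 = DPCNNBlock(256)(h)
# conv = maxpool(X, X1): pool the pair of re-computed vectors, read here as
# a max over their stack followed by a max over the sequence dimension.
stacked = torch.stack([x, x1], dim=0)
conv = stacked.max(dim=0).values.max(dim=-1).values   # (1, 256)
```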
In the above embodiment, the updated text vectors are obtained by updating each hidden layer text vector, so that the global information of the text can be captured better, the gradient vanishing problem alleviated, and feature representations learned more effectively, which helps extract contextual relations.
Optionally, as an embodiment of the present application, the process of updating nodes of a plurality of global word sense nodes corresponding to each segmented text sentence group to obtain a plurality of updated global word sense nodes corresponding to each segmented text sentence group includes:
calculating the attention coefficients of each global word sense node and the rest global word sense nodes respectively through a second formula to obtain a plurality of attention coefficients corresponding to each segmented text sentence group, wherein the second formula is as follows:
$$(\alpha_{ij})_x = \frac{\exp\big(\mathrm{LeakyReLU}\big(\vec{a}^{\top}\,[W h_i^{(x)} \,\|\, W h_j^{(x)}]\big)\big)}{\sum_{k \in \mathcal{N}_i} \exp\big(\mathrm{LeakyReLU}\big(\vec{a}^{\top}\,[W h_i^{(x)} \,\|\, W h_k^{(x)}]\big)\big)},$$

wherein $(\alpha_{ij})_x$ is the attention coefficient of the $i$-th and $j$-th global word sense nodes corresponding to the $x$-th segmented text sentence group, LeakyReLU is an activation function, $\vec{a}$ is the self-attention matrix, $W$ is a weight matrix, $h_i^{(x)}$, $h_j^{(x)}$ and $h_k^{(x)}$ are the $i$-th, $j$-th and $k$-th global word sense nodes corresponding to the $x$-th segmented text sentence group, and $\mathcal{N}_i$ is the set of global word sense nodes other than the $i$-th one;
and respectively carrying out node update calculation on a plurality of attention coefficients corresponding to each segmented text sentence group and a plurality of global word sense nodes corresponding to each segmented text sentence group through a third formula to obtain a plurality of updated global word sense nodes corresponding to each segmented text sentence group, wherein the third formula is as follows:
$$\tilde{h}_i^{(x)} = \sigma\Big(\sum_{j \in \mathcal{N}_i} (\alpha_{ij})_x\, W\, h_j^{(x)}\Big),$$

wherein $\tilde{h}_i^{(x)}$ is the $i$-th updated global word sense node corresponding to the $x$-th segmented text sentence group, $(\alpha_{ij})_x$ is the attention coefficient of the $i$-th and $j$-th global word sense nodes corresponding to the $x$-th segmented text sentence group, $h_j^{(x)}$ is the $j$-th global word sense node corresponding to the $x$-th segmented text sentence group, $\sigma$ is an activation function, $\mathcal{N}_i$ is the set of global word sense nodes other than the $i$-th one, and $W$ is a weight matrix.
It should be appreciated that the output vectors of the upper network, $h = \{\vec{h}_1, \vec{h}_2, \dots, \vec{h}_n\}$, $\vec{h}_i \in \mathbb{R}^F$ (i.e., the plurality of global word sense nodes), are used as the input of the GAT layer, where $n$ is the number of nodes and $F$ is the number of features of each node.
Specifically, the attention score between the center node of the sentence word vectors and a neighboring node is calculated as follows:

$$e_{ij} = a\big(W\vec{h}_i,\, W\vec{h}_j\big),$$

where $i$ and $j$ denote two neighboring nodes, $W$ is a weight matrix used to train the nodes, and $\vec{h}_i$, $\vec{h}_j$ are two word vectors in a sentence. Self-attention with a shared attention mechanism $a$ is then performed on the nodes. Introducing softmax regularizes, for each node $i$, all adjacent nodes $j$:

$$\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})}.$$

The attention mechanism $a$ is a single-layer feed-forward neural network with the LeakyReLU nonlinear activation function, which yields:

$$\alpha_{ij} = \frac{\exp\big(\mathrm{LeakyReLU}\big(\vec{a}^{\top}[W\vec{h}_i \,\|\, W\vec{h}_j]\big)\big)}{\sum_{k \in \mathcal{N}_i} \exp\big(\mathrm{LeakyReLU}\big(\vec{a}^{\top}[W\vec{h}_i \,\|\, W\vec{h}_k]\big)\big)}.$$

It should be appreciated that the output of the new node information (i.e., the updated global word sense nodes) is then derived as follows:

$$\vec{h}_i{}' = \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha_{ij}\, W\, \vec{h}_j\Big).$$
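A single-head graph attention layer matching the formulas above can be sketched in PyTorch as follows; the fully connected neighborhood (every other node treated as a neighbor, as in the sentence graph described here), the choice of sigmoid for $\sigma$, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Single-head graph attention layer over a fully connected node set."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)  # shared attention a

    def forward(self, h):                 # h: (n, F) node features
        Wh = self.W(h)                    # (n, F')
        n = Wh.size(0)
        # e_ij = LeakyReLU(a^T [Wh_i || Wh_j]) for every node pair (i, j)
        pairs = torch.cat([Wh.unsqueeze(1).expand(n, n, -1),
                           Wh.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs).squeeze(-1))     # (n, n)
        alpha = torch.softmax(e, dim=-1)                # normalize over j
        # h'_i = sigma(sum_j alpha_ij · W h_j); sigma taken as sigmoid here
        return torch.sigmoid(alpha @ Wh)

h = torch.randn(40, 256)                  # n = 40 nodes, F = 256 features
h_updated = GATLayer(256, 256)(h)
```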
In the above embodiment, the plurality of global word sense nodes are updated to obtain the updated global word sense nodes, so that the global information of the text can be captured better, the gradient vanishing problem alleviated, and feature representations learned more effectively, which helps extract contextual relations.
Optionally, as an embodiment of the present application, the process of respectively splicing each hidden layer text vector and the updated text vector corresponding to each segmented text sentence group to obtain the target text sentence group corresponding to each segmented text sentence group includes:
splicing each hidden layer text vector and the updated text vector corresponding to each word-segmented text sentence group through a fourth formula to obtain a target text sentence group corresponding to each word-segmented text sentence group, wherein the fourth formula is as follows:
$$y_{\mathrm{out}}^{(x)} = \mathrm{concat}\big(\mathrm{conv}_x,\, B_x\big),$$

wherein $y_{\mathrm{out}}^{(x)}$ is the target text sentence group corresponding to the $x$-th segmented text sentence group, $\mathrm{conv}_x$ is the updated text vector corresponding to the $x$-th segmented text sentence group, and $B_x$ is the hidden layer text vector corresponding to the $x$-th segmented text sentence group.
It should be understood that the two sentences pass through the pre-trained encoding layer and are sent to the GAT layer for whole-sentence modeling, the DPCNN is used to extract sentence features, and the two outputs (i.e., the updated text vector and the hidden layer text vector) are finally connected to the fully connected layer for the final classification prediction over the target text sentence group.
Specifically, the output of the pre-trained encoding layer (i.e., the hidden layer text vector) and the output of the DPCNN (i.e., the updated text vector) are connected by a residual-style concatenation to obtain the output (i.e., the target text sentence group), as follows:

$$y_{\mathrm{out}} = \mathrm{concat}\big(\mathrm{conv},\, (P, Q)\big),$$

where concat is a vector splicing operation commonly used in deep learning: it joins two or more tensors along a given dimension, stitching the input tensors together in the specified dimension to form a new tensor. The operation can be performed with the concatenation routines of NumPy or PyTorch. It better reflects the difference between the two texts and reduces the complexity of the network.
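As a small illustration of this residual-style concatenation, assuming illustrative dimensions:

```python
import torch

conv = torch.randn(1, 256)    # DPCNN output: the updated text vector
PQ = torch.randn(1, 768)      # pooled hidden layer text vector for (P, Q)
y_out = torch.cat([conv, PQ], dim=-1)   # y_out = concat(conv, (P, Q))
```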
In the above embodiment, each hidden layer text vector and the updated text vector are respectively spliced by the fourth formula to obtain the target text sentence group, so that the difference between the two texts can be better reflected, and the complexity of the network is reduced.
Optionally, as an embodiment of the present application, the process of predicting each target text sentence group to obtain the prediction score corresponding to each target text sentence group includes:
predicting each target text sentence group through a fifth formula to obtain a prediction score corresponding to each target text sentence group, wherein the fifth formula is as follows:
$$\hat{y}_x = \mathrm{softmax}\big(U_1\, y_{\mathrm{out}}^{(x)} + b\big),$$

wherein $\hat{y}_x$ is the prediction score corresponding to the $x$-th segmented text sentence group, softmax is the activation function, $y_{\mathrm{out}}^{(x)}$ is the target text sentence group corresponding to the $x$-th segmented text sentence group, $U_1$ is a weight matrix, and $b$ is a bias.
Specifically, a softmax classification prediction is performed as follows:

$$\hat{y} = \mathrm{softmax}\big(U_1\, y_{\mathrm{out}} + b\big),$$

where $U_1$ represents the weight matrix and $b$ represents the bias.
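A minimal sketch of this final classification step, with an assumed two-class output and illustrative dimensions:

```python
import torch
import torch.nn as nn

y_out = torch.randn(1, 1024)                 # concatenated target vector
fc = nn.Linear(1024, 2)                      # weight U1 and bias b
score = torch.softmax(fc(y_out), dim=-1)     # prediction score per class
```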
In the above embodiment, each target text sentence group is predicted through the fifth formula to obtain the prediction scores, which improves feature extraction, enhances the data efficiency and generalization capability of the model, reduces the amount of computation, saves training cost, better captures the global information of sentences, and alleviates problems such as long-range dependence in long text matching.
Alternatively, as another embodiment of the present application, the original data set is first processed to construct vector sentences suitable for input into the network. Second, the two sentences to be matched are respectively input into the pre-trained twin BERT network (SimBERT) to obtain vector representations. Then a graph expressing the sentence relations is constructed and sent into the graph attention network (GAT). The obtained sentence features are extracted through the DPCNN network, and finally the fully connected layer is attached to classify the result.
The application mainly introduces the concept of the graph into the text matching task, taking each word vector in a sentence as a node of the graph and the relations between the vectors as its edges. Using the graph attention network, the global information of sentences can be better acquired, and problems such as long-range dependence in long text matching are alleviated. Meanwhile, the weight sharing mechanism of the SimBERT pre-training model is exploited; weight sharing reduces the number of model parameters, improves feature extraction, enhances data efficiency, and strengthens the generalization capability of the model. The DPCNN also reduces the amount of computation and saves training cost.
Alternatively, as another embodiment of the present application, the application uses a graph, a data structure that can better model global information, to better extract the global information and context of sentences. Meanwhile, the SimBERT twin pre-training model and the DPCNN network save training cost, reduce the amount of computation, and improve the efficiency of the matching process. The method enhances the interactivity between sentences, so that the model better obtains the global contextual relations of the text.
Optionally, as another embodiment of the present application, the technical problem to be solved by the present application is as follows:
the existing text matching model aims at the relation between the long text sentence before and after the matching and capturing, and the global theme information is not well acquired, so that the semantic matching effect is greatly weakened.
Meanwhile, current long text matching models have large computation and parameter counts, which is unfavorable for saving matching cost.
Optionally, as another embodiment of the present application, the technical method for solving the problem of the present application is as follows:
for the problems that the global information of sentences and the context association are difficult to capture in the matching process, and the like. A text representation method based on combination of a twin pre-training model and a graph meaning network is constructed. The data structure which can better model the global information is utilized, and the global information and the context relation of sentences are better extracted. And constructing a twin pre-training model SimBERT network, realizing weight sharing among sentences, and constructing a matching model which is favorable for extracting context relations.
Finally, in the sentence feature extraction process, the DPCNN (deep pyramid convolutional neural network) is used, with residual connections transferring information across the hierarchy, allowing the model to better capture the global information of the text. Residual connections alleviate the gradient vanishing problem and help the model learn feature representations more efficiently.
Fig. 2 is a block diagram of a text matching device according to an embodiment of the present application.
Alternatively, as another embodiment of the present application, as shown in fig. 2, a text matching apparatus includes:
an importing module for importing an original text data set;
the word segmentation processing module is used for carrying out word segmentation processing on the original text data set to obtain a plurality of segmented text sentences;
the grouping module is used for grouping all the segmented text sentences in pairs to obtain a plurality of segmented text sentence groups;
the updating module is used for updating each word-segmented text sentence group respectively to obtain a target text sentence group corresponding to each word-segmented text sentence group;
and the text matching result obtaining module is used for respectively predicting each target text sentence group to obtain prediction scores corresponding to each target text sentence group, and taking all the prediction scores as text matching results.
Alternatively, another embodiment of the present application provides a text matching system including a memory, a processor, and a computer program stored in the memory and executable on the processor, which when executed by the processor, implements the text matching method as described above. The system may be a computer or the like.
Alternatively, another embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the text matching method as described above.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the apparatus and units described above may refer to corresponding procedures in the foregoing method embodiments, which are not described herein again.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of elements is merely a logical functional division, and there may be additional divisions of actual implementation, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment of the present application.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application is essentially or a part contributing to the prior art, or all or part of the technical solution may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing description of the preferred embodiments of the application is not intended to limit the application to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the application are intended to be included within the scope of the application.

Claims (10)

1. A text matching method, comprising the steps of:
importing an original text data set, and performing word segmentation on the original text data set to obtain a plurality of segmented text sentences;
grouping all the segmented text sentences in pairs to obtain a plurality of segmented text sentence groups;
updating each word-segmented text sentence group respectively to obtain a target text sentence group corresponding to each word-segmented text sentence group;
and respectively predicting each target text sentence group to obtain prediction scores corresponding to each target text sentence group, and taking all the prediction scores as text matching results.
2. The text matching method according to claim 1, wherein the process of word segmentation of the original text data set to obtain a plurality of segmented text sentences comprises:
and performing word segmentation processing on the original text data set by utilizing a jieba word segmentation library to obtain a plurality of segmented text sentences.
3. The text matching method according to claim 1, wherein the process of updating each of the segmented text sentence groups to obtain a target text sentence group corresponding to each of the segmented text sentence groups includes:
respectively carrying out vectorization processing on each segmented text sentence group through a pre-training model SimBERT to obtain hidden layer text vectors corresponding to each segmented text sentence group;
vector updating is carried out on each hidden layer text vector respectively to obtain updated text vectors corresponding to each word-segmented text sentence group;
and respectively splicing each hidden layer text vector and the updated text vector corresponding to each word-segmented text sentence group to obtain a target text sentence group corresponding to each word-segmented text sentence group.
4. A text matching method according to claim 3, wherein the process of respectively performing vector update on each hidden layer text vector to obtain updated text vectors corresponding to each segmented text sentence group comprises:
extracting global word sense from each hidden layer text vector through a Bi-LSTM model to obtain global word sense vectors corresponding to each segmented text sentence group, wherein each global word sense vector comprises a plurality of global word sense nodes;
node updating is carried out on a plurality of global word sense nodes corresponding to each segmented text sentence group respectively, so that a plurality of updated global word sense nodes corresponding to each segmented text sentence group are obtained;
performing maximum pooling processing on a plurality of updated global word sense nodes corresponding to each segmented text sentence group through a first formula to obtain updated text vectors corresponding to each segmented text sentence group, wherein the first formula is as follows:
$$\mathrm{conv}_x = \mathop{\mathrm{maxpool}}_{i}\big(\mathrm{ReLU}(U\,\tilde{h}_i^{(x)} + b)\big),$$

wherein $\mathrm{conv}_x$ is the updated text vector corresponding to the $x$-th segmented text sentence group, maxpool is the maximum pooling function, ReLU is the activation function, $\tilde{h}_i^{(x)}$ is the $i$-th updated global word sense node corresponding to the $x$-th segmented text sentence group, $U$ is a weight matrix, and $b$ is a bias.
5. The text matching method according to claim 4, wherein the step of updating the nodes of the plurality of global word sense nodes corresponding to the respective segmented text sentence groups to obtain the plurality of updated global word sense nodes corresponding to the respective segmented text sentence groups includes:
calculating the attention coefficients of each global word sense node and the rest global word sense nodes respectively through a second formula to obtain a plurality of attention coefficients corresponding to each segmented text sentence group, wherein the second formula is as follows:
$$(\alpha_{ij})_x = \frac{\exp\big(\mathrm{LeakyReLU}\big(\vec{a}^{\top}\,[W h_i^{(x)} \,\|\, W h_j^{(x)}]\big)\big)}{\sum_{k \in \mathcal{N}_i} \exp\big(\mathrm{LeakyReLU}\big(\vec{a}^{\top}\,[W h_i^{(x)} \,\|\, W h_k^{(x)}]\big)\big)},$$

wherein $(\alpha_{ij})_x$ is the attention coefficient of the $i$-th and $j$-th global word sense nodes corresponding to the $x$-th segmented text sentence group, LeakyReLU is an activation function, $\vec{a}$ is the self-attention matrix, $W$ is a weight matrix, $h_i^{(x)}$, $h_j^{(x)}$ and $h_k^{(x)}$ are the $i$-th, $j$-th and $k$-th global word sense nodes corresponding to the $x$-th segmented text sentence group, and $\mathcal{N}_i$ is the set of global word sense nodes other than the $i$-th one;
and respectively carrying out node update calculation on a plurality of attention coefficients corresponding to each segmented text sentence group and a plurality of global word sense nodes corresponding to each segmented text sentence group through a third formula to obtain a plurality of updated global word sense nodes corresponding to each segmented text sentence group, wherein the third formula is as follows:
$$\tilde{h}_i^{(x)} = \sigma\Big(\sum_{j \in \mathcal{N}_i} (\alpha_{ij})_x\, W\, h_j^{(x)}\Big),$$

wherein $\tilde{h}_i^{(x)}$ is the $i$-th updated global word sense node corresponding to the $x$-th segmented text sentence group, $(\alpha_{ij})_x$ is the attention coefficient of the $i$-th and $j$-th global word sense nodes corresponding to the $x$-th segmented text sentence group, $h_j^{(x)}$ is the $j$-th global word sense node corresponding to the $x$-th segmented text sentence group, $\sigma$ is an activation function, $\mathcal{N}_i$ is the set of global word sense nodes other than the $i$-th one, and $W$ is a weight matrix.
6. The text matching method according to claim 3, wherein the process of concatenating each of the hidden layer text vectors and the updated text vectors corresponding to each of the segmented text sentence sets to obtain the target text sentence set corresponding to each of the segmented text sentence sets includes:
splicing each hidden layer text vector and the updated text vector corresponding to each word-segmented text sentence group through a fourth formula to obtain a target text sentence group corresponding to each word-segmented text sentence group, wherein the fourth formula is as follows:
$$y_{\mathrm{out}}^{(x)} = \mathrm{concat}\big(\mathrm{conv}_x,\, B_x\big),$$

wherein $y_{\mathrm{out}}^{(x)}$ is the target text sentence group corresponding to the $x$-th segmented text sentence group, $\mathrm{conv}_x$ is the updated text vector corresponding to the $x$-th segmented text sentence group, and $B_x$ is the hidden layer text vector corresponding to the $x$-th segmented text sentence group.
7. The text matching method according to claim 1, wherein the predicting each of the target text sentence groups to obtain a prediction score corresponding to each of the target text sentence groups includes:
predicting each target text sentence group through a fifth formula to obtain a prediction score corresponding to each target text sentence group, wherein the fifth formula is as follows:
$$\hat{y}_x = \mathrm{softmax}\big(U_1\, y_{\mathrm{out}}^{(x)} + b\big),$$

wherein $\hat{y}_x$ is the prediction score corresponding to the $x$-th segmented text sentence group, softmax is the activation function, $y_{\mathrm{out}}^{(x)}$ is the target text sentence group corresponding to the $x$-th segmented text sentence group, $U_1$ is a weight matrix, and $b$ is a bias.
8. A text matching apparatus, comprising:
an importing module for importing an original text data set;
the word segmentation processing module is used for carrying out word segmentation processing on the original text data set to obtain a plurality of segmented text sentences;
the grouping module is used for grouping all the text sentences subjected to word segmentation into one group in pairs to obtain a plurality of text sentence groups subjected to word segmentation;
the updating module is used for updating each word-segmented text sentence group respectively to obtain a target text sentence group corresponding to each word-segmented text sentence group;
and the text matching result obtaining module is used for respectively predicting each target text sentence group to obtain prediction scores corresponding to each target text sentence group, and taking all the prediction scores as text matching results.
9. A text matching system comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the text matching method according to any of claims 1 to 7 is implemented when the computer program is executed by the processor.
10. A computer readable storage medium storing a computer program, characterized in that the text matching method according to any one of claims 1 to 7 is implemented when the computer program is executed by a processor.
CN202310928224.8A 2023-07-26 2023-07-26 Text matching method, device, system and storage medium Pending CN117131153A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310928224.8A CN117131153A (en) 2023-07-26 2023-07-26 Text matching method, device, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310928224.8A CN117131153A (en) 2023-07-26 2023-07-26 Text matching method, device, system and storage medium

Publications (1)

Publication Number Publication Date
CN117131153A 2023-11-28

Family

ID=88851810

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310928224.8A Pending CN117131153A (en) 2023-07-26 2023-07-26 Text matching method, device, system and storage medium

Country Status (1)

Country Link
CN (1) CN117131153A (en)

Similar Documents

Publication Publication Date Title
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
Wang et al. Application of convolutional neural network in natural language processing
CN110969020B (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN108388560B (en) GRU-CRF conference name identification method based on language model
CN111783462A (en) Chinese named entity recognition model and method based on dual neural network fusion
CN110263325B (en) Chinese word segmentation system
Shi et al. Deep adaptively-enhanced hashing with discriminative similarity guidance for unsupervised cross-modal retrieval
CN110990555B (en) End-to-end retrieval type dialogue method and system and computer equipment
CN111985239A (en) Entity identification method and device, electronic equipment and storage medium
CN112163429B (en) Sentence correlation obtaining method, system and medium combining cyclic network and BERT
CN112163092B (en) Entity and relation extraction method, system, device and medium
CN111400494B (en) Emotion analysis method based on GCN-Attention
Xing et al. A convolutional neural network for aspect-level sentiment classification
CN115545041B (en) Model construction method and system for enhancing semantic vector representation of medical statement
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
Li et al. Multimodal fusion with co-attention mechanism
Touati-Hamad et al. Arabic quran verses authentication using deep learning and word embeddings
CN114881038B (en) Chinese entity and relation extraction method and device based on span and attention mechanism
CN113792121B (en) Training method and device of reading and understanding model, reading and understanding method and device
CN116257601A (en) Illegal word stock construction method and system based on deep learning
CN115687576A (en) Keyword extraction method and device represented by theme constraint
CN115600597A (en) Named entity identification method, device and system based on attention mechanism and intra-word semantic fusion and storage medium
CN114510569A (en) Chemical emergency news classification method based on Chinesebert model and attention mechanism
CN115130475A (en) Extensible universal end-to-end named entity identification method
CN114722798A (en) Ironic recognition model based on convolutional neural network and attention system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination