CN113688621B - Text matching method and device for texts with different lengths under different granularities - Google Patents

Text matching method and device for texts with different lengths under different granularities

Info

Publication number
CN113688621B
CN113688621B (application CN202111023691.3A)
Authority
CN
China
Prior art keywords
task
text
model
training
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111023691.3A
Other languages
Chinese (zh)
Other versions
CN113688621A (en)
Inventor
魏骁勇
谢东霖
张栩禄
杨震群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority to CN202111023691.3A
Publication of CN113688621A
Application granted
Publication of CN113688621B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/211 - Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the field of natural language processing and provides a text matching method and device for texts of different lengths under different matching granularities. The invention aims to solve the problems of different matching granularities, multiple subtasks, class imbalance, and the need to handle very long texts. The main scheme comprises: 1) preparing a data set; 2) data enhancement; 3) continued pre-training on the task-specific data set; 4) long-text processing; 5) designing a multi-task framework; 6) optimizing the multi-task weights; 7) fine tuning and training of the neural network model structure. The method is used for matching texts of different lengths under different granularities.

Description

Text matching method and device for texts with different lengths under different granularities
Technical Field
The invention relates to text matching of texts with different lengths under different granularities; it can be used to match texts of different lengths under two matching granularities, coarse and fine, and belongs to the text matching field of natural language processing.
Background
Text matching is an important basic problem in natural language processing and underlies a large number of NLP tasks, such as information retrieval, question answering, dialogue systems and machine translation, which can to a large extent be abstracted as text matching problems. For example, web search can be abstracted as a relevance-matching problem between web pages and the user's search query, automatic question answering as a satisfaction-matching problem between candidate answers and questions, and text deduplication as a similarity-matching problem between texts.
Traditional text matching techniques include algorithms such as BoW, VSM, TF-IDF, BM25, Jaccard similarity and SimHash. For example, the BM25 algorithm computes a matching score between a document field and a query field from the degree to which the document covers the query terms; the higher the score, the better the match between the web page and the query. These methods mainly address matching or similarity at the lexical level. In practice, matching algorithms based on lexical overlap have great limitations, including semantic limitations, structural limitations and knowledge limitations. For example, a semantic limitation: two different Chinese words for "taxi" are literally dissimilar but refer to the same vehicle, and "apple" means different things in different contexts, either the fruit or the company. Therefore, the text matching task cannot stop at the literal level; matching at the semantic level is needed, and semantic matching first faces the problem of how to represent and compute semantics.
In recent years, deep neural network models have achieved remarkable results on natural language processing tasks and can represent word semantics well as word vectors. The early word2vec model learns word representations by using surrounding words to predict the central word and using the central word to predict surrounding words, but it produces static word vectors: a word has the same representation in every context, so it cannot resolve the ambiguity of polysemous words across different contexts. The BERT model proposed by Google in 2018 addresses polysemy well; its Transformer architecture lets word vectors incorporate contextual semantics, and when applied to downstream tasks the whole pre-trained network is used rather than only an embedding layer as with word2vec, so this dynamic structure can give the same word different representations according to different contexts.
Disclosure of Invention
In view of the above technical problems, the present invention aims to solve the problems of different matching granularities, multiple subtasks, category imbalance and the need to process very long texts.
The technical scheme adopted by the invention is as follows:
a text matching method of texts with different lengths under different granularities comprises the following steps,
step 1, preparing a data set, and labeling the text pairs under the different matching granularities as coarse or fine;
step 2, performing data enhancement on the data set in the step 1, and increasing the generalization capability of the model;
step 3, performing model pre-training on the data set subjected to data enhancement to obtain a pre-training model;
step 4, carrying out truncation processing on the long text in the data set subjected to data enhancement in the step 2 to obtain a text after the long text is truncated;
step 5, designing a multi-task framework in which information among the different model training tasks supplements each other;
step 6, optimizing the weight of the multi-task framework, and continuing to train the neural network model;
and 7, fine tuning and training of the neural network model structure based on the weight-optimized multitask framework to obtain the neural network model capable of judging whether the text pairs are similar under different granularities.
In the above technical solution, step 1 includes the following steps:
step 1.1, preparing Chinese text pair data sets of different lengths, comprising two matching granularities, coarse and fine; coarse-grained matching only requires that the two texts belong to the same topic, while fine-grained matching requires that the two texts describe the same event;
and step 1.2, each granularity comprises text pairs of three different length combinations, namely short-short, short-long and long-long text pairs; the data sets under the two granularities are not completely the same, and the text pairs under the different matching granularities are labeled as coarse or fine.
In the above technical solution, the data enhancement in step 2 includes the following steps:
2.1, enhancing according to a transitivity rule of the similarity;
2.2, enhancement between different granularities: fine-grained matching imposes stricter conditions than coarse-grained matching, so fine-grained pairs can be used as an enhancement of the coarse-grained data.
In the above technical solution, in step 3:
adopting an open-source RoBERTa-wwm Chinese pre-trained model and continuing pre-training on the text corpora of different lengths in this scenario; the texts from step 2 are segmented into Chinese words with the jieba library, then in each round 15% of the tokens are randomly selected, of which 80% are masked, 10% are randomly replaced with other words and the remaining 10% are left unchanged; the masking uses the WWM (whole word masking) strategy, i.e. whole words are masked, yielding the trained pre-training model;
the loss function for the pre-training phase is as follows:
L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{M} y_{ic}\log(p_{ic})
where N is the total number of samples, M is the number of classes, y_{ic} is an indicator function that takes 1 if sample i belongs to class c and 0 otherwise, and p_{ic} denotes the predicted probability that sample i belongs to class c.
In the above technical solution, in step 4:
truncating the long texts in the short-long text pairs and the long-long text pairs of the data set of step 2, using a truncation method that extracts key sentences:
4.1, firstly, carrying out clause division on the long text according to separators;
4.2, taking each sentence as a node, dividing the sentences into words, filtering stop words, and calculating the similarity between every two sentences;
4.3, constructing a node connection graph, taking the similarity between sentences as the weight values of edges between corresponding nodes, and setting a threshold value to filter out the edges with lower weight values;
4.4, calculating the weight value of each sentence according to the weight of the edge, and then iteratively propagating the weight value of each node according to the connection graph until convergence;
and 4.5, sorting the sentences in descending order of weight, and splicing the sentences that meet the length requirement in their original order in the text to form the truncated text.
In the above technical solution, step 5 comprises:
using the MTL multi-task model training framework, in which information among the different model training tasks supplements each other; coarse-grained text matching is set as task A and fine-grained text matching as task B, and the hard parameter sharing mode of MTL is used, i.e. the pre-trained model parameters are shared among the different model training tasks under the multi-task model framework.
In the above technical solution, in step 6:
optimizing the weights of different model training tasks, wherein in different training stages, each stage sets dynamic weight for each task, the F1 value of each task after each iteration is used as an evaluation standard of model difficulty, if the F1 value of a certain task is larger, the task is easy to learn, and the task has smaller weight, and the calculation formula of the weights is as follows:
w_i = -(1 - k_i)^{\gamma_i}\log(k_i)
where w_i represents the weight of task i at each iteration, k_i refers to the KPI, i.e. the evaluation metric of task i, here the F1 value, and γ_i is a hyper-parameter used to adjust the weight of task i.
In the above technical solution, step 7 comprises:
weighting the samples of task A and task B in the multi-task model training of step 6 so that the model can focus on hard samples; at each iteration the weight of the samples misclassified last time is increased and the weight of the correctly classified samples is decreased; finally, the loss of the whole model is the weighted sum of the losses of task A and task B, and the expression of the final loss function is as follows;
L = w_A\left(-\frac{1}{N_A}\sum_{i=1}^{N_A}\alpha_A(1-p_i)^{\gamma_A}\log(p_i)\right) + w_B\left(-\frac{1}{N_B}\sum_{i=1}^{N_B}\alpha_B(1-p_i)^{\gamma_B}\log(p_i)\right)
where w_A and w_B are the weights of task A and task B computed at each iteration with the weight calculation formula of step 6, N_A and N_B are the data amounts of task A and task B, α_A, α_B and γ_A, γ_B are the hyper-parameters that adjust the sample weights in tasks A and B respectively, and p_i is the probability predicted for the true class.
Step 8: the text pairs in the data set enhanced in step 2 are spliced and fed into the network model of step 7; using the labels as supervision information, the neural network is trained with a gradient descent strategy, and after several iterations a neural network that can judge whether text pairs are similar under different granularities is obtained.
The invention also provides a text matching device of texts with different lengths under different granularities, which comprises the following modules,
the data set preparation module is used for preparing a data set and labeling the text pairs under the different matching granularities as coarse or fine, for training the neural network;
the data set enhancement module is used for enhancing the data of the data set in the step 1 and increasing the generalization capability of the model;
the pre-training model module is used for performing model pre-training on the data set subjected to data enhancement to obtain a pre-training model;
the long text truncation module is used for truncating the long text in the data set subjected to data enhancement in the step 2 to obtain a text after the long text truncation;
the multi-task framework module is used for designing a multi-task framework, and information among different model training tasks is mutually supplemented;
the weight optimization module is used for optimizing the weight of the multi-task framework and continuing to train the neural network model;
and the neural network module is used for fine tuning and training the neural network model structure based on the weight-optimized multitask framework to obtain a neural network model that can judge whether text pairs are similar under different granularities.
The invention also provides a storage medium, wherein a program for text matching of texts with different lengths under different granularities is stored in the storage medium, and when the program is executed by a GPU, the text matching method for texts with different lengths under different granularities is implemented.
The technology adopted by the invention has the following beneficial effects:
1. In step 5, the invention adopts a multi-task model for the several subtasks, so that different tasks share one model: this reduces the amount of memory occupied, lets the multiple tasks obtain results with a single forward computation (increasing inference speed), and lets associated tasks share and mutually supplement information, improving the performance of the different tasks. In step 6 the weight of each subtask is adjusted dynamically, so that the multi-task model converges well on every subtask during training, and in step 7 the samples are weighted, so that the model learns hard samples better and model performance improves;
2. In step 4 the method extracts key sentences from long texts instead of crudely truncating the head and tail, which improves matching on long texts; in addition, the rule-based and cross-granularity data enhancement in step 2 improves the generalization ability of the model;
3. In step 8 the invention splices each text pair into a single input; compared with feeding each text into the Chinese pre-trained model separately to obtain its representation, this reduces training time, and because the spliced text passes through the attention modules of the pre-trained model, interaction between the two texts at the bottom layers increases, which also improves performance.
Drawings
FIG. 1 is a pre-training flow diagram.
FIG. 2 is a diagram of a multitasking model framework.
Detailed Description
The invention provides a text matching method for texts with different lengths under different granularities; matching is performed on text pairs of three length combinations, namely short-short, short-long and long-long, under two matching granularities.
The main process of the invention comprises: 1) preparing a data set; 2) data enhancement; 3) continued pre-training on the task-specific data set; 4) long-text processing; 5) designing a multi-task framework; 6) optimizing the multi-task weights; 7) fine tuning and training of the neural network model structure. The specific steps are as follows:
1. preparing a data set
Chinese text pair data sets of different lengths are prepared, comprising two matching granularities, coarse and fine; coarse-grained matching only requires that the two texts belong to the same topic, while fine-grained matching requires that the two texts describe the same event. Each granularity comprises text pairs of three length combinations, namely short-short, short-long and long-long pairs; the data sets under the two granularities are not completely the same, and the text pairs under the different matching granularities are labeled as coarse or fine and used to train the neural network. The three types of text pair generally correspond to different matching scenarios: short-short pairs can be used for automatic question answering, short-long pairs for retrieval, and long-long pairs for news article recommendation (recommending similar articles below a news item), and the different matching granularities also correspond to different application scenarios. Training a separate model for each scenario would make the models too large and redundant, since the scenarios are similar; the multi-task framework finally designed by the method therefore trains only one model, which greatly reduces the number of model parameters.
2. Data enhancement
Data enhancement is performed on the data set of step 1 to increase the generalization ability of the model. Two main ways of enhancement are adopted (a small code sketch follows the rules below):
1. Enhancement according to rules: let a short text be s and a long text be l; then, by the transitivity principle:
1.1, if s_i is similar to s_j and s_j is similar to s_k, then s_i is similar to s_k;
1.2, if s_i is similar to s_j and s_j is similar to l_k, then s_i is similar to l_k;
1.3, if s_i is similar to l_j and l_j is similar to l_k, then s_i is similar to l_k;
1.4, if l_i is similar to l_j and l_j is similar to l_k, then l_i is similar to l_k;
2. Enhancement between different granularities:
2.1, if the text pairs are similar under the fine granularity, the text pairs are also similar under the coarse granularity;
2.2, if the text pairs are dissimilar under the coarse granularity, the text pairs are certainly dissimilar under the fine granularity;
the text pairs meeting the conditions under the fine granularity can be used as data enhancement under the coarse granularity;
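A minimal sketch of the rule-based enhancement described above, assuming the labeled pairs are stored as (text_a, text_b, label) tuples with label 1 for similar; the function names and data layout are illustrative, not the patent's own code:

```python
from collections import defaultdict
from itertools import combinations


def augment_by_transitivity(pairs):
    """If (a, b) and (b, c) are both labeled similar, add (a, c) as a new
    similar pair. `pairs` is an iterable of (text_a, text_b, label),
    where label 1 means similar."""
    neighbors = defaultdict(set)
    for a, b, label in pairs:
        if label == 1:
            neighbors[a].add(b)
            neighbors[b].add(a)
    existing = {frozenset((a, b)) for a, b, _ in pairs}
    augmented = []
    for pivot in neighbors:
        for x, y in combinations(sorted(neighbors[pivot]), 2):
            key = frozenset((x, y))
            if key not in existing:
                existing.add(key)
                augmented.append((x, y, 1))
    return augmented


def augment_coarse_from_fine(fine_pairs):
    """Cross-granularity enhancement: pairs that are similar at the fine
    granularity are also similar at the coarse granularity."""
    return [(a, b, 1) for a, b, label in fine_pairs if label == 1]
```

The transitive rule can also be applied repeatedly until no new pairs appear, at the cost of somewhat noisier labels.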
3. continuing pre-training on a task-specific dataset
An open-source RoBERTa-wwm Chinese pre-trained model is adopted and further pre-trained on the text corpora of different lengths in this scenario. In the pre-training stage the texts from step 2 are used directly: they are segmented into Chinese words with the jieba library, and in each round 15% of the tokens are randomly selected for masking; of these, 80% are masked, 10% are randomly replaced with other words, and the remaining 10% are left unchanged. The masking uses the WWM (whole word masking) strategy, i.e. an entire word is masked rather than a single character within a word, which makes the task more challenging. The NSP task is removed, since research shows it is too simple and tends to hurt rather than improve model performance. The loss function of the pre-training stage is as follows:
L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{M} y_{ic}\log(p_{ic})
where N is the total number of samples, M is the number of classes (i.e. the number of all words and symbols in the vocabulary), y_{ic} is an indicator function that takes 1 if sample i belongs to class c and 0 otherwise, and p_{ic} denotes the predicted probability that sample i belongs to class c;
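A minimal sketch of the whole-word masking step described above, assuming jieba for word segmentation and a generic [MASK] token; vocabulary handling and special tokens are simplified, and the 15%/80%/10%/10% ratios follow the description:

```python
import random

import jieba


def whole_word_mask(text, vocab, mask_token="[MASK]",
                    select_ratio=0.15, mask_p=0.8, replace_p=0.1):
    """Randomly select ~15% of the words of `text`; of those, 80% are
    masked, 10% are replaced by a random word from `vocab`, and 10% are
    left unchanged. Whole words are masked (WWM), not single characters."""
    words = list(jieba.cut(text))
    n_select = max(1, int(len(words) * select_ratio))
    selected = set(random.sample(range(len(words)), n_select))
    corrupted, targets = [], []
    for i, word in enumerate(words):
        if i not in selected:
            corrupted.append(word)
            targets.append(None)
            continue
        targets.append(word)                          # prediction target for the MLM loss
        r = random.random()
        if r < mask_p:
            corrupted.append(mask_token * len(word))  # mask every character of the word
        elif r < mask_p + replace_p:
            corrupted.append(random.choice(vocab))    # random replacement
        else:
            corrupted.append(word)                    # keep unchanged
    return corrupted, targets
```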
4. processing long text
Since the RoBERTa-wwm model used in step 3 has a limit on the length of the input text (the maximum length is 512), the short-long text pair and the long-long text pair in the data set in step 2 need to be truncated, and here, a truncation method for extracting a key sentence is adopted: 1) Firstly, separating sentences of long texts according to delimiters (periods, exclamation marks, question marks and the like); 2) Taking each sentence as a node, segmenting the sentences, filtering stop words, and calculating the similarity between every two sentences; 3) Constructing a node connection graph, taking the similarity between sentences as the weight of edges between corresponding nodes, and setting a threshold value to filter the edges with lower weight; 4) Calculating the weight value of each sentence according to the edge weight, wherein the calculation formula is as follows, and then iteratively propagating the weight of each node according to the connection graph until convergence; 5) And carrying out reverse sequencing according to the weight of the sentences, and selecting the sentences meeting the length requirement to be spliced according to the sequence in the original text to be used as the text after the truncation of the long text.
similarity(S_i, S_j) = \frac{\left|\{w_k \mid w_k \in S_i \wedge w_k \in S_j\}\right|}{\log(|S_i|) + \log(|S_j|)}
W(S_i) = (1-d) + d\sum_{S_j \in In(S_i)} \frac{similarity(S_j, S_i)}{\sum_{S_k \in Out(S_j)} similarity(S_j, S_k)} W(S_j)
where S_i and S_j denote two sentences, similarity(S_i, S_j) denotes their similarity, w_k denotes a word in a sentence, the numerator is the number of words that appear in both sentences, and the denominator is the sum of the logarithms of the numbers of words in the two sentences; W(S_i) denotes the weight of sentence S_i, d is the damping coefficient, and Out(S_i) denotes the neighbor nodes of S_i;
for example, for long text "natural language processing" refers to a technique of interactive communication with a machine using natural language used for human communication. The natural language is processed by human, so that the computer can read and understand the natural language. Relevant research in natural language processing begins with human exploration of machine translation. \8230'
1) Result after sentence splitting: "Natural language processing refers to the technology of interactive communication with machines using the natural language that humans use to communicate.", "Natural language is processed so that a computer can read and understand it." and "Relevant research on natural language processing began with the human exploration of machine translation." ……; the three sentences above are denoted s1, s2 and s3. 2) Each sentence is taken as a node and segmented into words, and stop words are filtered out, giving token lists such as ['natural language', 'processing', 'refer', 'use', 'human', 'use', 'natural language', 'machine', 'perform', 'interact', 'communicate', 'technique'], ['human', 'natural language', 'processing', 'computer', 'can', 'read', 'understand'], ['natural language', 'processing', 'correlation', 'study', 'start with', 'human', 'machine translation', 'explore'] ……; then the similarity between sentences is calculated, for example the similarity between s1 and s2 is 0.443 and between s1 and s3 is 0.646. 3) A connection graph is constructed from the calculated edge weights. 4) The sentence weights are propagated iteratively until convergence; the weights of s1, s2 and s3 are 0.364, 0.309 and 0.326 respectively. 5) The sentences are sorted by weight from large to small and spliced in their original order in the text until the length requirement is met; for example the order is s1, s3, s2, and if s1+s3 already reaches the required length, the final processed text is the concatenation of s1 and s3: "Natural language processing refers to the technology of interactive communication with machines using the natural language that humans use to communicate. Relevant research on natural language processing began with the human exploration of machine translation."
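A minimal sketch of the key-sentence truncation in steps 4.1 to 4.5, assuming jieba for word segmentation, a user-supplied stop-word set, and networkx's PageRank for the iterative weight propagation; the threshold and the 512-character budget are illustrative assumptions:

```python
import math
import re

import jieba
import networkx as nx


def truncate_by_key_sentences(text, stopwords, max_len=512,
                              sim_threshold=0.01, damping=0.85):
    # 4.1 split the long text into sentences on terminal punctuation
    sentences = [s.strip() for s in re.split(r"[。！？!?]", text) if s.strip()]
    # 4.2 segment each sentence and drop stop words
    tokens = [[w for w in jieba.cut(s) if w not in stopwords] for s in sentences]

    def similarity(a, b):
        common = len(set(a) & set(b))              # words shared by both sentences
        denom = math.log(len(a) + 1) + math.log(len(b) + 1)  # +1 avoids log(0)
        return common / denom if denom > 0 else 0.0

    # 4.3 build the sentence graph, keeping only edges above the threshold
    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            w = similarity(tokens[i], tokens[j])
            if w > sim_threshold:
                graph.add_edge(i, j, weight=w)
    # 4.4 propagate sentence weights iteratively until convergence
    scores = nx.pagerank(graph, alpha=damping, weight="weight")
    # 4.5 keep the highest-weighted sentences and splice them in original order
    ranked = sorted(scores, key=scores.get, reverse=True)
    chosen, used = [], 0
    for idx in ranked:
        if used + len(sentences[idx]) > max_len:
            break
        chosen.append(idx)
        used += len(sentences[idx])
    return "。".join(sentences[i] for i in sorted(chosen))
```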
5. Designing a multitasking framework
A Multi-Task Learning (MTL) framework is used, in which information between different tasks supplements each other and the parameter count of the model is greatly reduced. Coarse-grained text matching is set as task A and fine-grained text matching as task B, and the hard parameter sharing mode of MTL is used: the parameters of the bottom (shared) model are shared across tasks, two task-specific output layers on top correspond to tasks A and B, and the bottom model parameters are initialized with the Chinese language model trained in step 3.
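A minimal PyTorch sketch of the hard-parameter-sharing architecture, assuming the Hugging Face transformers library and the public hfl/chinese-roberta-wwm-ext checkpoint as a stand-in for the model continued-pre-trained in step 3; the layer-wise [CLS] mean pooling anticipates step 8, and all names are illustrative:

```python
import torch
import torch.nn as nn
from transformers import AutoModel


class MultiGranularityMatcher(nn.Module):
    """Shared RoBERTa-wwm encoder (hard parameter sharing) with two
    task-specific heads: task A = coarse-grained, task B = fine-grained."""

    def __init__(self, model_name="hfl/chinese-roberta-wwm-ext", num_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name, output_hidden_states=True)
        hidden = self.encoder.config.hidden_size
        self.head_a = nn.Linear(hidden, num_labels)   # coarse-grained output layer
        self.head_b = nn.Linear(hidden, num_labels)   # fine-grained output layer

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # mean-pool the [CLS] vector of every transformer layer as the pair representation
        cls_per_layer = torch.stack([h[:, 0] for h in outputs.hidden_states[1:]], dim=0)
        pooled = cls_per_layer.mean(dim=0)
        return self.head_a(pooled), self.head_b(pooled)
```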
6. Multitask weight optimization
The weights of the different tasks are optimized. Because the loss functions of tasks A and B change during the different stages of training and the convergence behaviour of each task differs from stage to stage, a dynamic weight must be set for each task at each stage to ensure that the model is not dominated by or biased towards one task; when the model tends to over-fit one task, the performance of the other tasks is usually affected negatively.
After each iteration, the F1 value of each task is calculated and the weight of each task is computed with the formula below; when the total loss of the multi-task model is calculated, the loss of each task is weighted with the computed weight and back-propagated. The F1 value serves as the measure of task difficulty: if the F1 value of a task is larger, the task is easier to learn and should have a smaller weight. The calculation formula of the weight is as follows:
w_i = -(1 - k_i)^{\gamma_i}\log(k_i)
where w_i represents the weight of task i at each iteration, k_i refers to the KPI, i.e. the evaluation metric of task i, here the F1 value, and γ_i is a hyper-parameter used to adjust the weight of task i.
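A minimal sketch of the dynamic task weighting, assuming the per-task F1 from the previous iteration as the KPI; the focal-style form below is one common instantiation of the described rule (the weight shrinks as F1 grows, modulated by γ) and is an assumption rather than the patent's exact formula:

```python
import math


def task_weight(f1, gamma=1.0, eps=1e-6):
    """Dynamic weight of one task: the higher its F1 (the easier the task),
    the smaller the weight; gamma controls how strongly easy tasks are
    down-weighted. Assumed focal-style form, not the patent's verbatim formula."""
    f1 = min(max(f1, eps), 1.0 - eps)
    return -((1.0 - f1) ** gamma) * math.log(f1)


# example: after one epoch with F1_A = 0.85 and F1_B = 0.60,
# the easier coarse-grained task A receives the smaller weight
w_a = task_weight(0.85)
w_b = task_weight(0.60)
```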
7. Fine tuning and training of neural network model structures
For the subtasks A and B in the multi-task model of step 6, the learning difficulty of the individual samples differs, so the samples of each subtask are weighted so that the model can focus on hard samples: at each iteration the weight of the samples misclassified last time is increased and the weight of the correctly classified samples is decreased. After each iteration, the weight of each sample is computed from the probability predicted for its true class; the larger this probability, the smaller the weight, and vice versa. Finally, the loss of the whole model is the weighted sum of the losses of tasks A and B, and the expression of the final loss function is as follows;
L = w_A\left(-\frac{1}{N_A}\sum_{i=1}^{N_A}\alpha_A(1-p_i)^{\gamma_A}\log(p_i)\right) + w_B\left(-\frac{1}{N_B}\sum_{i=1}^{N_B}\alpha_B(1-p_i)^{\gamma_B}\log(p_i)\right)
where w_A and w_B are the weights of task A and task B computed at each iteration with the weight calculation formula of step 6, N_A and N_B are the data amounts of task A and task B, α_A, α_B and γ_A, γ_B are the hyper-parameters that adjust the sample weights in tasks A and B respectively, and p_i is the probability predicted for the true class.
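A minimal PyTorch sketch of the final loss, assuming a focal-style per-sample weighting (α, γ) inside each task and the dynamic task weights w_A and w_B from step 6; this is an illustrative reading of the description rather than the patent's verbatim implementation:

```python
import torch


def focal_loss(logits, labels, alpha=0.25, gamma=2.0):
    """Per-task loss with sample weighting: hard (low-confidence) samples
    contribute more, easy samples less."""
    log_probs = torch.log_softmax(logits, dim=-1)
    log_p_true = log_probs.gather(1, labels.unsqueeze(1)).squeeze(1)  # log p_i of the true class
    p_true = log_p_true.exp()
    return (-alpha * (1.0 - p_true) ** gamma * log_p_true).mean()


def total_loss(logits_a, labels_a, logits_b, labels_b, w_a, w_b):
    """Overall model loss: task-weighted sum of the two per-task losses."""
    return w_a * focal_loss(logits_a, labels_a) + w_b * focal_loss(logits_b, labels_b)
```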
The text pairs processed in step 3 are spliced with the separator [SEP] and fed into the network model of step 7; the [CLS] outputs of every layer of the pre-trained model are mean-pooled as the representation of the text pair and fed into the two output layers respectively. Using the labels as supervision information, the neural network is trained with a gradient descent strategy, and after several iterations a neural network that can judge whether text pairs are similar under different granularities is obtained.
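A minimal sketch of the pair splicing and one training step, assuming the Hugging Face tokenizer (which inserts [CLS] and [SEP] automatically when given two texts) together with the MultiGranularityMatcher and total_loss sketches above; the example sentences, labels and optimizer settings are placeholders:

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
model = MultiGranularityMatcher()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

text_a = "北京暴雨导致多个航班延误"            # placeholder text pair
text_b = "暴雨致首都机场航班大面积取消"
enc = tokenizer(text_a, text_b, truncation=True, max_length=512,
                return_tensors="pt")          # encodes "[CLS] text_a [SEP] text_b [SEP]"

labels_coarse = torch.tensor([1])             # same topic (task A)
labels_fine = torch.tensor([1])               # same event (task B), placeholder label

logits_a, logits_b = model(enc["input_ids"], enc["attention_mask"])
loss = total_loss(logits_a, labels_coarse, logits_b, labels_fine, w_a=1.0, w_b=1.0)
loss.backward()                               # gradient-descent training step
optimizer.step()
optimizer.zero_grad()
```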

Claims (9)

1. A text matching method of texts with different lengths under different granularities is characterized by comprising the following steps,
step 1, preparing a data set, and labeling the text pairs under the different matching granularities as coarse or fine;
step 2, performing data enhancement on the data set in the step 1, and increasing the generalization capability of the model;
step 3, performing model pre-training on the data set subjected to data enhancement to obtain a pre-training model;
step 4, performing truncation processing on the long text in the data set subjected to data enhancement in the step 2 to obtain a text after the long text is truncated;
step 5, designing a multi-task framework in which information among the different model training tasks supplements each other;
step 6, optimizing the weight of the multi-task framework, and continuing to train the neural network model;
step 7, fine tuning and training the neural network model structure based on the weight-optimized multitask framework to obtain a neural network model that can judge whether text pairs are similar under different granularities; the samples of task A and task B in the multi-task model training of step 6 are weighted so that the model can focus on hard samples, the weight of the samples misclassified last time is increased and the weight of the correctly classified samples is decreased at each iteration, and finally the loss of the whole model is the weighted sum of the losses of task A and task B, the expression of the loss function of the model being as follows;
L = w_A\left(-\frac{1}{N_A}\sum_{i=1}^{N_A}\alpha_A(1-p_i)^{\gamma_A}\log(p_i)\right) + w_B\left(-\frac{1}{N_B}\sum_{i=1}^{N_B}\alpha_B(1-p_i)^{\gamma_B}\log(p_i)\right)
wherein w_A and w_B are the weights of task A and task B computed at each iteration according to the weight calculation formula in step 6, N_A and N_B are the data amounts of task A and task B respectively, α_A, α_B and γ_A, γ_B are the hyper-parameters that adjust the sample weights in tasks A and B respectively, and p_i is the probability predicted for the true class;
and step 8: the text pairs in the data set enhanced in step 2 are then spliced and fed into the network model of step 7; using the labels as supervision information, the neural network is trained with a gradient descent strategy, and after several iterations a neural network that can judge whether text pairs are similar under different granularities is obtained.
2. The method for matching texts with different lengths at different granularities according to claim 1, wherein the step 1 comprises the following steps:
step 1.1, preparing Chinese text pair data sets of different lengths, comprising two matching granularities, coarse and fine; coarse-grained matching only requires that the two texts belong to the same topic, while fine-grained matching requires that the two texts describe the same event;
and step 1.2, each granularity comprises text pairs of three different length combinations, namely short-short, short-long and long-long text pairs; the data sets under the two granularities are not completely the same, and the text pairs under the different matching granularities are labeled as coarse or fine.
3. The method for matching texts with different lengths under different granularities according to claim 1, wherein the data enhancement in step 2 comprises the following steps:
2.1, enhancing according to the transitivity rule of the similarity;
2.2, enhancement between different granularities: fine-grained matching imposes stricter conditions than coarse-grained matching, so fine-grained pairs can be used as an enhancement of the coarse-grained data.
4. The method for matching texts with different lengths under different granularities according to claim 1, wherein in step 3:
adopting an open-source RoBERTa-wwm Chinese pre-trained model and continuing pre-training on the text corpora of different lengths in this scenario; the texts from step 2 are segmented into Chinese words with the jieba library, then in each round 15% of the tokens are randomly selected, of which 80% are masked, 10% are randomly replaced with other words and the remaining 10% are left unchanged; the masking uses the WWM (whole word masking) strategy, i.e. whole words are masked, yielding the trained pre-training model;
the loss function for the pre-training phase is as follows:
L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{M} y_{ic}\log(p_{ic})
where N is the total number of samples, M is the number of classes, y_{ic} is an indicator function that takes 1 if sample i belongs to class c and 0 otherwise, and p_{ic} denotes the predicted probability that sample i belongs to class c.
5. The method for matching texts with different lengths under different granularities according to claim 1, wherein in step 4:
truncating the long texts in the short-long text pairs and the long-long text pairs of the data set of step 2, using a truncation method that extracts key sentences:
4.1, firstly, clauses are carried out on the long text according to separators;
4.2, taking each sentence as a node, dividing the sentences into words, filtering stop words, and calculating the similarity between every two sentences;
4.3, constructing a node connection graph, taking the similarity between sentences as the weight values of edges between corresponding nodes, and setting a threshold value to filter out the edges with lower weight values;
4.4, calculating the weight value of each sentence according to the weight of the edge, and then iteratively propagating the weight value of each node according to the connection graph until convergence;
and 4.5, sorting the sentences in descending order of weight, and splicing the sentences that meet the length requirement in their original order in the text to form the truncated text.
6. The method for matching texts with different lengths under different granularities according to claim 1, wherein in step 5:
step 5: using the MTL multi-task model training framework, in which information among the different model training tasks supplements each other; coarse-grained text matching is set as task A and fine-grained text matching as task B, and the hard parameter sharing mode of MTL is used, i.e. the pre-trained model parameters are shared among the different model training tasks under the multi-task model framework.
7. The method for matching texts with different lengths under different granularities according to claim 1, wherein in step 6:
optimizing the weights of different model training tasks, wherein in different training stages, each stage sets dynamic weight for each task, the F1 value of each task after each iteration is used as an evaluation standard of model difficulty, if the F1 value of a certain task is larger, the task is easy to learn, and the task has smaller weight, and the calculation formula of the weights is as follows:
w_i = -(1 - k_i)^{\gamma_i}\log(k_i)
wherein w_i represents the weight of task i at each iteration, k_i refers to the KPI, i.e. the evaluation metric of task i, here the F1 value, and γ_i is a hyper-parameter used to adjust the weight of task i.
8. A text matching device for texts with different lengths under different granularities is characterized by comprising the following modules,
the data set preparation module is used for preparing a data set and labeling the text pairs under the different matching granularities as coarse or fine, for training the neural network;
the data set enhancing module is used for enhancing the data set of the data set preparing module to increase the generalization capability of the model;
the pre-training model module is used for performing model pre-training on the data set subjected to data enhancement to obtain a pre-training model;
the long text truncation module is used for truncating the long text in the data set after the data enhancement of the data set enhancement module to obtain the text after the truncation of the long text;
the multi-task framework module is used for designing a multi-task framework, and information among different model training tasks is mutually supplemented;
the weight optimization module is used for optimizing the weight of the multi-task framework and continuing to train the neural network model;
the neural network module is used for carrying out fine adjustment and training on the neural network model structure based on the multitask framework after weight optimization to obtain a neural network model which can judge whether text pairs are similar under different granularities;
the samples of task A and task B in the multi-task model training of the weight optimization module are weighted so that the model can focus on hard samples; at each iteration the weight of the samples misclassified last time is increased and the weight of the correctly classified samples is decreased; finally, the loss of the whole model is the weighted sum of the losses of task A and task B, and the expression of the final loss function is as follows;
L = w_A\left(-\frac{1}{N_A}\sum_{i=1}^{N_A}\alpha_A(1-p_i)^{\gamma_A}\log(p_i)\right) + w_B\left(-\frac{1}{N_B}\sum_{i=1}^{N_B}\alpha_B(1-p_i)^{\gamma_B}\log(p_i)\right)
wherein w_A and w_B are the weights of task A and task B computed at each iteration according to the weight calculation formula in step 6, N_A and N_B are the data amounts of task A and task B respectively, α_A, α_B and γ_A, γ_B are the hyper-parameters that adjust the sample weights in tasks A and B respectively, and p_i is the probability predicted for the true class;
and then splicing the text pairs in the data set enhanced by the data set enhancement module and feeding them into the neural network model of the neural network module; using the labels as supervision information, the neural network is trained with a gradient descent strategy, and after several iterations a neural network that can judge whether text pairs are similar under different granularities is obtained.
9. A storage medium, wherein the storage medium stores a program for text matching of texts with different lengths under different granularities, and when a GPU executes the program, the text matching method for texts with different lengths under different granularities according to any one of claims 1 to 7 is implemented.
CN202111023691.3A 2021-09-01 2021-09-01 Text matching method and device for texts with different lengths under different granularities Active CN113688621B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111023691.3A CN113688621B (en) 2021-09-01 2021-09-01 Text matching method and device for texts with different lengths under different granularities

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111023691.3A CN113688621B (en) 2021-09-01 2021-09-01 Text matching method and device for texts with different lengths under different granularities

Publications (2)

Publication Number Publication Date
CN113688621A CN113688621A (en) 2021-11-23
CN113688621B true CN113688621B (en) 2023-04-07

Family

ID=78584919

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111023691.3A Active CN113688621B (en) 2021-09-01 2021-09-01 Text matching method and device for texts with different lengths under different granularities

Country Status (1)

Country Link
CN (1) CN113688621B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113988085B (en) * 2021-12-29 2022-04-01 深圳市北科瑞声科技股份有限公司 Text semantic similarity matching method and device, electronic equipment and storage medium
CN114942980B (en) * 2022-07-22 2022-12-27 北京搜狐新媒体信息技术有限公司 Method and device for determining text matching

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108920648A (en) * 2018-07-03 2018-11-30 四川大学 A cross-modal matching method based on music-image semantic relationships
CN111460149A (en) * 2020-03-27 2020-07-28 科大讯飞股份有限公司 Text classification method, related equipment and readable storage medium
CN112966103A (en) * 2021-02-05 2021-06-15 成都信息工程大学 Mixed attention mechanism text title matching method based on multi-task learning
CN113158665A (en) * 2021-04-02 2021-07-23 西安交通大学 Method for generating text abstract and generating bidirectional corpus-based improved dialog text
CN113239700A (en) * 2021-04-27 2021-08-10 哈尔滨理工大学 Text semantic matching device, system, method and storage medium for improving BERT

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210241147A1 (en) * 2020-11-02 2021-08-05 Beijing More Health Technology Group Co. Ltd. Method and device for predicting pair of similar questions and electronic equipment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108920648A (en) * 2018-07-03 2018-11-30 四川大学 A cross-modal matching method based on music-image semantic relationships
CN111460149A (en) * 2020-03-27 2020-07-28 科大讯飞股份有限公司 Text classification method, related equipment and readable storage medium
CN112966103A (en) * 2021-02-05 2021-06-15 成都信息工程大学 Mixed attention mechanism text title matching method based on multi-task learning
CN113158665A (en) * 2021-04-02 2021-07-23 西安交通大学 Method for generating text abstract and generating bidirectional corpus-based improved dialog text
CN113239700A (en) * 2021-04-27 2021-08-10 哈尔滨理工大学 Text semantic matching device, system, method and storage medium for improving BERT

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Hanyan Duan et al. Multi-Task Semantic Matching Model for Small Noisy Data Set. The 16th International Conference on Computer Science & Education. 2021, pp. 1114-1119. *
刘奕洋; 余正涛; 高盛祥; 郭军军; 张亚飞; 聂冰鸽. Chinese Named Entity Recognition Method Based on Machine Reading Comprehension. Pattern Recognition and Artificial Intelligence. 2020, Vol. 33(07), pp. 653-659. *
李芳芳 et al. A Machine Reading Comprehension Model for Legal Texts Based on Multi-Task Joint Training. Journal of Chinese Information Processing. 2021, Vol. 35(35), pp. 109-117. *
高荣星 et al. A High-Level Semantic Concept Extraction Method Based on Adaboost-SVM. Computer Applications and Software. 2012, Vol. 29(29), pp. 24-26. *

Also Published As

Publication number Publication date
CN113688621A (en) 2021-11-23

Similar Documents

Publication Publication Date Title
CN110866117B (en) Short text classification method based on semantic enhancement and multi-level label embedding
CN109753566B (en) Model training method for cross-domain emotion analysis based on convolutional neural network
CN106503192A (en) Name entity recognition method and device based on artificial intelligence
Ju et al. An efficient method for document categorization based on word2vec and latent semantic analysis
CN113688621B (en) Text matching method and device for texts with different lengths under different granularities
CN112699216A (en) End-to-end language model pre-training method, system, device and storage medium
CN108038106B (en) Fine-grained domain term self-learning method based on context semantics
CN112395393A (en) Remote supervision relation extraction method based on multitask and multiple examples
CN111460157A (en) Cyclic convolution multitask learning method for multi-field text classification
CN112199505B (en) Cross-domain emotion classification method and system based on feature representation learning
Banik et al. Gru based named entity recognition system for bangla online newspapers
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN114265937A (en) Intelligent classification analysis method and system of scientific and technological information, storage medium and server
Adeleke et al. Automating quranic verses labeling using machine learning approach
CN114428850A (en) Text retrieval matching method and system
Varghese et al. Bidirectional LSTM joint model for intent classification and named entity recognition in natural language understanding
Gourru et al. Document network projection in pretrained word embedding space
CN110728135A (en) Text theme indexing method and device, electronic equipment and computer storage medium
CN116304063B (en) Simple emotion knowledge enhancement prompt tuning aspect-level emotion classification method
CN112463982A (en) Relationship extraction method based on explicit and implicit entity constraint
CN112084312A (en) Intelligent customer service system constructed based on knowledge graph
Veisi Central Kurdish Sentiment Analysis Using Deep Learning.
Kim et al. CNN based sentence classification with semantic features using word clustering
Gao et al. Chinese short text classification method based on word embedding and Long Short-Term Memory Neural Network
CN113051892A (en) Chinese word sense disambiguation method based on transformer model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant