CN114003698B - Text retrieval method, system, equipment and storage medium - Google Patents

Text retrieval method, system, equipment and storage medium

Info

Publication number
CN114003698B
CN114003698B CN202111609947.9A
Authority
CN
China
Prior art keywords
encoder
model
training
text retrieval
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111609947.9A
Other languages
Chinese (zh)
Other versions
CN114003698A (en)
Inventor
郭湘
黄鹏
江岭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Xiaoduo Technology Co ltd
Original Assignee
Chengdu Xiaoduo Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Xiaoduo Technology Co ltd filed Critical Chengdu Xiaoduo Technology Co ltd
Priority to CN202111609947.9A priority Critical patent/CN114003698B/en
Publication of CN114003698A publication Critical patent/CN114003698A/en
Application granted granted Critical
Publication of CN114003698B publication Critical patent/CN114003698B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3347Query execution using vector based model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3334Selection or weighting of terms from queries, including natural language queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3335Syntactic pre-processing, e.g. stopword elimination, stemming
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3337Translation of the query language, e.g. Chinese to English
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3338Query expansion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a text retrieval method, system, device and storage medium, comprising the following steps: using a pre-trained language model as an encoder, and performing self-attention and mask processing on a batch of labeled similar sentence pairs through the encoder; pooling the final encoding and guiding training according to a cross-entropy loss function; constructing a positive sample x⁺ for an input x through data enhancement, and inputting x and x⁺ into the encoder to obtain representation vectors h and h⁺; calculating the similarity between each representation vector and the other vectors in the batch, ranking the candidate texts by similarity, and guiding iterative training of the network parameters with the final loss function; and performing text retrieval based on the trained model. By adding supervised training on labeled samples, the method enhances the generalization capability of the model; based on the attention-mask mechanism, the model gains the ability to reason about similar texts; and based on contrastive learning, the model gains text retrieval capability in an unsupervised manner.

Description

Text retrieval method, system, equipment and storage medium
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a text retrieval method, system, device and storage medium.
Background
On today's internet, search has become the main way people learn about the world, and at the core of search is judging the relevance between texts. Text similarity is conventionally determined by labeling similar text pairs and training models in a supervised manner, for example the conventional two-tower models DSSM and ESIM, but this approach has at least the following disadvantages: (1) labeling consumes a large amount of manpower and is not suitable for large-scale text retrieval tasks such as Baidu search; (2) out-of-domain data cannot be used, because its distribution does not match the labeled data; (3) the model is very difficult to update: natural language changes day by day, and whenever new words appear they must be labeled and the model retrained before text retrieval can be completed.
Disclosure of Invention
The invention aims to provide a text retrieval method, system, device and storage medium that require only a small amount of labeled data for supervised learning and perform unsupervised training on large data sets by means of contrastive learning, making them applicable to text retrieval across the whole domain and solving the problems identified in the background art.
The embodiment of the invention is realized by the following technical scheme: a text retrieval method comprises the following steps:
S1, using a pre-trained language model as an encoder, and performing self-attention and local mask processing on a batch of labeled similar sentence pairs through the encoder;
S2, performing maximum pooling and average pooling on the final encoding of the last layer of the network, splicing the two results, calculating a loss function of the encoder through a softmax function and cross entropy, and guiding the training of the encoder according to the loss function;
S3, given an input x, constructing a positive sample x⁺ by means of data enhancement, and inputting x and x⁺ into the encoder for fitting training to obtain two representation vectors h and h⁺;
S4, respectively calculating the similarity between each representation vector and the other vectors in the batch, ranking the candidate texts with the similarity as the retrieval matching score, calculating the final loss function of the model through a softmax function and cross entropy, and guiding iterative training of the network parameters according to the final loss function;
S5, performing text retrieval based on the trained model.
Further, the pre-trained language model adopts one of Bert, Roberta or tiny_Bert.
According to a preferred embodiment, the pre-trained language model adopts the Bert structure, wherein performing self-attention and local mask processing on a batch of labeled similar sentence pairs through the encoder comprises:

inputting the batch of labeled similar sentence pairs into Bert and performing local masking with the attention layer, expressed as:

$$\mathrm{Attention}(Q_l, K_l, V_l) = \mathrm{softmax}\!\left(\frac{Q_l K_l^{\top}}{\sqrt{d_k}} + M\right)V_l, \qquad Q_l = W_Q^{l}\,h^{l-1},\; K_l = W_K^{l}\,h^{l-1},\; V_l = W_V^{l}\,h^{l-1}$$

In the above formula, $Q_l$, $K_l$ and $V_l$ respectively represent the query, key and value vector sequences, $l$ denotes the network layer, $h^{l-1}$ denotes the output of the previous layer, $W_Q^{l}$, $W_K^{l}$ and $W_V^{l}$ denote trainable parameters, $Q_l K_l^{\top}$ denotes the pairwise product of the $Q$ and $K$ vectors, $d_k$ denotes the encoding dimension, $\mathrm{softmax}$ denotes the normalization function, $M$ denotes the mask matrix indicating whether a position is masked, and $V_l$ is the value input of the attention operation at network layer $l$.
According to a preferred embodiment, the expression for performing maximum pooling and average pooling on the final encoding of the last layer of the network and splicing the two results is as follows:

$$h = W\,[\,\mathrm{maxpool}(H_L)\,;\,\mathrm{avgpool}(H_L)\,]$$

In the above formula, $W$ denotes the weight parameter, $H_L$ denotes the final encoding of the last layer of the network, $\mathrm{maxpool}$ denotes the maximum pooling operation, and $\mathrm{avgpool}$ denotes the average pooling operation.

The expression of the loss function of the encoder calculated through the softmax function and cross entropy is as follows:

$$L_{ce} = \mathrm{CrossEntropy}\big(\mathrm{softmax}(h),\, y\big)$$

In the above formula, $L_{ce}$ denotes the cross-entropy loss function and $y$ denotes the true label corresponding to the similar sentence pair.
Further, the data enhancement mode includes but is not limited to synonym replacement, sentence truncation, back-translation, punctuation addition, deletion of unimportant words and word-order rearrangement.
According to a preferred embodiment, the data enhancement adopts synonym replacement, wherein constructing the positive sample x⁺ by means of data enhancement comprises:

performing word segmentation on x, and selecting the segmented words that appear in the synonym data set to form the set S;

generating a random number n smaller than the length of the set S, and performing synonym replacement by sampling uniformly from S, expressed as:

$$x^{+} = \mathrm{Replace}\big(x,\ \mathrm{Uniform}(S, n)\big), \qquad n < \mathrm{len}(S)$$

In the above formula, $\mathrm{len}(S)$ denotes the length of the set S.
According to a preferred embodiment, the final loss function of the model is calculated through a softmax function and cross entropy, and the obtained final loss function is expressed as:

$$L(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\,\mathrm{sim}(h_i,\,h_i^{+})/\tau}}{\sum_{j=1}^{N}\Big(e^{\,\mathrm{sim}(h_i,\,h_j^{+})/\tau} + \big(1-\mathbb{1}_{[i=j]}\big)\,e^{\,\mathrm{sim}(h_i,\,h_j)/\tau}\Big)}$$

In the above formula, $\mathbb{1}_{[i=j]}$ denotes the indicator that equals 1 only when $i = j$, $N$ denotes the total number of sentences in the batch, $\tau$ denotes the temperature hyperparameter, $\mathrm{sim}(h_i, h_i^{+})$, $\mathrm{sim}(h_i, h_j^{+})$ and $\mathrm{sim}(h_i, h_j)$ denote cosine similarities, $h_j^{+}$ denotes the representation vector of the positive sample $x_j^{+}$ of $x_j$, $h_j$ denotes the representation vector of the in-batch negative sample $x_j$ of $x_i$, and $\theta$ denotes the training parameters of the model.
The invention also provides a text retrieval system applied to the above method, comprising:

an encoder fine-tuning module, configured to use a pre-trained language model as the encoder and perform self-attention and local mask processing on a batch of labeled similar sentence pairs through the encoder; perform maximum pooling and average pooling on the final encoding of the last layer of the network, splice the two results, calculate a loss function of the encoder through a softmax function and cross entropy, and guide the training of the encoder according to the loss function;

a contrastive learning module, configured to, given an input x, construct a positive sample x⁺ by means of data enhancement, and input x and x⁺ into the encoder for fitting training to obtain two representation vectors h and h⁺; respectively calculate the similarity between each representation vector and the other vectors in the batch, rank the candidate texts with the similarity as the retrieval matching score, calculate the final loss function of the model through a softmax function and cross entropy, and guide iterative training of the network parameters according to the final loss function;

and a text retrieval module, configured to perform text retrieval based on the trained model.
The present invention also provides an electronic device comprising:
a memory storing execution instructions; and
a processor executing execution instructions stored by the memory to cause the processor to perform the method as described above.
The invention also provides a readable storage medium having stored therein executable instructions, which when executed by a processor, are adapted to implement the method as described above.
The technical scheme of the embodiment of the invention has at least the following advantages and beneficial effects: by adding supervised training on labeled samples, the generalization capability of the model is enhanced while using only a small amount of labeled data; based on the attention-mask mechanism of the encoder, the model has similar-text reasoning capability and can predict vector representations of similar texts; by adding synonym replacement, the model improves its scoring of synonym similarity and thereby its text matching capability; and the method obtains higher similarity scores for similar sentences of different lengths.
Drawings
Fig. 1 is a schematic flowchart of a text retrieval method according to embodiment 1 of the present invention;
FIG. 2 is a diagram of a model framework provided in embodiment 1 of the present invention;
FIG. 3 is a schematic diagram of an Attention Mask provided in example 1 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Example 1
Through the applicant's research, the conventional two-tower models DSSM, ESIM and the like are mainly characterized by labeling similar text pairs so that the models can be trained in a supervised manner, but this approach has at least the following disadvantages: (1) labeling consumes a large amount of manpower and is not suitable for large-scale retrieval tasks such as Baidu search; (2) out-of-domain data cannot be used, because its distribution does not match the labeled data; (3) the model is very difficult to update: natural language changes day by day, and whenever new words appear they must be labeled and the model retrained before text retrieval can be completed.
Therefore, the embodiment of the invention provides a text retrieval method based on an attention mask and a contrastive-learning pre-trained model: only a small amount of labeled data is needed for supervised learning, and unsupervised training is carried out on large data sets by means of contrastive learning, making the method applicable to text retrieval across the whole domain and solving the problems identified in the background art. The specific scheme is as follows:
referring to fig. 1, the text retrieval method based on the attention mask and the contrast learning pre-training model mainly includes two steps, an encoder fine tuning step and a contrast learning step, which are described in detail below.
Firstly, an encoder fine tuning step is executed:
using a pre-trained language model as an encoder, and performing self-attention and local mask processing on a batch of labeled similar sentence pairs through the encoder; it should be noted that, in the present application, by adding supervised training with labeled samples, the generalization capability of the model can be enhanced while using only a small amount of labeled data.
In an implementation manner of this embodiment, the pre-trained language model adopts the Bert structure; alternatively, the pre-trained language model may also adopt Roberta or tiny_Bert, which is not described in detail again.
This embodiment is mainly illustrated with the Transformer-based Bert, wherein performing self-attention and local mask processing on a batch of labeled similar sentence pairs through the encoder specifically includes:
referring to fig. 2, a batch of similar sentences after label processing is input into Bert, and local masking is performed by using an attention layer, and the expression is as follows:
Figure 918118DEST_PATH_IMAGE058
in the above formula, the first and second carbon atoms are,
Figure DEST_PATH_IMAGE059
wherein
Figure 887473DEST_PATH_IMAGE060
Respectively represent
Figure DEST_PATH_IMAGE061
The sequence of vectors of (a) is,
Figure 592124DEST_PATH_IMAGE062
the representation of the network layer is shown,
Figure DEST_PATH_IMAGE063
the output of the upper layer is represented as,
Figure 133964DEST_PATH_IMAGE064
a representation of a trainable parameter is provided that,
Figure 367499DEST_PATH_IMAGE065
to represent
Figure 729210DEST_PATH_IMAGE066
The vectors of (A) are multiplied by two in pairs,
Figure 706393DEST_PATH_IMAGE067
the number of dimensions representing the encoding is indicated,
Figure 837161DEST_PATH_IMAGE068
the expression of the normalization function is used,
Figure 241597DEST_PATH_IMAGE069
indicates whether to proceed or not
Figure 90604DEST_PATH_IMAGE070
Figure 871479DEST_PATH_IMAGE071
Finger-shaped
Figure 856752DEST_PATH_IMAGE072
The matrix is a matrix of a plurality of matrices,V l to represent
Figure 432090DEST_PATH_IMAGE059
At the network layerlWhen performing attention operation, inputV
The Attention Mask gives the model Seq2Seq capability, as illustrated below. The input sentence and the target sentence are spliced into a single sequence — [CLS] input [SEP] target [SEP] — and the Attention Mask of FIG. 3 is then applied; it should be noted that the tokens of the [CLS] input [SEP] segment attend to each other bidirectionally, while the tokens of the target segment attend in one direction only, which allows the target segment and its closing [SEP] to be predicted recursively, token by token.
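To make this concrete, the following is a minimal Python/PyTorch sketch (not part of the patent) of how such a partial attention mask could be built for a "[CLS] input [SEP] target [SEP]" sequence. The function name and the additive 0/−inf mask convention are illustrative assumptions; the exact mask layout used in the embodiment may differ.

```python
import torch

def build_partial_attention_mask(input_len: int, target_len: int) -> torch.Tensor:
    """Illustrative mask for a sequence laid out as [CLS] input [SEP] | target [SEP].

    Positions of the input segment (including [CLS] and its [SEP]) attend
    bidirectionally within the input segment; positions of the target segment
    attend to the whole input segment plus the target tokens to their left,
    which allows the target to be generated token by token.
    Returns a (seq_len, seq_len) matrix with 0 = attend and -inf = masked,
    suitable for adding to the attention scores before the softmax.
    """
    seg_a = input_len + 2          # [CLS] + input tokens + [SEP]
    seg_b = target_len + 1         # target tokens + [SEP]
    seq_len = seg_a + seg_b
    mask = torch.full((seq_len, seq_len), float("-inf"))
    mask[:seg_a, :seg_a] = 0.0     # input segment: bidirectional within itself
    mask[seg_a:, :seg_a] = 0.0     # target segment: sees the full input segment ...
    for i in range(seg_a, seq_len):
        mask[i, seg_a:i + 1] = 0.0 # ... and only target tokens up to itself
    return mask

# Example: 6 input tokens and 6 target tokens
m = build_partial_attention_mask(6, 6)
print(m.shape)  # torch.Size([15, 15])
```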
Further, pooling calculation is performed on the last layer of Bert, which specifically includes: performing maximum pooling and average pooling on the final encoding of the last layer of the network and splicing the two results, expressed as:

$$h = W\,[\,\mathrm{maxpool}(H_L)\,;\,\mathrm{avgpool}(H_L)\,]$$

In the above formula, $W$ denotes the weight parameter, $H_L$ denotes the final encoding of the last layer of the network, $\mathrm{maxpool}$ denotes the maximum pooling operation, and $\mathrm{avgpool}$ denotes the average pooling operation.

Further, the loss function of the encoder is calculated through the softmax function and cross entropy, expressed as:

$$L_{ce} = \mathrm{CrossEntropy}\big(\mathrm{softmax}(h),\, y\big)$$

In the above formula, $L_{ce}$ denotes the cross-entropy loss function and $y$ denotes the true label corresponding to the similar sentence pair.
Back-propagation is performed according to the loss function to guide the training of the encoder, and through this training an encoder with a certain similar-text matching capability is obtained. Based on the attention-mask mechanism of the encoder, the model has similar-text reasoning capability and can predict vector representations of similar texts.
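As an illustration only, the following Python/PyTorch sketch shows one way the pooling head and cross-entropy loss described above could be implemented; the class name, hidden size and two-class label layout are assumptions rather than the patent's reference code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PooledClassificationHead(nn.Module):
    """Max-pool and mean-pool the last-layer token encodings H, splice the two
    results, project with a weight W, and compute a softmax cross-entropy loss
    against the similar-sentence-pair labels."""

    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.W = nn.Linear(2 * hidden_size, num_labels)

    def forward(self, H: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # H: (batch, seq_len, hidden) — final encoding of the last layer
        h_max = H.max(dim=1).values        # maximum pooling over tokens
        h_avg = H.mean(dim=1)              # average pooling over tokens
        h = torch.cat([h_max, h_avg], dim=-1)  # splice the two results
        logits = self.W(h)
        # F.cross_entropy applies log-softmax internally
        return F.cross_entropy(logits, labels)

# Toy usage with random tensors standing in for the encoder output
head = PooledClassificationHead(hidden_size=768, num_labels=2)
H = torch.randn(4, 32, 768)                # batch of 4 sentence pairs
y = torch.tensor([1, 0, 1, 1])             # similar / not-similar labels
loss = head(H, y)
loss.backward()                            # back-propagate to guide encoder training
```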
After the supervised training of the encoder is completed, a contrastive learning step is further performed:
the encoder with similar text matching capability is obtained through the encoder fine tuning step, and in order to perform large-scale similar text matching, a comparison learning mode is further used on the encoder.
First, given an input x, a positive sample x⁺ is constructed by means of data enhancement. The data enhancement mode includes but is not limited to synonym replacement, sentence truncation, back-translation, punctuation addition, deletion of unimportant words and word-order rearrangement; this embodiment describes in detail the data enhancement method of synonym replacement, which includes the following sub-steps:

performing word segmentation on x, and selecting from the segmented words those that appear in the synonym data set to form the set S. It should be noted that the synonym data set in this embodiment is formed by combining the HIT (Harbin Institute of Technology) synonym lexicon with self-made data.
Further, a random number n smaller than the length of the set S is generated, and synonym replacement is performed by sampling uniformly from S, expressed as:

$$x^{+} = \mathrm{Replace}\big(x,\ \mathrm{Uniform}(S, n)\big), \qquad n < \mathrm{len}(S)$$

In the above formula, $\mathrm{len}(S)$ denotes the length of the set S.
The selected words are then replaced with synonyms drawn at random from the synonym data set, which completes the data enhancement and constructs the positive sample x⁺. It should be noted that, by adding synonym replacement, the model improves its scoring of synonym similarity and thereby its text matching capability.
Further, unsupervised contrastive learning is carried out, with positive and negative examples constructed in the following manner: x and x⁺ are input into the encoder for fitting training to obtain two representation vectors h and h⁺, which form a positive example pair; another input x_j randomly sampled within the batch serves as a negative example of x.
Further, the similarity between each representation vector and the other vectors in the batch is calculated. In this embodiment cosine similarity is adopted; similarity based on Euclidean distance, Jaccard similarity or the like may also be used, and cosine similarity is taken as the example for explanation. The candidate texts are ranked with the computed cosine similarity as the retrieval matching score, and the final loss function of the model is calculated through a softmax function and cross entropy, the obtained final loss function being expressed as:
$$L(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\,\mathrm{sim}(h_i,\,h_i^{+})/\tau}}{\sum_{j=1}^{N}\Big(e^{\,\mathrm{sim}(h_i,\,h_j^{+})/\tau} + \big(1-\mathbb{1}_{[i=j]}\big)\,e^{\,\mathrm{sim}(h_i,\,h_j)/\tau}\Big)}$$

In the above formula, $\mathbb{1}_{[i=j]}$ denotes the indicator that equals 1 only when $i = j$, $N$ denotes the total number of sentences in the batch, $\tau$ denotes the temperature hyperparameter, $\mathrm{sim}(h_i, h_i^{+})$, $\mathrm{sim}(h_i, h_j^{+})$ and $\mathrm{sim}(h_i, h_j)$ denote cosine similarities, $h_j^{+}$ denotes the representation vector of the positive sample $x_j^{+}$ of $x_j$, $h_j$ denotes the representation vector of the in-batch negative sample $x_j$ of $x_i$, and $\theta$ denotes the training parameters of the model.
Further, iterative training of the network parameters is guided according to the final loss function. In this embodiment, the temperature hyperparameter τ is set to 0.05 and N (the total number of sentences in the batch) is set to 128; experiments show that with this setting the model can be fitted in only one epoch, and the method obtains higher similarity scores for similar sentences of different lengths.
Further, text retrieval is performed based on the trained model.
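As an illustrative sketch of this retrieval step (assuming the candidate texts have already been encoded offline with the trained encoder), cosine similarity can serve as the retrieval matching score as follows; the function and variable names are hypothetical.

```python
import torch
import torch.nn.functional as F

def retrieve(query_vec: torch.Tensor, corpus_vecs: torch.Tensor, top_k: int = 5):
    """Rank candidate texts by cosine similarity to the query.

    query_vec:   (d,)   representation of the query from the trained encoder
    corpus_vecs: (M, d) pre-computed representations of the candidate texts
    Returns the indices and scores of the top-k matches.
    """
    q = F.normalize(query_vec, dim=-1)
    c = F.normalize(corpus_vecs, dim=-1)
    scores = c @ q                          # cosine similarity as the matching score
    top = torch.topk(scores, k=min(top_k, scores.numel()))
    return top.indices.tolist(), top.values.tolist()

# Toy usage with random vectors standing in for encoder outputs
corpus = torch.randn(1000, 768)
query = torch.randn(768)
idx, score = retrieve(query, corpus, top_k=3)
print(idx, score)
```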
The embodiment of the invention also provides a text retrieval system based on the attention mask and the contrastive-learning pre-trained model, applied to the above method and comprising:

an encoder fine-tuning module, configured to use a pre-trained language model as the encoder and perform self-attention and local mask processing on a batch of labeled similar sentence pairs through the encoder; perform maximum pooling and average pooling on the final encoding of the last layer of the network, splice the two results, calculate a loss function of the encoder through a softmax function and cross entropy, and guide the training of the encoder according to the loss function;

a contrastive learning module, configured to, given an input x, construct a positive sample x⁺ by means of data enhancement, and input x and x⁺ into the encoder for fitting training to obtain two representation vectors h and h⁺; respectively calculate the cosine similarity between each representation vector and the other vectors in the batch, rank the candidate texts with the cosine similarity as the retrieval matching score, calculate the final loss function of the model through a softmax function and cross entropy, and guide iterative training of the network parameters according to the final loss function;

and a text retrieval module, configured to perform text retrieval based on the trained model.
An embodiment of the present invention further provides an electronic device, including:
a memory storing execution instructions; and
a processor executing execution instructions stored by the memory to cause the processor to perform the method as described above.
The embodiment of the present invention further provides a readable storage medium, in which an execution instruction is stored, and the execution instruction is used for implementing the method described above when executed by a processor.
In summary, by adding supervised training on labeled samples, the method and device enhance the generalization capability of the model while using only a small amount of labeled data; based on the attention-mask mechanism of the encoder, the model has similar-text reasoning capability and can predict vector representations of similar texts; by adding synonym replacement, the model improves its scoring of synonym similarity and thereby its text matching capability; and the method obtains higher similarity scores for similar sentences of different lengths.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A text retrieval method, characterized by comprising the following steps:
S1, using a pre-trained language model with a Bert structure as an encoder, and performing self-attention and local mask processing on a batch of labeled similar sentence pairs through the encoder, specifically comprising: inputting the batch of labeled similar sentence pairs into Bert and performing local masking with the attention layer;
S2, performing maximum pooling and average pooling on the final encoding of the last layer of the network, splicing the two results, calculating a loss function of the encoder through a softmax function and cross entropy, and guiding the training of the encoder according to the loss function;
S3, given an input x, constructing a positive sample x⁺ by means of data enhancement, and inputting x and x⁺ into the encoder for fitting training to obtain two representation vectors h and h⁺;
S4, respectively calculating the similarity between each representation vector and the other vectors in the batch, ranking the candidate texts with the similarity as the retrieval matching score, calculating the final loss function of the model through a softmax function and cross entropy, and guiding iterative training of the network parameters according to the final loss function;
S5, performing text retrieval based on the trained model.
2. The text retrieval method of claim 1, wherein the pre-trained language model employs one of Bert, Roberta or tiny_Bert.
3. The text retrieval method of claim 1, wherein the batch of labeled similar sentence pairs is input into Bert and local masking is performed by the attention layer, expressed as:

$$\mathrm{Attention}(Q_l, K_l, V_l) = \mathrm{softmax}\!\left(\frac{Q_l K_l^{\top}}{\sqrt{d_k}} + M\right)V_l, \qquad Q_l = W_Q^{l}\,h^{l-1},\; K_l = W_K^{l}\,h^{l-1},\; V_l = W_V^{l}\,h^{l-1}$$

wherein $Q_l$, $K_l$ and $V_l$ respectively represent the query, key and value vector sequences, $l$ denotes the network layer, $h^{l-1}$ denotes the output of the previous layer, $W_Q^{l}$, $W_K^{l}$ and $W_V^{l}$ denote trainable parameters, $Q_l K_l^{\top}$ denotes the pairwise product of the $Q$ and $K$ vectors, $d_k$ denotes the encoding dimension, $\mathrm{softmax}$ denotes the normalization function, $M$ denotes the mask matrix indicating whether a position is masked, and $V_l$ is the value input of the attention operation at network layer $l$.
4. The text retrieval method of claim 1, wherein the expression for performing maximum pooling and average pooling on the final encoding of the last layer of the network and splicing the two results is as follows:

$$h = W\,[\,\mathrm{maxpool}(H_L)\,;\,\mathrm{avgpool}(H_L)\,]$$

wherein $W$ denotes the weight parameter, $H_L$ denotes the final encoding of the last layer of the network, $\mathrm{maxpool}$ denotes the maximum pooling operation, and $\mathrm{avgpool}$ denotes the average pooling operation;

and the expression of the loss function of the encoder calculated through the softmax function and cross entropy is as follows:

$$L_{ce} = \mathrm{CrossEntropy}\big(\mathrm{softmax}(h),\, y\big)$$

wherein $L_{ce}$ denotes the cross-entropy loss function and $y$ denotes the true label corresponding to the similar sentence pair.
5. The method of claim 1, wherein the data enhancement mode includes but is not limited to synonym replacement, sentence truncation, back-translation, punctuation addition, deletion of unimportant words and word-order rearrangement.
6. The text retrieval method of claim 1, wherein the data enhancement adopts synonym replacement, and wherein constructing the positive sample x⁺ by means of data enhancement comprises:
performing word segmentation on x, and selecting the segmented words that appear in the synonym data set to form the set S;
generating a random number n smaller than the length of the set S, and performing synonym replacement by sampling uniformly from S, expressed as:

$$x^{+} = \mathrm{Replace}\big(x,\ \mathrm{Uniform}(S, n)\big), \qquad n < \mathrm{len}(S)$$

wherein $\mathrm{len}(S)$ denotes the length of the set S.
7. The text retrieval method of claim 1, wherein the final loss function of the model calculated through the softmax function and cross entropy is expressed as:

$$L(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\,\mathrm{sim}(h_i,\,h_i^{+})/\tau}}{\sum_{j=1}^{N}\Big(e^{\,\mathrm{sim}(h_i,\,h_j^{+})/\tau} + \big(1-\mathbb{1}_{[i=j]}\big)\,e^{\,\mathrm{sim}(h_i,\,h_j)/\tau}\Big)}$$

wherein $\mathbb{1}_{[i=j]}$ denotes the indicator that equals 1 only when $i = j$, $N$ denotes the total number of sentences in the batch, $\tau$ denotes the temperature hyperparameter, $\mathrm{sim}(h_i, h_i^{+})$, $\mathrm{sim}(h_i, h_j^{+})$ and $\mathrm{sim}(h_i, h_j)$ denote cosine similarities, $h_j^{+}$ denotes the representation vector of the positive sample $x_j^{+}$ of $x_j$, $h_j$ denotes the representation vector of the in-batch negative sample $x_j$ of $x_i$, and $\theta$ denotes the training parameters of the model.
8. A text retrieval system applied to the method of any one of claims 1 to 7, comprising:
an encoder fine-tuning module, configured to use a pre-trained language model with a Bert structure as the encoder and perform self-attention and local mask processing on a batch of labeled similar sentence pairs through the encoder, specifically comprising: inputting the batch of labeled similar sentence pairs into Bert and performing local masking with the attention layer; performing maximum pooling and average pooling on the final encoding of the last layer of the network, splicing the two results, calculating a loss function of the encoder through a softmax function and cross entropy, and guiding the training of the encoder according to the loss function;
a contrastive learning module, configured to, given an input x, construct a positive sample x⁺ by means of data enhancement, input x and x⁺ into the encoder for fitting training to obtain two representation vectors h and h⁺, respectively calculate the similarity between each representation vector and the other vectors in the batch, rank the candidate texts with the similarity as the retrieval matching score, calculate the final loss function of the model through a softmax function and cross entropy, and guide iterative training of the network parameters according to the final loss function;
and a text retrieval module, configured to perform text retrieval based on the trained model.
9. An electronic device, comprising:
a memory storing execution instructions; and
a processor executing execution instructions stored by the memory to cause the processor to perform the method of any of claims 1 to 7.
10. A readable storage medium having stored therein execution instructions, which when executed by a processor, are configured to implement the method of any one of claims 1 to 7.
CN202111609947.9A 2021-12-27 2021-12-27 Text retrieval method, system, equipment and storage medium Active CN114003698B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111609947.9A CN114003698B (en) 2021-12-27 2021-12-27 Text retrieval method, system, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111609947.9A CN114003698B (en) 2021-12-27 2021-12-27 Text retrieval method, system, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114003698A CN114003698A (en) 2022-02-01
CN114003698B true CN114003698B (en) 2022-04-01

Family

ID=79932070

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111609947.9A Active CN114003698B (en) 2021-12-27 2021-12-27 Text retrieval method, system, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114003698B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114416930A (en) * 2022-02-09 2022-04-29 上海携旅信息技术有限公司 Text matching method, system, device and storage medium under search scene
CN114780709B (en) * 2022-03-22 2023-04-07 北京三快在线科技有限公司 Text matching method and device and electronic equipment
CN114817494B (en) * 2022-04-02 2024-06-21 华南理工大学 Knowledge search type dialogue method based on pre-training and attention interaction network
CN114428850B (en) * 2022-04-07 2022-08-05 之江实验室 Text retrieval matching method and system
CN114490950B (en) * 2022-04-07 2022-07-12 联通(广东)产业互联网有限公司 Method and storage medium for training encoder model, and method and system for predicting similarity
CN114742018A (en) * 2022-06-09 2022-07-12 成都晓多科技有限公司 Contrast learning level coding text clustering method and system based on confrontation training
CN115952852B (en) * 2022-12-20 2024-03-12 北京百度网讯科技有限公司 Model training method, text retrieval method, device, electronic equipment and medium
CN116384379A (en) * 2023-06-06 2023-07-04 天津大学 Chinese clinical term standardization method based on deep learning


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10580137B2 (en) * 2018-01-30 2020-03-03 International Business Machines Corporation Systems and methods for detecting an indication of malignancy in a sequence of anatomical images
US11734352B2 (en) * 2020-02-14 2023-08-22 Naver Corporation Cross-modal search systems and methods
CN111340907A (en) * 2020-03-03 2020-06-26 曲阜师范大学 Text-to-image generation method of self-adaptive attribute and instance mask embedded graph

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287494A (en) * 2019-07-01 2019-09-27 济南浪潮高新科技投资发展有限公司 A method of the short text Similarity matching based on deep learning BERT algorithm
CN110377686A (en) * 2019-07-04 2019-10-25 浙江大学 A kind of address information Feature Extraction Method based on deep neural network model
CN111488474A (en) * 2020-03-21 2020-08-04 复旦大学 Fine-grained freehand sketch image retrieval method based on attention enhancement
CN111460303A (en) * 2020-03-31 2020-07-28 拉扎斯网络科技(上海)有限公司 Data processing method and device, electronic equipment and computer readable storage medium
CN111737507A (en) * 2020-06-23 2020-10-02 浪潮集团有限公司 Single-mode image Hash retrieval method
CN112001162A (en) * 2020-07-31 2020-11-27 银江股份有限公司 Intelligent judging system based on small sample learning
CN112632216A (en) * 2020-12-10 2021-04-09 深圳得理科技有限公司 Deep learning-based long text retrieval system and method
CN113486645A (en) * 2021-06-08 2021-10-08 浙江华巽科技有限公司 Text similarity detection method based on deep learning
CN113553824A (en) * 2021-07-07 2021-10-26 临沂中科好孕智能技术有限公司 Sentence vector model training method
CN113761935A (en) * 2021-08-04 2021-12-07 厦门快商通科技股份有限公司 Short text semantic similarity measurement method, system and device
CN113837576A (en) * 2021-09-14 2021-12-24 上海任意门科技有限公司 Method, computing device, and computer-readable storage medium for content recommendation

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Siamese ELECTRA Network Combined with BERT for Semantic Similarity; Yuan Zhu et al.; 2021 16th International Conference on Computer Science & Education; 2021-11-22; 481-485 *
Research on Short-Text Semantic Similarity Based on Time-Warping Distance; 李星; China Master's Theses Full-text Database, Information Science and Technology; 2020-12-15 (No. 12); I138-563 *
Research on Neural-Network-Based Methods for Computing Short-Text Semantic Similarity; 杨晨; China Master's Theses Full-text Database, Information Science and Technology; 2020-07-15 (No. 07); I138-1595 *
An Integrated Comparative Study of Text Augmentation and Pre-trained Language Models for Classifying Online Government Inquiry Messages; 施国良 et al.; Library and Information Service; 2021-07-05; Vol. 65, No. 13; 96-107 *

Also Published As

Publication number Publication date
CN114003698A (en) 2022-02-01

Similar Documents

Publication Publication Date Title
CN114003698B (en) Text retrieval method, system, equipment and storage medium
US11501182B2 (en) Method and apparatus for generating model
CN108268444B (en) Chinese word segmentation method based on bidirectional LSTM, CNN and CRF
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
CN111914067B (en) Chinese text matching method and system
CN111259127B (en) Long text answer selection method based on transfer learning sentence vector
CN112487820B (en) Chinese medical named entity recognition method
CN111159485B (en) Tail entity linking method, device, server and storage medium
CN106484682A (en) Based on the machine translation method of statistics, device and electronic equipment
CN109543017A (en) Legal issue keyword generation method and its system
CN113641809B (en) Intelligent question-answering method based on XLnet model and knowledge graph
CN115688879A (en) Intelligent customer service voice processing system and method based on knowledge graph
CN111368542A (en) Text language association extraction method and system based on recurrent neural network
CN114429132A (en) Named entity identification method and device based on mixed lattice self-attention network
CN113704416A (en) Word sense disambiguation method and device, electronic equipment and computer-readable storage medium
CN114781380A (en) Chinese named entity recognition method, equipment and medium fusing multi-granularity information
CN115496072A (en) Relation extraction method based on comparison learning
CN114579741B (en) GCN-RN aspect emotion analysis method and system for fusing syntax information
CN112784602A (en) News emotion entity extraction method based on remote supervision
CN114510946A (en) Chinese named entity recognition method and system based on deep neural network
CN108875024B (en) Text classification method and system, readable storage medium and electronic equipment
CN112926323B (en) Chinese named entity recognition method based on multistage residual convolution and attention mechanism
CN116720519B (en) Seedling medicine named entity identification method
CN114386425B (en) Big data system establishing method for processing natural language text content
CN114357166B (en) Text classification method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant