CN114003698B - Text retrieval method, system, equipment and storage medium - Google Patents

Text retrieval method, system, equipment and storage medium

Info

Publication number
CN114003698B
CN114003698B CN202111609947.9A
Authority
CN
China
Prior art keywords
encoder
model
training
text retrieval
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111609947.9A
Other languages
Chinese (zh)
Other versions
CN114003698A (en)
Inventor
郭湘
黄鹏
江岭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Xiaoduo Technology Co ltd
Original Assignee
Chengdu Xiaoduo Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Xiaoduo Technology Co ltd filed Critical Chengdu Xiaoduo Technology Co ltd
Priority to CN202111609947.9A priority Critical patent/CN114003698B/en
Publication of CN114003698A publication Critical patent/CN114003698A/en
Application granted granted Critical
Publication of CN114003698B publication Critical patent/CN114003698B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3347Query execution using vector based model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3334Selection or weighting of terms from queries, including natural language queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3335Syntactic pre-processing, e.g. stopword elimination, stemming
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3337Translation of the query language, e.g. Chinese to English
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3338Query expansion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a text retrieval method, system, device and storage medium, comprising the following steps: using a pre-trained language model as an encoder, and performing self-attention and mask processing on a batch of labeled similar sentence pairs through the encoder; pooling the final encoding and guiding training according to a cross-entropy loss function; constructing a positive sample x⁺ for an input x through data enhancement, and inputting x and x⁺ into the encoder to obtain representation vectors h and h⁺; calculating the similarity between each representation vector and the other vectors in the batch, ranking the candidate texts by similarity, and guiding iterative training of the network parameters with the final loss function; and performing text retrieval based on the trained model. By adding supervised training on labeled samples, the method enhances the generalization capability of the model; based on the attention-mask mechanism, the model gains the ability to reason about similar texts; and based on contrastive learning, the model gains text retrieval capability in an unsupervised manner.

Description

Text retrieval method, system, equipment and storage medium
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a text retrieval method, system, device and storage medium.
Background
On today's internet, search has become the main way people learn about the world, and at the core of search is judging the relevance between texts. Text similarity is conventionally determined by labeling similar text pairs and training models in a supervised manner, for example the conventional two-tower models DSSM and ESIM, but this approach has at least the following disadvantages: (1) labeling consumes a large amount of manpower and is not suitable for large-scale text retrieval tasks such as Baidu search; (2) out-of-domain data cannot be used, because its distribution does not match the labeled data; (3) the model is very difficult to update: natural language changes day by day, and whenever new words appear they must be labeled and the model retrained before text retrieval can be completed.
Disclosure of Invention
The invention aims to provide a text retrieval method, system, device and storage medium that require only a small amount of labeled data for supervised learning and perform unsupervised training on large data sets by means of contrastive learning, making them applicable to text retrieval across the whole domain and solving the problems identified in the background art.
The embodiment of the invention is realized by the following technical scheme: a text retrieval method comprises the following steps:
S1, using a pre-trained language model as an encoder, and performing self-attention and local mask processing on a batch of labeled similar sentence pairs through the encoder;
S2, performing maximum pooling and average pooling on the final encoding of the last layer of the network, splicing the two results, calculating a loss function of the encoder through a softmax function and cross entropy, and guiding the training of the encoder according to the loss function;
S3, given an input x, constructing a positive sample x⁺ by means of data enhancement, and inputting x and x⁺ into the encoder for fitting training to obtain two representation vectors h and h⁺;
S4, respectively calculating the similarity between each representation vector and the other vectors in the batch, ranking the candidate texts with the similarity as the retrieval matching score, calculating the final loss function of the model through a softmax function and cross entropy, and guiding iterative training of the network parameters according to the final loss function;
S5, performing text retrieval based on the trained model.
Further, the pre-trained language model adopts one of Bert, Roberta or tiny_Bert.
According to a preferred embodiment, the pre-trained language model adopts the Bert structure, wherein performing self-attention and local mask processing on a batch of labeled similar sentence pairs through the encoder comprises:

inputting the batch of labeled similar sentence pairs into Bert and performing local masking with the attention layer, expressed as:

$$\mathrm{Attention}(Q_l, K_l, V_l) = \mathrm{softmax}\!\left(\frac{Q_l K_l^{\top}}{\sqrt{d_k}} + M\right)V_l, \qquad Q_l = W_Q^{l}\,h^{l-1},\; K_l = W_K^{l}\,h^{l-1},\; V_l = W_V^{l}\,h^{l-1}$$

In the above formula, $Q_l$, $K_l$ and $V_l$ respectively represent the query, key and value vector sequences, $l$ denotes the network layer, $h^{l-1}$ denotes the output of the previous layer, $W_Q^{l}$, $W_K^{l}$ and $W_V^{l}$ denote trainable parameters, $Q_l K_l^{\top}$ denotes the pairwise product of the $Q$ and $K$ vectors, $d_k$ denotes the encoding dimension, $\mathrm{softmax}$ denotes the normalization function, $M$ denotes the mask matrix indicating whether a position is masked, and $V_l$ is the value input of the attention operation at network layer $l$.
According to a preferred embodiment, the expression for performing maximum pooling and average pooling on the final encoding of the last layer of the network and splicing the two results is as follows:

$$h = W\,[\,\mathrm{maxpool}(H_L)\,;\,\mathrm{avgpool}(H_L)\,]$$

In the above formula, $W$ denotes the weight parameter, $H_L$ denotes the final encoding of the last layer of the network, $\mathrm{maxpool}$ denotes the maximum pooling operation, and $\mathrm{avgpool}$ denotes the average pooling operation.

The expression of the loss function of the encoder calculated through the softmax function and cross entropy is as follows:

$$L_{ce} = \mathrm{CrossEntropy}\big(\mathrm{softmax}(h),\, y\big)$$

In the above formula, $L_{ce}$ denotes the cross-entropy loss function and $y$ denotes the true label corresponding to the similar sentence pair.
Further, the data enhancement mode includes but is not limited to synonym replacement, sentence truncation, back-translation, punctuation addition, deletion of unimportant words and word-order rearrangement.
According to a preferred embodiment, the data enhancement adopts synonym replacement, wherein constructing the positive sample x⁺ by means of data enhancement comprises:

performing word segmentation on x, and selecting the segmented words that appear in the synonym data set to form the set S;

generating a random number n smaller than the length of the set S, and performing synonym replacement by sampling uniformly from S, expressed as:

$$x^{+} = \mathrm{Replace}\big(x,\ \mathrm{Uniform}(S, n)\big), \qquad n < \mathrm{len}(S)$$

In the above formula, $\mathrm{len}(S)$ denotes the length of the set S.
According to a preferred embodiment, the final loss function of the model is calculated through a softmax function and cross entropy, and the obtained final loss function is expressed as:

$$L(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\,\mathrm{sim}(h_i,\,h_i^{+})/\tau}}{\sum_{j=1}^{N}\Big(e^{\,\mathrm{sim}(h_i,\,h_j^{+})/\tau} + \big(1-\mathbb{1}_{[i=j]}\big)\,e^{\,\mathrm{sim}(h_i,\,h_j)/\tau}\Big)}$$

In the above formula, $\mathbb{1}_{[i=j]}$ denotes the indicator that equals 1 only when $i = j$, $N$ denotes the total number of sentences in the batch, $\tau$ denotes the temperature hyperparameter, $\mathrm{sim}(h_i, h_i^{+})$, $\mathrm{sim}(h_i, h_j^{+})$ and $\mathrm{sim}(h_i, h_j)$ denote cosine similarities, $h_j^{+}$ denotes the representation vector of the positive sample $x_j^{+}$ of $x_j$, $h_j$ denotes the representation vector of the in-batch negative sample $x_j$ of $x_i$, and $\theta$ denotes the training parameters of the model.
The invention also provides a text retrieval system applied to the above method, comprising:

an encoder fine-tuning module, configured to use a pre-trained language model as the encoder and perform self-attention and local mask processing on a batch of labeled similar sentence pairs through the encoder; perform maximum pooling and average pooling on the final encoding of the last layer of the network, splice the two results, calculate a loss function of the encoder through a softmax function and cross entropy, and guide the training of the encoder according to the loss function;

a contrastive learning module, configured to, given an input x, construct a positive sample x⁺ by means of data enhancement, and input x and x⁺ into the encoder for fitting training to obtain two representation vectors h and h⁺; respectively calculate the similarity between each representation vector and the other vectors in the batch, rank the candidate texts with the similarity as the retrieval matching score, calculate the final loss function of the model through a softmax function and cross entropy, and guide iterative training of the network parameters according to the final loss function;

and a text retrieval module, configured to perform text retrieval based on the trained model.
The present invention also provides an electronic device comprising:
a memory storing execution instructions; and
a processor executing execution instructions stored by the memory to cause the processor to perform the method as described above.
The invention also provides a readable storage medium having stored therein executable instructions, which when executed by a processor, are adapted to implement the method as described above.
The technical scheme of the embodiment of the invention has at least the following advantages and beneficial effects: by adding supervised training on labeled samples, the generalization capability of the model is enhanced while using only a small amount of labeled data; based on the attention-mask mechanism of the encoder, the model has similar-text reasoning capability and can predict vector representations of similar texts; by adding synonym replacement, the model improves its scoring of synonym similarity and thereby its text matching capability; and the method obtains higher similarity scores for similar sentences of different lengths.
Drawings
Fig. 1 is a schematic flowchart of a text retrieval method according to embodiment 1 of the present invention;
FIG. 2 is a diagram of a model framework provided in embodiment 1 of the present invention;
FIG. 3 is a schematic diagram of an Attention Mask provided in example 1 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Example 1
Through the applicant's research, the conventional two-tower models DSSM, ESIM and the like are mainly characterized by labeling similar text pairs so that the models can be trained in a supervised manner, but this approach has at least the following disadvantages: (1) labeling consumes a large amount of manpower and is not suitable for large-scale retrieval tasks such as Baidu search; (2) out-of-domain data cannot be used, because its distribution does not match the labeled data; (3) the model is very difficult to update: natural language changes day by day, and whenever new words appear they must be labeled and the model retrained before text retrieval can be completed.
Therefore, the embodiment of the invention provides a text retrieval method based on an attention mask and a contrastive-learning pre-trained model: only a small amount of labeled data is needed for supervised learning, and unsupervised training is carried out on large data sets by means of contrastive learning, making the method applicable to text retrieval across the whole domain and solving the problems identified in the background art. The specific scheme is as follows:
referring to fig. 1, the text retrieval method based on the attention mask and the contrast learning pre-training model mainly includes two steps, an encoder fine tuning step and a contrast learning step, which are described in detail below.
Firstly, an encoder fine tuning step is executed:
using a pre-trained language model as an encoder, and performing self-attention and local mask processing on a batch of labeled similar sentence pairs through the encoder; it should be noted that, in the present application, by adding supervised training with labeled samples, the generalization capability of the model can be enhanced while using only a small amount of labeled data.
In an implementation manner of this embodiment, the pre-trained language model adopts the Bert structure; alternatively, the pre-trained language model may also adopt Roberta or tiny_Bert, which is not described in detail again.
This embodiment is mainly illustrated with the Transformer-based Bert, wherein performing self-attention and local mask processing on a batch of labeled similar sentence pairs through the encoder specifically includes:
referring to fig. 2, a batch of similar sentences after label processing is input into Bert, and local masking is performed by using an attention layer, and the expression is as follows:
Figure 918118DEST_PATH_IMAGE058
in the above formula, the first and second carbon atoms are,
Figure DEST_PATH_IMAGE059
wherein
Figure 887473DEST_PATH_IMAGE060
Respectively represent
Figure DEST_PATH_IMAGE061
The sequence of vectors of (a) is,
Figure 592124DEST_PATH_IMAGE062
the representation of the network layer is shown,
Figure DEST_PATH_IMAGE063
the output of the upper layer is represented as,
Figure 133964DEST_PATH_IMAGE064
a representation of a trainable parameter is provided that,
Figure 367499DEST_PATH_IMAGE065
to represent
Figure 729210DEST_PATH_IMAGE066
The vectors of (A) are multiplied by two in pairs,
Figure 706393DEST_PATH_IMAGE067
the number of dimensions representing the encoding is indicated,
Figure 837161DEST_PATH_IMAGE068
the expression of the normalization function is used,
Figure 241597DEST_PATH_IMAGE069
indicates whether to proceed or not
Figure 90604DEST_PATH_IMAGE070
Figure 871479DEST_PATH_IMAGE071
Finger-shaped
Figure 856752DEST_PATH_IMAGE072
The matrix is a matrix of a plurality of matrices,V l to represent
Figure 432090DEST_PATH_IMAGE059
At the network layerlWhen performing attention operation, inputV
The Attention Mask gives the model Seq2Seq capability, as illustrated below. The input sentence and the target sentence are spliced into a single sequence — [CLS] input [SEP] target [SEP] — and the Attention Mask of FIG. 3 is then applied; it should be noted that the tokens of the [CLS] input [SEP] segment attend to each other bidirectionally, while the tokens of the target segment attend in one direction only, which allows the target segment and its closing [SEP] to be predicted recursively, token by token.
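To make this concrete, the following is a minimal Python/PyTorch sketch (not part of the patent) of how such a partial attention mask could be built for a "[CLS] input [SEP] target [SEP]" sequence. The function name and the additive 0/−inf mask convention are illustrative assumptions; the exact mask layout used in the embodiment may differ.

```python
import torch

def build_partial_attention_mask(input_len: int, target_len: int) -> torch.Tensor:
    """Illustrative mask for a sequence laid out as [CLS] input [SEP] | target [SEP].

    Positions of the input segment (including [CLS] and its [SEP]) attend
    bidirectionally within the input segment; positions of the target segment
    attend to the whole input segment plus the target tokens to their left,
    which allows the target to be generated token by token.
    Returns a (seq_len, seq_len) matrix with 0 = attend and -inf = masked,
    suitable for adding to the attention scores before the softmax.
    """
    seg_a = input_len + 2          # [CLS] + input tokens + [SEP]
    seg_b = target_len + 1         # target tokens + [SEP]
    seq_len = seg_a + seg_b
    mask = torch.full((seq_len, seq_len), float("-inf"))
    mask[:seg_a, :seg_a] = 0.0     # input segment: bidirectional within itself
    mask[seg_a:, :seg_a] = 0.0     # target segment: sees the full input segment ...
    for i in range(seg_a, seq_len):
        mask[i, seg_a:i + 1] = 0.0 # ... and only target tokens up to itself
    return mask

# Example: 6 input tokens and 6 target tokens
m = build_partial_attention_mask(6, 6)
print(m.shape)  # torch.Size([15, 15])
```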
Further, pooling calculation is performed on the last layer of Bert, which specifically includes: performing maximum pooling and average pooling on the final encoding of the last layer of the network and splicing the two results, expressed as:

$$h = W\,[\,\mathrm{maxpool}(H_L)\,;\,\mathrm{avgpool}(H_L)\,]$$

In the above formula, $W$ denotes the weight parameter, $H_L$ denotes the final encoding of the last layer of the network, $\mathrm{maxpool}$ denotes the maximum pooling operation, and $\mathrm{avgpool}$ denotes the average pooling operation.

Further, the loss function of the encoder is calculated through the softmax function and cross entropy, expressed as:

$$L_{ce} = \mathrm{CrossEntropy}\big(\mathrm{softmax}(h),\, y\big)$$

In the above formula, $L_{ce}$ denotes the cross-entropy loss function and $y$ denotes the true label corresponding to the similar sentence pair.
Back-propagation is performed according to the loss function to guide the training of the encoder, and through this training an encoder with a certain similar-text matching capability is obtained. Based on the attention-mask mechanism of the encoder, the model has similar-text reasoning capability and can predict vector representations of similar texts.
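As an illustration only, the following Python/PyTorch sketch shows one way the pooling head and cross-entropy loss described above could be implemented; the class name, hidden size and two-class label layout are assumptions rather than the patent's reference code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PooledClassificationHead(nn.Module):
    """Max-pool and mean-pool the last-layer token encodings H, splice the two
    results, project with a weight W, and compute a softmax cross-entropy loss
    against the similar-sentence-pair labels."""

    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.W = nn.Linear(2 * hidden_size, num_labels)

    def forward(self, H: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # H: (batch, seq_len, hidden) — final encoding of the last layer
        h_max = H.max(dim=1).values        # maximum pooling over tokens
        h_avg = H.mean(dim=1)              # average pooling over tokens
        h = torch.cat([h_max, h_avg], dim=-1)  # splice the two results
        logits = self.W(h)
        # F.cross_entropy applies log-softmax internally
        return F.cross_entropy(logits, labels)

# Toy usage with random tensors standing in for the encoder output
head = PooledClassificationHead(hidden_size=768, num_labels=2)
H = torch.randn(4, 32, 768)                # batch of 4 sentence pairs
y = torch.tensor([1, 0, 1, 1])             # similar / not-similar labels
loss = head(H, y)
loss.backward()                            # back-propagate to guide encoder training
```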
After the supervised training of the encoder is completed, a contrastive learning step is further performed:
the encoder with similar text matching capability is obtained through the encoder fine tuning step, and in order to perform large-scale similar text matching, a comparison learning mode is further used on the encoder.
First, given an input x, a positive sample x⁺ is constructed by means of data enhancement. The data enhancement mode includes but is not limited to synonym replacement, sentence truncation, back-translation, punctuation addition, deletion of unimportant words and word-order rearrangement; this embodiment describes in detail the data enhancement method of synonym replacement, which includes the following sub-steps:

performing word segmentation on x, and selecting from the segmented words those that appear in the synonym data set to form the set S. It should be noted that the synonym data set in this embodiment is formed by combining the HIT (Harbin Institute of Technology) synonym lexicon with self-made data.
Further, a random number n smaller than the length of the set S is generated, and synonym replacement is performed by sampling uniformly from S, expressed as:

$$x^{+} = \mathrm{Replace}\big(x,\ \mathrm{Uniform}(S, n)\big), \qquad n < \mathrm{len}(S)$$

In the above formula, $\mathrm{len}(S)$ denotes the length of the set S.
The selected words are then replaced with synonyms drawn at random from the synonym data set, which completes the data enhancement and constructs the positive sample x⁺. It should be noted that, by adding synonym replacement, the model improves its scoring of synonym similarity and thereby its text matching capability.
Further, unsupervised contrastive learning is carried out, with positive and negative examples constructed in the following manner: x and x⁺ are input into the encoder for fitting training to obtain two representation vectors h and h⁺, which form a positive example pair; another input x_j randomly sampled within the batch serves as a negative example of x.
Further, the similarity between each representation vector and the other vectors in the batch is calculated. In this embodiment cosine similarity is adopted; similarity based on Euclidean distance, Jaccard similarity or the like may also be used, and cosine similarity is taken as the example for explanation. The candidate texts are ranked with the computed cosine similarity as the retrieval matching score, and the final loss function of the model is calculated through a softmax function and cross entropy, the obtained final loss function being expressed as:
$$L(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\,\mathrm{sim}(h_i,\,h_i^{+})/\tau}}{\sum_{j=1}^{N}\Big(e^{\,\mathrm{sim}(h_i,\,h_j^{+})/\tau} + \big(1-\mathbb{1}_{[i=j]}\big)\,e^{\,\mathrm{sim}(h_i,\,h_j)/\tau}\Big)}$$

In the above formula, $\mathbb{1}_{[i=j]}$ denotes the indicator that equals 1 only when $i = j$, $N$ denotes the total number of sentences in the batch, $\tau$ denotes the temperature hyperparameter, $\mathrm{sim}(h_i, h_i^{+})$, $\mathrm{sim}(h_i, h_j^{+})$ and $\mathrm{sim}(h_i, h_j)$ denote cosine similarities, $h_j^{+}$ denotes the representation vector of the positive sample $x_j^{+}$ of $x_j$, $h_j$ denotes the representation vector of the in-batch negative sample $x_j$ of $x_i$, and $\theta$ denotes the training parameters of the model.
Further, iterative training of the network parameters is guided according to the final loss function. In this embodiment, the temperature hyperparameter τ is set to 0.05 and N (the total number of sentences in the batch) is set to 128; experiments show that with this setting the model can be fitted in only one epoch, and the method obtains higher similarity scores for similar sentences of different lengths.
Further, text retrieval is performed based on the trained model.
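As an illustrative sketch of this retrieval step (assuming the candidate texts have already been encoded offline with the trained encoder), cosine similarity can serve as the retrieval matching score as follows; the function and variable names are hypothetical.

```python
import torch
import torch.nn.functional as F

def retrieve(query_vec: torch.Tensor, corpus_vecs: torch.Tensor, top_k: int = 5):
    """Rank candidate texts by cosine similarity to the query.

    query_vec:   (d,)   representation of the query from the trained encoder
    corpus_vecs: (M, d) pre-computed representations of the candidate texts
    Returns the indices and scores of the top-k matches.
    """
    q = F.normalize(query_vec, dim=-1)
    c = F.normalize(corpus_vecs, dim=-1)
    scores = c @ q                          # cosine similarity as the matching score
    top = torch.topk(scores, k=min(top_k, scores.numel()))
    return top.indices.tolist(), top.values.tolist()

# Toy usage with random vectors standing in for encoder outputs
corpus = torch.randn(1000, 768)
query = torch.randn(768)
idx, score = retrieve(query, corpus, top_k=3)
print(idx, score)
```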
The embodiment of the invention also provides a text retrieval system based on the attention mask and the contrastive-learning pre-trained model, applied to the above method and comprising:

an encoder fine-tuning module, configured to use a pre-trained language model as the encoder and perform self-attention and local mask processing on a batch of labeled similar sentence pairs through the encoder; perform maximum pooling and average pooling on the final encoding of the last layer of the network, splice the two results, calculate a loss function of the encoder through a softmax function and cross entropy, and guide the training of the encoder according to the loss function;

a contrastive learning module, configured to, given an input x, construct a positive sample x⁺ by means of data enhancement, and input x and x⁺ into the encoder for fitting training to obtain two representation vectors h and h⁺; respectively calculate the cosine similarity between each representation vector and the other vectors in the batch, rank the candidate texts with the cosine similarity as the retrieval matching score, calculate the final loss function of the model through a softmax function and cross entropy, and guide iterative training of the network parameters according to the final loss function;

and a text retrieval module, configured to perform text retrieval based on the trained model.
An embodiment of the present invention further provides an electronic device, including:
a memory storing execution instructions; and
a processor executing execution instructions stored by the memory to cause the processor to perform the method as described above.
The embodiment of the present invention further provides a readable storage medium, in which an execution instruction is stored, and the execution instruction is used for implementing the method described above when executed by a processor.
In summary, by adding supervised training on labeled samples, the method and device enhance the generalization capability of the model while using only a small amount of labeled data; based on the attention-mask mechanism of the encoder, the model has similar-text reasoning capability and can predict vector representations of similar texts; by adding synonym replacement, the model improves its scoring of synonym similarity and thereby its text matching capability; and the method obtains higher similarity scores for similar sentences of different lengths.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A text retrieval method, characterized by comprising the following steps:
S1, using a pre-trained language model with a Bert structure as an encoder, and performing self-attention and local mask processing on a batch of labeled similar sentence pairs through the encoder, specifically comprising: inputting the batch of labeled similar sentence pairs into Bert and performing local masking with the attention layer;
S2, performing maximum pooling and average pooling on the final encoding of the last layer of the network, splicing the two results, calculating a loss function of the encoder through a softmax function and cross entropy, and guiding the training of the encoder according to the loss function;
S3, given an input x, constructing a positive sample x⁺ by means of data enhancement, and inputting x and x⁺ into the encoder for fitting training to obtain two representation vectors h and h⁺;
S4, respectively calculating the similarity between each representation vector and the other vectors in the batch, ranking the candidate texts with the similarity as the retrieval matching score, calculating the final loss function of the model through a softmax function and cross entropy, and guiding iterative training of the network parameters according to the final loss function;
S5, performing text retrieval based on the trained model.
2. The text retrieval method of claim 1, wherein the pre-trained language model employs one of Bert, Roberta or tiny_Bert.
3. The text retrieval method of claim 1, wherein the batch of labeled similar sentence pairs is input into Bert and local masking is performed by the attention layer, expressed as:

$$\mathrm{Attention}(Q_l, K_l, V_l) = \mathrm{softmax}\!\left(\frac{Q_l K_l^{\top}}{\sqrt{d_k}} + M\right)V_l, \qquad Q_l = W_Q^{l}\,h^{l-1},\; K_l = W_K^{l}\,h^{l-1},\; V_l = W_V^{l}\,h^{l-1}$$

wherein $Q_l$, $K_l$ and $V_l$ respectively represent the query, key and value vector sequences, $l$ denotes the network layer, $h^{l-1}$ denotes the output of the previous layer, $W_Q^{l}$, $W_K^{l}$ and $W_V^{l}$ denote trainable parameters, $Q_l K_l^{\top}$ denotes the pairwise product of the $Q$ and $K$ vectors, $d_k$ denotes the encoding dimension, $\mathrm{softmax}$ denotes the normalization function, $M$ denotes the mask matrix indicating whether a position is masked, and $V_l$ is the value input of the attention operation at network layer $l$.
4. The text retrieval method of claim 1, wherein the expression for performing maximum pooling and average pooling on the final encoding of the last layer of the network and splicing the two results is as follows:

$$h = W\,[\,\mathrm{maxpool}(H_L)\,;\,\mathrm{avgpool}(H_L)\,]$$

wherein $W$ denotes the weight parameter, $H_L$ denotes the final encoding of the last layer of the network, $\mathrm{maxpool}$ denotes the maximum pooling operation, and $\mathrm{avgpool}$ denotes the average pooling operation;

and the expression of the loss function of the encoder calculated through the softmax function and cross entropy is as follows:

$$L_{ce} = \mathrm{CrossEntropy}\big(\mathrm{softmax}(h),\, y\big)$$

wherein $L_{ce}$ denotes the cross-entropy loss function and $y$ denotes the true label corresponding to the similar sentence pair.
5. The method of claim 1, wherein the data enhancement mode includes but is not limited to synonym replacement, sentence truncation, back-translation, punctuation addition, deletion of unimportant words and word-order rearrangement.
6. The text retrieval method of claim 1, wherein the data enhancement adopts synonym replacement, and wherein constructing the positive sample x⁺ by means of data enhancement comprises:
performing word segmentation on x, and selecting the segmented words that appear in the synonym data set to form the set S;
generating a random number n smaller than the length of the set S, and performing synonym replacement by sampling uniformly from S, expressed as:

$$x^{+} = \mathrm{Replace}\big(x,\ \mathrm{Uniform}(S, n)\big), \qquad n < \mathrm{len}(S)$$

wherein $\mathrm{len}(S)$ denotes the length of the set S.
7. The text retrieval method of claim 1, wherein the final loss function of the model calculated through the softmax function and cross entropy is expressed as:

$$L(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\,\mathrm{sim}(h_i,\,h_i^{+})/\tau}}{\sum_{j=1}^{N}\Big(e^{\,\mathrm{sim}(h_i,\,h_j^{+})/\tau} + \big(1-\mathbb{1}_{[i=j]}\big)\,e^{\,\mathrm{sim}(h_i,\,h_j)/\tau}\Big)}$$

wherein $\mathbb{1}_{[i=j]}$ denotes the indicator that equals 1 only when $i = j$, $N$ denotes the total number of sentences in the batch, $\tau$ denotes the temperature hyperparameter, $\mathrm{sim}(h_i, h_i^{+})$, $\mathrm{sim}(h_i, h_j^{+})$ and $\mathrm{sim}(h_i, h_j)$ denote cosine similarities, $h_j^{+}$ denotes the representation vector of the positive sample $x_j^{+}$ of $x_j$, $h_j$ denotes the representation vector of the in-batch negative sample $x_j$ of $x_i$, and $\theta$ denotes the training parameters of the model.
8. A text retrieval system applied to the method of any one of claims 1 to 7, comprising:
an encoder fine-tuning module, configured to use a pre-trained language model with a Bert structure as the encoder and perform self-attention and local mask processing on a batch of labeled similar sentence pairs through the encoder, specifically comprising: inputting the batch of labeled similar sentence pairs into Bert and performing local masking with the attention layer; performing maximum pooling and average pooling on the final encoding of the last layer of the network, splicing the two results, calculating a loss function of the encoder through a softmax function and cross entropy, and guiding the training of the encoder according to the loss function;
a contrastive learning module, configured to, given an input x, construct a positive sample x⁺ by means of data enhancement, input x and x⁺ into the encoder for fitting training to obtain two representation vectors h and h⁺, respectively calculate the similarity between each representation vector and the other vectors in the batch, rank the candidate texts with the similarity as the retrieval matching score, calculate the final loss function of the model through a softmax function and cross entropy, and guide iterative training of the network parameters according to the final loss function;
and a text retrieval module, configured to perform text retrieval based on the trained model.
9. An electronic device, comprising:
a memory storing execution instructions; and
a processor executing execution instructions stored by the memory to cause the processor to perform the method of any of claims 1 to 7.
10. A readable storage medium having stored therein execution instructions, which when executed by a processor, are configured to implement the method of any one of claims 1 to 7.
CN202111609947.9A 2021-12-27 2021-12-27 Text retrieval method, system, equipment and storage medium Active CN114003698B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111609947.9A CN114003698B (en) 2021-12-27 2021-12-27 Text retrieval method, system, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111609947.9A CN114003698B (en) 2021-12-27 2021-12-27 Text retrieval method, system, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114003698A CN114003698A (en) 2022-02-01
CN114003698B true CN114003698B (en) 2022-04-01

Family

ID=79932070

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111609947.9A Active CN114003698B (en) 2021-12-27 2021-12-27 Text retrieval method, system, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114003698B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114416930A (en) * 2022-02-09 2022-04-29 上海携旅信息技术有限公司 Text matching method, system, device and storage medium under search scene
CN114780709B (en) * 2022-03-22 2023-04-07 北京三快在线科技有限公司 Text matching method and device and electronic equipment
CN114817494B (en) * 2022-04-02 2024-06-21 华南理工大学 Knowledge search type dialogue method based on pre-training and attention interaction network
CN114428850B (en) * 2022-04-07 2022-08-05 之江实验室 Text retrieval matching method and system
CN114490950B (en) * 2022-04-07 2022-07-12 联通(广东)产业互联网有限公司 Method and storage medium for training encoder model, and method and system for predicting similarity
CN114742018A (en) * 2022-06-09 2022-07-12 成都晓多科技有限公司 Contrast learning level coding text clustering method and system based on confrontation training
CN115952852B (en) * 2022-12-20 2024-03-12 北京百度网讯科技有限公司 Model training method, text retrieval method, device, electronic equipment and medium
CN116384379A (en) * 2023-06-06 2023-07-04 天津大学 Chinese clinical term standardization method based on deep learning


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10580137B2 (en) * 2018-01-30 2020-03-03 International Business Machines Corporation Systems and methods for detecting an indication of malignancy in a sequence of anatomical images
US11734352B2 (en) * 2020-02-14 2023-08-22 Naver Corporation Cross-modal search systems and methods
CN111340907A (en) * 2020-03-03 2020-06-26 曲阜师范大学 Text-to-image generation method of self-adaptive attribute and instance mask embedded graph

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287494A (en) * 2019-07-01 2019-09-27 济南浪潮高新科技投资发展有限公司 A method of the short text Similarity matching based on deep learning BERT algorithm
CN110377686A (en) * 2019-07-04 2019-10-25 浙江大学 A kind of address information Feature Extraction Method based on deep neural network model
CN111488474A (en) * 2020-03-21 2020-08-04 复旦大学 Fine-grained freehand sketch image retrieval method based on attention enhancement
CN111460303A (en) * 2020-03-31 2020-07-28 拉扎斯网络科技(上海)有限公司 Data processing method and device, electronic equipment and computer readable storage medium
CN111737507A (en) * 2020-06-23 2020-10-02 浪潮集团有限公司 Single-mode image Hash retrieval method
CN112001162A (en) * 2020-07-31 2020-11-27 银江股份有限公司 Intelligent judging system based on small sample learning
CN112632216A (en) * 2020-12-10 2021-04-09 深圳得理科技有限公司 Deep learning-based long text retrieval system and method
CN113486645A (en) * 2021-06-08 2021-10-08 浙江华巽科技有限公司 Text similarity detection method based on deep learning
CN113553824A (en) * 2021-07-07 2021-10-26 临沂中科好孕智能技术有限公司 Sentence vector model training method
CN113761935A (en) * 2021-08-04 2021-12-07 厦门快商通科技股份有限公司 Short text semantic similarity measurement method, system and device
CN113837576A (en) * 2021-09-14 2021-12-24 上海任意门科技有限公司 Method, computing device, and computer-readable storage medium for content recommendation

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Siamese ELECTRA Network Combined with BERT for Semantic Similarity; Yuan Zhu et al.; 2021 16th International Conference on Computer Science & Education; 2021-11-22; 481-485 *
Research on Short-Text Semantic Similarity Based on Time-Warping Distance; 李星; China Master's Theses Full-text Database, Information Science and Technology; 2020-12-15 (No. 12); I138-563 *
Research on Neural-Network-Based Methods for Computing Short-Text Semantic Similarity; 杨晨; China Master's Theses Full-text Database, Information Science and Technology; 2020-07-15 (No. 07); I138-1595 *
An Integrated Comparative Study of Text Augmentation and Pre-trained Language Models for Classifying Online Government Inquiry Messages; 施国良 et al.; Library and Information Service; 2021-07-05; Vol. 65, No. 13; 96-107 *

Also Published As

Publication number Publication date
CN114003698A (en) 2022-02-01

Similar Documents

Publication Publication Date Title
CN114003698B (en) Text retrieval method, system, equipment and storage medium
US11501182B2 (en) Method and apparatus for generating model
CN108268444B (en) Chinese word segmentation method based on bidirectional LSTM, CNN and CRF
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
CN111914067B (en) Chinese text matching method and system
CN111259127B (en) Long text answer selection method based on transfer learning sentence vector
CN112487820B (en) Chinese medical named entity recognition method
CN111159485B (en) Tail entity linking method, device, server and storage medium
CN106484682A (en) Based on the machine translation method of statistics, device and electronic equipment
CN109543017A (en) Legal issue keyword generation method and its system
CN113641809B (en) Intelligent question-answering method based on XLnet model and knowledge graph
CN115688879A (en) Intelligent customer service voice processing system and method based on knowledge graph
CN111368542A (en) Text language association extraction method and system based on recurrent neural network
CN114429132A (en) Named entity identification method and device based on mixed lattice self-attention network
CN113704416A (en) Word sense disambiguation method and device, electronic equipment and computer-readable storage medium
CN114781380A (en) Chinese named entity recognition method, equipment and medium fusing multi-granularity information
CN115496072A (en) Relation extraction method based on comparison learning
CN114579741B (en) GCN-RN aspect emotion analysis method and system for fusing syntax information
CN112784602A (en) News emotion entity extraction method based on remote supervision
CN114510946A (en) Chinese named entity recognition method and system based on deep neural network
CN108875024B (en) Text classification method and system, readable storage medium and electronic equipment
CN112926323B (en) Chinese named entity recognition method based on multistage residual convolution and attention mechanism
CN116720519B (en) Seedling medicine named entity identification method
CN114386425B (en) Big data system establishing method for processing natural language text content
CN114357166B (en) Text classification method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant