CN115168537B - Training method and device for semantic retrieval model, electronic equipment and storage medium - Google Patents

Training method and device for semantic retrieval model, electronic equipment and storage medium

Info

Publication number
CN115168537B
CN115168537B (application CN202210769033.7A)
Authority
CN
China
Prior art keywords
semantic retrieval
recall
target query
original semantic
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210769033.7A
Other languages
Chinese (zh)
Other versions
CN115168537A (en)
Inventor
曲瑛琪
王海峰
田浩
吴华
吴甜
刘璟
丁宇辰
邢毅然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210769033.7A priority Critical patent/CN115168537B/en
Publication of CN115168537A publication Critical patent/CN115168537A/en
Priority to JP2023037698A priority patent/JP2024006944A/en
Application granted granted Critical
Publication of CN115168537B publication Critical patent/CN115168537B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a training method and device for a semantic retrieval model, an electronic device and a storage medium, and relates to artificial intelligence fields such as machine learning and natural language processing. The specific implementation scheme is as follows: obtaining a target query sentence type corresponding to each original semantic retrieval model in at least two original semantic retrieval models, where the target query sentence type corresponding to an original semantic retrieval model is the query sentence type for which that model achieves the highest accuracy when processing various types of query sentences; obtaining a distillation data set based on the at least two original semantic retrieval models, the target query sentence type corresponding to each original semantic retrieval model, and a pre-established corpus; and training the target semantic retrieval model based on the distillation data set. This technology enables the trained target semantic retrieval model to fuse the retrieval capabilities of the at least two original semantic retrieval models, overcomes the shortcomings of any single semantic retrieval model, and improves the accuracy of semantic retrieval.

Description

Training method and device for semantic retrieval model, electronic equipment and storage medium
Technical Field
The disclosure relates to the field of computer technology, in particular to artificial intelligence fields such as machine learning and natural language processing, and specifically to a training method and device for a semantic retrieval model, an electronic device and a storage medium.
Background
In the information age, people want to quickly find the information they need from massive books, web pages and documents. Recalling candidates from large-scale data and then re-ranking them to score the confidence of the recalled results has become the dominant mode of current information retrieval.
The recall phase of a retrieval task is generally performed in one of two ways: retrieval based on sparse vectors and retrieval based on dense vectors. In sparse-vector retrieval, the query sentence (query) and the candidate corpus are encoded into sparse vectors whose dimension is usually the size of a dictionary, and similarity is computed mainly through literal matching. Common algorithms include BM25, and the semantic retrieval model corresponding to sparse-vector retrieval is not learnable. This approach transfers well and is not limited to a specific field. In dense-vector retrieval, the query and the candidate corpus are each encoded by a corresponding semantic retrieval model into vectors in a semantic space, and similarity is computed on these vectors to recall relevant results. Here the semantic retrieval model must be trained on training data; it can judge the matching degree using semantic information, but its transferability is poor.
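To make the contrast concrete, the following is a minimal Python sketch (not part of the patent text) of the two recall modes: a dictionary-sized bag-of-words vector for sparse retrieval and a learned encoder producing semantic embeddings for dense retrieval. The helper names, the toy tokenization and the `encoder` callable are illustrative assumptions only.

```python
import numpy as np

def sparse_vector(text, dictionary):
    """Sparse retrieval: a bag-of-words vector whose dimension is the dictionary size,
    so similarity is driven by literal term overlap (as in BM25-style methods)."""
    tokens = text.lower().split()
    return np.array([tokens.count(word) for word in dictionary], dtype=float)

def dense_vector(text, encoder):
    """Dense retrieval: a trained encoder (e.g. a RocketQAv2/ColBERT-style model)
    maps the text into a fixed-size semantic embedding."""
    return np.asarray(encoder(text), dtype=float)

def recall_top_k(query_vec, corpus_vecs, k=3):
    """Recall the k candidate corpora whose vectors score highest against the query vector."""
    scores = corpus_vecs @ query_vec          # dot-product similarity for either vector type
    top = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in top]
```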
Disclosure of Invention
The disclosure provides a training method and device of a semantic retrieval model, electronic equipment and a storage medium.
According to an aspect of the present disclosure, there is provided a training method of a semantic retrieval model, including:
obtaining a target query sentence type corresponding to each original semantic retrieval model in at least two original semantic retrieval models, wherein the target query sentence type corresponding to an original semantic retrieval model is the query sentence type for which that original semantic retrieval model achieves the highest accuracy when processing various types of query sentences;
obtaining a distillation data set based on at least two original semantic retrieval models, a target query statement type corresponding to each original semantic retrieval model and a pre-established corpus;
based on the distillation data set, training the target semantic retrieval model.
According to another aspect of the present disclosure, there is provided a training apparatus of a semantic retrieval model, including:
the type acquisition module is used for acquiring a target query sentence type corresponding to each original semantic retrieval model in at least two original semantic retrieval models, wherein the target query sentence type corresponding to an original semantic retrieval model is the query sentence type for which that original semantic retrieval model achieves the highest accuracy when processing various types of query sentences;
The data acquisition module is used for acquiring a distillation data set based on the at least two original semantic retrieval models, the target query sentence type corresponding to each original semantic retrieval model, and a pre-established corpus;
and the training module is used for training the target semantic retrieval model based on the distillation data set.
According to still another aspect of the present disclosure, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the aspect and any one of the possible implementations described above.
According to yet another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the method of the aspects and any possible implementation described above.
According to yet another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of the aspects and any one of the possible implementations described above.
According to the technology of the disclosure, the trained target semantic retrieval model can fuse the retrieval capabilities of at least two original semantic retrieval models, overcome the shortcomings of a single semantic retrieval model, and improve the accuracy of semantic retrieval.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a training method of the semantic retrieval model of the present embodiment;
FIG. 5 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 6 is a schematic diagram according to a fifth embodiment of the present disclosure;
FIG. 7 is a block diagram of an electronic device for implementing the methods of embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It will be apparent that the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments in this disclosure without inventive faculty, are intended to be within the scope of this disclosure.
It should be noted that, the terminal device in the embodiments of the present disclosure may include, but is not limited to, smart devices such as a mobile phone, a personal digital assistant (Personal Digital Assistant, PDA), a wireless handheld device, and a Tablet Computer (Tablet Computer); the display device may include, but is not limited to, a personal computer, a television, or the like having a display function.
In addition, the term "and/or" herein merely describes an association relationship between associated objects and means that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist together, or B exists alone. Furthermore, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
In the prior art, the sparse-vector-based retrieval mode and the dense-vector-based retrieval mode are usually used independently. However, the sparse-vector-based retrieval mode can only model literal matching, lacks semantic understanding of the content, and performs poorly; and using the dense-vector retrieval mode alone misses some literally matched information. In short, using either retrieval mode alone results in poor accuracy of semantic retrieval.
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure; as shown in fig. 1, the present embodiment provides a training method for a semantic retrieval model, which specifically includes the following steps:
s101, obtaining target query sentence types corresponding to each original semantic retrieval model in at least two original semantic retrieval models;
the target query sentence type corresponding to an original semantic retrieval model is the query sentence type for which that original semantic retrieval model achieves the highest accuracy when processing various types of query sentences.
S102, obtaining a distillation data set based on at least two original semantic retrieval models, target query statement types corresponding to the original semantic retrieval models and a pre-established corpus;
s103, training a target semantic retrieval model based on the distillation data set.
The execution subject of the training method of the semantic retrieval model of this embodiment may be a training device of the semantic retrieval model. The device may be an electronic entity, or it may be a software-integrated application that, when used, runs on computer equipment to train the semantic retrieval model.
The at least two original semantic retrieval models of this embodiment may include a sparse-vector-based semantic retrieval model and a dense-vector-based semantic retrieval model. The sparse-vector-based semantic retrieval model encodes the query sentence (query) and/or the candidate corpus into word-level vectors, for example based on a dictionary. The dense-vector-based semantic retrieval model uses a pre-trained neural network model to encode the query sentence and the candidate corpus into semantic-level vectors. For example, the dense-vector-based semantic retrieval model of this embodiment may be implemented based on RocketQAv2, ColBERT, Phrase-BERT, COIL or similar models.
That is, the at least two original semantic retrieval models of this embodiment may include at least two of the BM25, RocketQAv2, ColBERT, Phrase-BERT and COIL models, among others. Moreover, the original semantic retrieval models used in this embodiment are known or already trained models.
Since each original semantic retrieval model handles different types of query sentences with different accuracy, its accuracy varies by type: for the query sentence types it is good at processing, the original semantic retrieval model is highly accurate, whereas for the types it is not good at processing, its accuracy is lower. Based on this, when selecting the target query sentence type corresponding to each original semantic retrieval model in this embodiment, the corresponding target type can be obtained from the accuracy of each original semantic retrieval model when processing query sentences of the various types; for example, the type with the highest accuracy may be selected as the target query sentence type. Optionally, one, two or more target query sentence types may be obtained for each original semantic retrieval model according to actual requirements. For example, if only one target query sentence type were obtained, the pool of selectable target query sentences might be too small, so several target query sentence types may be obtained instead. The target query sentence types can also be understood as the types of query sentences the original semantic retrieval model is good at processing, so distillation data acquired on the basis of these types reflect the characteristics of the corresponding original semantic retrieval model, and the target semantic retrieval model can in turn learn those characteristics when trained on the distillation data.
The target query statement type of the present embodiment may be an address class, a find answer class, or a find resource class. Or in practical application, the query statement types can be divided according to the fields or scenes and the like, so that the corresponding target query statement types can be obtained.
In this embodiment, a distillation data set can be obtained based on the at least two original semantic retrieval models, the target query sentence type corresponding to each original semantic retrieval model, and a pre-established corpus. Because the distillation data set is screened out from at least two original semantic retrieval models, it can accommodate the characteristics of each of them. Training the target semantic retrieval model on this distillation data set then allows the target model to fuse the characteristics of each original semantic retrieval model, overcome the inaccuracy of a single semantic retrieval model, perform semantic retrieval more accurately, and improve the accuracy of recall results.
According to the training method of the semantic retrieval model of this embodiment, a distillation data set is obtained based on the at least two original semantic retrieval models, the target query sentence type corresponding to each of them, and a pre-established corpus, and the target semantic retrieval model is then trained on that distillation data set. The trained target semantic retrieval model can thus integrate the retrieval capabilities of the at least two original semantic retrieval models, overcoming the shortcomings of a single semantic retrieval model and improving the accuracy of semantic retrieval.
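Read end to end, steps S101-S103 can be summarized by the following hedged Python sketch. `best_query_type`, `sample_queries` and `make_distill_records` are hypothetical helpers standing in for the procedures that the later embodiments (FIG. 2 and FIG. 3) spell out, and `top_k=100` is only an example of the preset number.

```python
def train_target_model(original_models, corpus, target_model):
    # S101: for each original model, find the query sentence type(s) it handles most accurately
    target_types = {model: best_query_type(model) for model in original_models}

    # S102: build the distillation data set from the models, their target types and the corpus
    distillation_set = []
    for model in original_models:
        for query in sample_queries(target_types[model]):
            recalled = model.recall(query, corpus, top_k=100)  # preset number of recalled corpora
            distillation_set.extend(make_distill_records(query, recalled))

    # S103: train the target semantic retrieval model on the distillation data set
    target_model.fit(distillation_set)
    return target_model
```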
Fig. 2 is a schematic diagram according to a second embodiment of the present disclosure. The training method of the semantic search model of the present embodiment further describes the technical solution of the present disclosure in more detail on the basis of the technical solution of the embodiment shown in fig. 1. As shown in fig. 2, the training method of the semantic search model of the present embodiment specifically may include the following steps:
s201, obtaining target query sentence types corresponding to each original semantic retrieval model in at least two original semantic retrieval models;
s202, acquiring target query sentences corresponding to the original semantic retrieval models based on the target query sentence types corresponding to the original semantic retrieval models;
s203, recalling a preset number of recalled corpora from the corpus based on each original semantic retrieval model and target query sentences corresponding to each original semantic retrieval model;
s204, generating a distillation data set based on target query sentences corresponding to the original semantic retrieval models and a preset number of recall corpora corresponding to the recall;
In order that the generated distillation data in the distillation data set reflect the characteristics of each original semantic retrieval model more accurately, this embodiment selects, as far as possible, query sentences of the types each original semantic retrieval model is good at processing as the target query sentences used for corpus recall. The target query sentences and the corpora recalled for them then reflect the retrieval capability and characteristics of the corresponding semantic retrieval model. Therefore, in this embodiment the distillation data set is generated from the target query sentences corresponding to each original semantic retrieval model and the preset number of recalled corpora.
In this embodiment, step S201 obtains the target query statement type corresponding to each original semantic search model, and may include any of the following modes when in specific implementation:
the first mode is to acquire target query statement types corresponding to each original semantic retrieval model based on a test set corresponding to various types of query statements established in advance;
in a first approach, corresponding test sets may be pre-established based on different types of query statements. For example, different types of query statements may include an address class, a find answer class, or a find resource class. Or the types of query statements may be divided by domain or scene.
In this manner, test sets corresponding to the various types of query sentences may be used to measure the accuracy of each original semantic retrieval model. If the accuracy on a type is greater than a preset accuracy threshold, such as 90%, 95% or another ratio, that query sentence type may be used as a target query sentence type corresponding to the original semantic retrieval model. Alternatively, the query sentence types can be sorted in descending order of accuracy and the type with the highest accuracy taken as the target query sentence type. In this embodiment, the target query sentence types corresponding to each original semantic retrieval model may include one, two or more types, which is not limited here.
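As a sketch of this first mode, assuming a hypothetical `evaluate_accuracy(model, test_set)` helper that returns retrieval accuracy on one type-specific test set, the selection might look as follows; the 0.9 threshold and the cap of two types are example values only.

```python
def target_query_types(model, test_sets, accuracy_threshold=0.9, max_types=2):
    """Pick the query sentence type(s) an original model handles most accurately."""
    accuracy_by_type = {
        qtype: evaluate_accuracy(model, test_set)   # accuracy on the test set for this type
        for qtype, test_set in test_sets.items()
    }
    # rank types from most to least accurate and keep those above the preset threshold
    ranked = sorted(accuracy_by_type.items(), key=lambda kv: kv[1], reverse=True)
    selected = [qtype for qtype, acc in ranked if acc >= accuracy_threshold]
    return selected[:max_types] or [ranked[0][0]]   # fall back to the single best type
```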
And in a second mode, acquiring the target query statement type corresponding to each original semantic retrieval model based on the attribute of each original semantic retrieval model.
In this implementation, the attributes of the original semantic retrieval model may be predefined. Specifically, the attribute of the original semantic search model may be configured based on the type, domain or scene of the training set when the model is trained, and the attribute may identify the type, domain or scene of training data used by the original semantic search model when training, which indicates that the original semantic search model is good at processing query terms of the type, domain or scene, and the accuracy of processing query terms of the type, domain or scene is highest relative to processing query terms of other types. Wherein the training set types may also include an address class, a find answer class, or a find resource class. Based on the above, the attribute of the original semantic retrieval model can be used as the target query statement type corresponding to the original semantic retrieval model.
In this embodiment, no matter what way is adopted, the target query statement type corresponding to each original semantic retrieval model can be accurately obtained.
And then, based on the target query statement types corresponding to the original semantic retrieval models, acquiring target query statements corresponding to the original semantic retrieval models. For example, various types of query statements may be collected based on historical behavioral data of the user and stored in a corpus of query statements. When the method is used, based on the target query statement type, any corresponding query statement is obtained from the query statement corpus and used as the target query statement. Or may acquire the target query statement in other manners. For example, any query statement of a corresponding type is obtained from the user's log directly based on the target query statement type as the target query statement.
In this embodiment, when generating the distillation data set, steps S202 and S203 above may be repeated according to the size of the distillation data set to be generated, so as to obtain each target query sentence corresponding to each original semantic retrieval model and the preset number of recalled corpora recalled from the corpus for each target query sentence. That is, for one target query sentence, one original semantic retrieval model recalls a preset number of recalled corpora from the corpus. The preset number in this embodiment may be set according to actual requirements; for example, it may be 100, 80, 50, 20, or another value.
In this embodiment, the target query sentence corresponding to each original semantic retrieval model and the preset number of recalled corpora may be used directly as distillation data and added to the distillation data set. Since each recalled corpus is obtained by the original semantic retrieval model according to the target query sentence, this distillation data serves as positive samples in the distillation data set: the relevance between the target query sentence and the recalled corpus is configured to be 1, marking the distillation data as positive samples and indicating that the probability of the recalled corpus being recalled for that target query sentence is 1.
Conversely, when constructing the negative-sample distillation data in the distillation data set, negative-sample corpora that should not be recalled can be constructed with reference to the target query sentence corresponding to the original semantic retrieval model and the preset number of recalled corpora, and the relevance between the target query sentence and the negative-sample corpus is configured to be 0, marking the distillation data as negative samples and indicating that the probability of the negative-sample corpus being recalled for that target query sentence is 0.
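A minimal sketch of these direct positive/negative distillation records follows; the field names are illustrative, not prescribed by the patent.

```python
def make_distill_records(target_query, recalled_corpora, negative_corpora):
    """Positive samples pair the query with corpora the original model actually recalled
    (relevance 1); negatives pair it with corpora that should not be recalled (relevance 0)."""
    records = []
    for passage in recalled_corpora:
        records.append({"query": target_query, "passage": passage, "relevance": 1.0})
    for passage in negative_corpora:
        records.append({"query": target_query, "passage": passage, "relevance": 0.0})
    return records
```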
Steps S201-S204 above are one implementation of steps S101-S102 in the embodiment shown in FIG. 1.
S205, training a target semantic retrieval model based on the distillation data set.
According to the training method of the semantic retrieval model of this embodiment, the target query sentences corresponding to each original semantic retrieval model and the preset number of recalled corpora are acquired to generate a distillation data set, and the target semantic retrieval model is then trained on that distillation data set. Because the acquired target query sentences and the corresponding recalled corpora fully embody the capability and performance of each original semantic retrieval model, the generated distillation data set carries the characteristics of each original semantic retrieval model. Training the target semantic retrieval model on this distillation data set therefore allows the trained model to fuse the retrieval capabilities of at least two original semantic retrieval models, overcome the shortcomings of a single semantic retrieval model, and effectively improve the accuracy of semantic retrieval.
Fig. 3 is a schematic diagram according to a third embodiment of the present disclosure. The training method of the semantic search model of the present embodiment further describes the technical solution of the present disclosure in more detail on the basis of the technical solution of the embodiment shown in fig. 1. As shown in fig. 3, the training method of the semantic search model of the present embodiment specifically may include the following steps:
S301, acquiring a target query sentence type corresponding to each original semantic retrieval model in at least two original semantic retrieval models based on a test set corresponding to various pre-established query sentences;
the specific implementation may refer to the relevant descriptions of the embodiment shown in fig. 2, and will not be repeated here.
S302, acquiring target query sentences corresponding to the original semantic retrieval models based on the target query sentence types corresponding to the original semantic retrieval models;
s303, recalling a preset number of recalled corpora from the corpus based on each original semantic retrieval model and target query sentences corresponding to each original semantic retrieval model;
s304, screening target query sentences corresponding to each original semantic retrieval model and recalled preset number of recalled corpora by adopting a pre-trained fine-ranking model to generate a distillation data set;
steps S301-S304 are one implementation of step S101 in the embodiment shown in fig. 1.
Unlike the embodiment shown in FIG. 2, in this embodiment a fine-ranking model is further adopted when generating the distillation data set: the target query sentences corresponding to each original semantic retrieval model and the preset number of recalled corpora are screened by the fine-ranking model to generate the distillation data set. For example, FIG. 4 is a schematic diagram of the training method of the semantic retrieval model of this embodiment. Correspondingly, the architecture of the embodiment shown in FIG. 2 can be regarded as the structure of FIG. 4 with the screening of recall results by the fine-ranking model omitted.
The original semantic retrieval models and the target semantic retrieval model that use a neural network structure can adopt a double-tower (dual-encoder) structure: the query sentence and the candidate corpus are encoded separately, and their vector similarity is calculated from the encoding results. The fine-ranking model of this embodiment can be trained in advance on labeled data. The fine-ranking model can model the interaction information between the query sentence and the candidate corpus, and is therefore more capable than a corresponding model with a double-tower structure. Screening and filtering the recall results of the original semantic retrieval models with the fine-ranking model can thus effectively improve the quality of the distillation data in the distillation data set.
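The difference between the two architectures can be sketched as follows, assuming `query_encoder`, `passage_encoder` and `cross_encoder` are callables returning PyTorch tensors; this illustrates double-tower versus fine-ranking (cross-encoder) scoring in general and is not the patent's exact implementation.

```python
import torch

def dual_tower_score(query_encoder, passage_encoder, query, passage):
    """Double-tower model: query and candidate corpus are encoded independently,
    and similarity is the dot product of the two vectors."""
    q_vec = query_encoder(query)        # shape [hidden_dim]
    p_vec = passage_encoder(passage)    # encoded without seeing the query
    return torch.dot(q_vec, p_vec)

def fine_ranking_score(cross_encoder, query, passage):
    """Fine-ranking model: query and candidate are fed jointly, so token-level
    interaction information between them can be modelled."""
    return cross_encoder(query, passage)  # a single relevance score
```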
In one embodiment of the present disclosure, step S304, which uses a pre-trained fine-ranking model to screen the target query sentences corresponding to each original semantic retrieval model and the preset number of recalled corpora to generate a distillation data set, may specifically include the following steps:
(a1) Screening positive sample data based on the fine-ranking model, target query sentences corresponding to the original semantic retrieval models and a preset number of recalled corpora;
(b1) Screening negative sample data based on the fine-ranking model, target query sentences corresponding to the original semantic retrieval models and a preset number of recalled corpora;
(c1) Positive sample data and negative sample data are stored in a distillation data set.
For example, in this embodiment, positive sample data and negative sample data may each be screened as distillation data based on the fine-ranking model and together form the distillation data set. The ratio of positive to negative sample data in the distillation data set may be set according to actual requirements, for example 1:1, 1:2, 1:3, 1:4 or another ratio, which is not limited here, and the corresponding sample data may be screened according to the required number of samples.
In this embodiment, when generating the distillation data set, the way in which samples or distillation data are fused may also not distinguish positive sample data from negative sample data; for example, each piece of distillation data may include a query sentence, at least two recalled corpora, and a relevance ranking of those recalled corpora with respect to the query sentence. Distillation data in this form can be regarded as soft-label data, and training the target semantic retrieval model on it allows the model to learn the relevance-score ordering of different recalled corpora for the same query sentence. In this embodiment, step S304, in which the pre-trained fine-ranking model screens the target query sentences corresponding to each original semantic retrieval model and the preset number of recalled corpora to generate the distillation data set, may fall into the following three cases:
First case, direct merging;
in this case, positive sample data and negative sample data need to be acquired in the above manner based on each original semantic retrieval model; and then, directly combining positive sample data and negative sample data generated by each original semantic retrieval model in at least two original semantic retrieval models to obtain a final distillation data set. That is, the distilled data set obtained in this case includes the positive sample data and the negative sample data distilled out by each original semantic retrieval model.
For example, correspondingly, step (a 1) in this case may comprise the steps of:
(a2) For target query sentences corresponding to each original semantic retrieval model, calculating relevance scores of each recall corpus of the target query sentences and a preset number of corresponding recall corpuses by adopting a fine-ranking model;
(b2) Deleting the recalled corpus with the relevance score smaller than a preset threshold value from the first N pieces of the preset number of recalled corpus; wherein N is a positive integer greater than 1;
(c2) And constructing positive sample data based on the target query statement and the rest recall corpus of the first N pieces of recall corpus in the preset quantity.
For example, the positive sample data constructed at this time may include the target query sentence, and the recall corpus with the relevance score greater than or equal to the preset threshold value in the first N pieces of recall corpus in the preset number. Because the sample is to be used as a positive sample, and the target semantic retrieval model is trained, the label of the relevance score of the positive sample data can be reconfigured to be 1 at the moment, so that the target semantic retrieval model can learn the capability of recalling the corpus in the corresponding positive sample data based on the target query statement.
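A hedged sketch of steps (a2)-(c2) follows, with `fine_ranker` as a hypothetical scoring callable and N = 20, threshold = 0.1 as example values only.

```python
def screen_positive_samples(fine_ranker, target_query, recalled, n=20, threshold=0.1):
    """Score the recalled corpora, drop those among the first N whose relevance score is
    below the threshold, and relabel the remaining ones as positives (relevance 1)."""
    positives = []
    for passage in recalled[:n]:
        if fine_ranker(target_query, passage) >= threshold:   # low-scoring corpora are deleted
            positives.append({"query": target_query, "passage": passage, "relevance": 1.0})
    return positives
```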
For example, correspondingly, step (b 1) in this case may comprise the steps of:
(a3) Selecting, for the target query sentence corresponding to each original semantic retrieval model, the recalled corpora whose relevance scores are smaller than a preset threshold from the recalled corpora after the (N+1)th among the preset number of recalled corpora; wherein N is a positive integer greater than 1;
(b3) And constructing negative sample data based on the target query statement and the recall corpus with the relevance score smaller than a preset threshold value selected from the N+1th recall corpus in the preset number of recall corpora.
Similarly, the negative sample data constructed at this time may include the target query sentence and the recalled corpora after the (N+1)th among the preset number whose relevance scores are smaller than the preset threshold. Because these samples serve as negatives when training the target semantic retrieval model, the relevance-score label of the negative sample data can be reconfigured to 0, so that the target semantic retrieval model learns not to recall the corpora in the corresponding negative sample data for the target query sentence.
For example, take N = 20 when the preset number is 100. For the first 100 corpora recalled by an original semantic retrieval model for any target query sentence, the recalled corpora in the first 20 whose relevance scores are smaller than the preset threshold can be removed, which improves the quality of the positive samples; and negative sample data can be constructed from the recalled corpora between the 21st and the 100th whose relevance scores are smaller than the preset threshold. The preset threshold may be set empirically, such as 0.1, 0.2 or another value. In this way, the quality of the distillation data in the distillation data set can be effectively improved.
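The mirror-image sketch of steps (a3)-(b3), under the same assumptions (hypothetical `fine_ranker`, example values N = 20 and threshold = 0.1):

```python
def screen_negative_samples(fine_ranker, target_query, recalled, n=20, threshold=0.1):
    """From the (N+1)-th recalled corpus onwards, keep only the corpora whose fine-ranking
    score is below the threshold and label them as negatives (relevance 0)."""
    negatives = []
    for passage in recalled[n:]:
        if fine_ranker(target_query, passage) < threshold:
            negatives.append({"query": target_query, "passage": passage, "relevance": 0.0})
    return negatives
```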
Second case, cross-merging;
In this case, the positive sample data generated for each original semantic retrieval model are all taken and stored in the distillation data set, while the negative sample data are screened from the recall results of all of the at least two original semantic retrieval models. That is, the distillation data set obtained in this case fully includes the positive sample data distilled from every original semantic retrieval model, but the negative sample data may be distilled from only some of the original semantic retrieval models.
For example, at this time, correspondingly, step (b1) screens negative sample data based on the fine-ranking model, the target query sentence corresponding to each original semantic retrieval model, and the preset number of recalled corpora, and may specifically include the following steps:
(a4) For target query sentences corresponding to each original semantic retrieval model, calculating relevance scores of each recall corpus in the corresponding target query sentences and the corresponding preset number of recall corpuses by adopting a fine-ranking model;
(b4) And screening negative sample data from all the recalled corpora of at least two original semantic retrieval models according to the relevance scores of each target query statement and each recalled corpus in a preset mode.
For example, the required number of negative sample data may be screened in ascending order of relevance score, or all recalled corpora whose relevance scores are smaller than a preset threshold may be taken directly, or negative sample data may be screened in other ways, which is not limited here. This approach can refer to all recalled corpora of the at least two original semantic retrieval models and select higher-quality negative sample data, thereby improving the quality of the distillation data set.
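One way the cross-merging negative screening of steps (a4)-(b4) might look, pooling the recall results of all original models and taking the lowest-scoring pairs; the sorting-by-score strategy and the numeric defaults are assumptions rather than requirements of the patent.

```python
def cross_merge_negatives(fine_ranker, queries_with_recalls, num_negatives=100, threshold=0.1):
    """`queries_with_recalls` holds (target_query, recalled_corpora) pairs pooled
    from all original semantic retrieval models."""
    scored = []
    for target_query, recalled in queries_with_recalls:
        for passage in recalled:
            scored.append((fine_ranker(target_query, passage), target_query, passage))
    scored.sort(key=lambda item: item[0])               # least relevant pairs first
    return [
        {"query": query, "passage": passage, "relevance": 0.0}
        for score, query, passage in scored[:num_negatives]
        if score < threshold
    ]
```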
Third case, soft label data combination;
In this case, unlike the first two cases, the target query sentences corresponding to each original semantic retrieval model and the corresponding preset number of recalled corpora are not divided into positive and negative samples; the original scores or ranking results of the original semantic retrieval models are retained directly and merged. When this distillation data set is used to train the target semantic retrieval model, the model learns the scoring levels or ranking orders that the original semantic retrieval models assign to different sample data.
In practical applications, any one of the three cases may be selected as required to generate the distillation data set. In any of these ways, an accurate, reasonable and effective distillation data set can be obtained.
In this embodiment, in the above manner, the query sentence types that each original semantic retrieval model is good at processing are used as much as possible when generating the distillation data set, so that the advantages of each original semantic retrieval model are exploited to the greatest extent and advantageous distillation data are generated. For example, the RocketQAv2 model may perform better on question-type queries; in this step the distribution of query sentences used to generate the distillation data set can be adjusted to increase the proportion of question-type query sentences processed by the RocketQAv2 model, yielding more effective distillation data. The same applies to the other original semantic retrieval models: selecting the query sentence types they are good at processing produces more effective distillation data, which is not described in detail here.
S305, training a target semantic retrieval model based on the distillation data set.
The target semantic retrieval model of the embodiment is a model of a double-tower structure.
Different training modes may be used for the different ways of generating the distillation data set. For the first and second cases above, that is, the direct-merging and cross-merging generation modes, training can use a hard-label mode: a common contrastive learning setup with a cross-entropy loss and in-batch negative sampling. For the soft-label merging mode, that is, the third case of generating the distillation data set, training can use a margin-MSE mode, learning the score margin between pairs of samples. The two modes can be selected flexibly according to the training effect.
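The two training modes correspond to two standard loss formulations. The sketch below, written with PyTorch, is a generic rendering of cross-entropy with in-batch negatives and of margin MSE, not the patent's exact training code.

```python
import torch
import torch.nn.functional as F

def in_batch_negative_loss(query_vecs, passage_vecs):
    """Hard-label mode (direct / cross merging): each query's positive passage is the matching
    row, and every other passage in the batch serves as an in-batch negative."""
    logits = query_vecs @ passage_vecs.T                               # [batch, batch] similarities
    labels = torch.arange(query_vecs.size(0), device=query_vecs.device)  # diagonal = positives
    return F.cross_entropy(logits, labels)

def margin_mse_loss(student_pos, student_neg, teacher_pos, teacher_neg):
    """Soft-label mode (third case): the student learns the teacher's score margin between a
    positive and a negative passage for the same query."""
    return F.mse_loss(student_pos - student_neg, teacher_pos - teacher_neg)
```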
A target semantic retrieval model with a double-tower structure trained on the fused distillation data can integrate the characteristics of multiple teachers, namely the original semantic retrieval models. Because the data distribution of the different teachers is controlled when the distillation data are produced, only the parts where each teacher has an advantage and differs most are retained, so the weak-label training data output by the teachers can be exploited to the greatest extent.
Through training on such weak-label data, the double-tower target semantic retrieval model can outperform a model trained on labeled data: it generalizes better, overfits the labeled data less, and mitigates the robustness problems caused by data bias.
According to the training method of the semantic retrieval model, through the fine-ranking model, target query sentences of each original semantic retrieval model and the recall corpus corresponding to the preset number of recalls are obtained, a distillation data set is generated, the quality of the distillation data set is further effectively improved, and further the target semantic retrieval model is trained based on the distillation data set, so that the accuracy of the trained target semantic retrieval model is better.
According to the training method of the semantic retrieval model, the capability of at least two original semantic retrieval models can be integrated into the target semantic retrieval model of the double-tower structure, and the retrieval capability of the target semantic retrieval model can be effectively improved.
According to the training method of the semantic retrieval model of this embodiment, when the at least two original semantic retrieval models include a sparse-vector retrieval model, the target semantic retrieval model can learn some sparse-vector retrieval capability, improving its ability to handle scenarios that require fine-grained literal matching, strengthening its generalization, and achieving better zero-shot domain transfer.
In addition, compared with existing double-tower semantic retrieval models, this solution does not change the structure of the double-tower model; it retains the rapid deployment and high retrieval efficiency of the double-tower model and can therefore be widely applied to large-scale retrieval scenarios.
According to the training method of the semantic retrieval model, when the training method is applied, a scheme for integrating the capabilities of various original semantic retrieval models can be flexibly expanded, and after a model or system with stronger capabilities in certain aspect appears, the advantage of the system can be effectively absorbed by adopting the scheme, so that the performance of a target semantic retrieval model is improved.
FIG. 5 is a schematic diagram according to a fourth embodiment of the present disclosure; as shown in fig. 5, the present embodiment provides a training apparatus 500 for a semantic search model, including:
The type obtaining module 501 is configured to obtain a target query statement type corresponding to each original semantic search model in at least two original semantic search models, where the target query statement type corresponding to the original semantic search model is a query statement type with highest accuracy in processing query statements of various types by the original semantic search model;
the data acquisition module 502 is configured to acquire a distillation data set based on at least two original semantic retrieval models, a target query sentence type corresponding to each original semantic retrieval model, and a pre-established corpus;
a training module 503, configured to train the target semantic retrieval model based on the distillation data set.
The training device 500 for the semantic retrieval model of this embodiment uses the above modules to implement the training of the semantic retrieval model; its implementation principle and technical effect are the same as those of the related method embodiments above, and details may be found in the descriptions of those embodiments, which are not repeated here.
FIG. 6 is a schematic diagram according to a fifth embodiment of the present disclosure; the present embodiment provides a training device 600 for a semantic search model, which further describes the technical solution of the present disclosure in more detail on the basis of the technical solution of the embodiment shown in fig. 5. As shown in fig. 6, the present embodiment provides a training device 600 for a semantic search model, which includes the same-name and same-function modules shown in fig. 5, a type acquisition module 601, a data acquisition module 602, and a training module 603.
Wherein, the data acquisition module 602 includes:
a sentence acquisition unit 6021, configured to acquire a target query sentence corresponding to each original semantic search model based on a target query sentence type corresponding to each original semantic search model;
a corpus acquisition unit 6022, configured to recall a preset number of recalled corpora from the corpus based on each original semantic retrieval model and the target query statement corresponding to each original semantic retrieval model;
the generating unit 6023 is configured to generate a distillation dataset based on the target query sentence corresponding to each original semantic search model and the recall corpus corresponding to the preset number of recalls.
Further, in one embodiment of the present disclosure, a type acquisition module 601 is configured to:
and acquiring the types of target query sentences corresponding to each original semantic retrieval model based on a test set corresponding to various pre-established query sentences.
Further, in one embodiment of the present disclosure, a type acquisition module 601 is configured to:
and acquiring the target query statement type corresponding to each original semantic retrieval model based on the attribute of each original semantic retrieval model.
Further, in an embodiment of the present disclosure, the generating unit 6023 is configured to:
And screening target query sentences corresponding to each original semantic retrieval model and a preset number of recall corpora corresponding to the recall by adopting a pre-trained refined-ranking model to generate a distillation data set.
Further, in an embodiment of the present disclosure, the generating unit 6023 is configured to:
screening positive sample data based on the fine-ranking model, target query sentences corresponding to the original semantic retrieval models and a preset number of recall corpora corresponding to recall;
screening negative sample data based on the fine-ranking model, target query sentences corresponding to the original semantic retrieval models and a preset number of recall corpora corresponding to recall;
positive sample data and negative sample data are stored in a distillation data set.
Further, in an embodiment of the present disclosure, the generating unit 6023 is configured to:
for target query sentences corresponding to each original semantic retrieval model, calculating relevance scores of each recall corpus in the target query sentences and the corresponding recalled preset number of recall corpuses by adopting a fine-ranking model;
deleting the recalled corpus with the relevance score smaller than a preset threshold value from the first N pieces of the preset number of recalled corpus; wherein N is a positive integer greater than 1;
And constructing positive sample data based on the target query statement and the rest recall corpus of the first N pieces of recall corpus in the preset quantity.
Further, in an embodiment of the present disclosure, the generating unit 6023 is configured to:
selecting, for the target query sentence corresponding to each original semantic retrieval model, the recalled corpora whose relevance scores are smaller than a preset threshold from the recalled corpora after the (N+1)th among the preset number of recalled corpora; wherein N is a positive integer greater than 1;
and constructing negative sample data based on the target query statement and the recall corpus with the relevance score smaller than a preset threshold value selected from the N+1th recall corpus in the preset number of recall corpora.
Further, in an embodiment of the present disclosure, the generating unit 6023 is configured to:
for target query sentences corresponding to each original semantic retrieval model, calculating relevance scores of each recall corpus in the corresponding target query sentences and the corresponding preset number of recall corpuses by adopting a fine-ranking model;
and screening negative sample data from all the recalled corpora of at least two original semantic retrieval models according to the relevance scores of each target query statement and each recalled corpus in a preset mode.
The training device 600 for the semantic retrieval model of this embodiment uses the above modules to implement the training of the semantic retrieval model; its implementation principle and technical effect are the same as those of the related method embodiments above, and details may be found in the descriptions of those embodiments, which are not repeated here.
In the technical scheme of the disclosure, the acquisition, storage, application and the like of the related user personal information all conform to the regulations of related laws and regulations, and the public sequence is not violated.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 7 illustrates a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the apparatus 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in device 700 are connected to I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, etc.; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the various methods and processes described above, such as the above-described methods of the present disclosure. For example, in some embodiments, the above-described methods of the present disclosure may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 700 via ROM 702 and/or communication unit 709. When the computer program is loaded into RAM 703 and executed by computing unit 701, one or more steps of the above-described methods of the present disclosure described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the above-described methods of the present disclosure by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (18)

1. A training method of a semantic retrieval model, comprising:
obtaining a target query statement type corresponding to each original semantic retrieval model in at least two original semantic retrieval models, wherein the target query statement type corresponding to an original semantic retrieval model is the query statement type on which that original semantic retrieval model achieves the highest accuracy among the various types of query statements it processes; the target query statement type comprises an addressing class, a searching answer class or a searching resource class;
obtaining a distillation data set based on the at least two original semantic retrieval models, the target query statement type corresponding to each original semantic retrieval model, and a pre-established corpus;
training a target semantic retrieval model based on the distillation data set;
wherein obtaining the distillation data set based on the at least two original semantic retrieval models, the target query statement type corresponding to each of the original semantic retrieval models, and the pre-established corpus comprises:
acquiring target query statements corresponding to each of the original semantic retrieval models based on the target query statement type corresponding to each of the original semantic retrieval models;
recalling a preset number of recalled corpora from the corpus based on each of the original semantic retrieval models and the target query statements corresponding to each of the original semantic retrieval models;
and generating the distillation data set based on the target query statements corresponding to each of the original semantic retrieval models and the correspondingly recalled preset number of recalled corpora.
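For readers tracing the flow of claim 1, the following Python sketch lays out the three claimed steps end to end. It is a minimal illustration only: the retrieval-model and corpus objects, the recall() and update() methods, and the query-type labels are hypothetical stand-ins, not an implementation disclosed by this patent.

```python
from typing import Dict, List

def build_distillation_dataset(original_models: Dict[str, object],
                               target_query_type: Dict[str, str],
                               queries_by_type: Dict[str, List[str]],
                               corpus: object,
                               preset_number: int = 50) -> List[dict]:
    """Step 2 of claim 1: collect recalled corpora for each original model's
    target query statement type (all names here are illustrative assumptions)."""
    dataset = []
    for name, model in original_models.items():
        q_type = target_query_type[name]           # e.g. "addressing" / "searching answer" / "searching resource"
        for query in queries_by_type[q_type]:      # target query statements of that type
            recalled = model.recall(query, corpus, preset_number)  # preset number of recalled corpora
            dataset.append({"query": query, "recalled": recalled, "source_model": name})
    return dataset

def train_target_model(target_model, distillation_dataset, epochs: int = 3):
    """Step 3 of claim 1: train the target semantic retrieval model on the distillation data set."""
    for _ in range(epochs):
        for sample in distillation_dataset:
            target_model.update(sample["query"], sample["recalled"])
    return target_model
```

In this reading, each original model contributes training signal only for the query statement type it handles best, which is what allows a single target model to inherit the complementary strengths of the originals.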
2. The method of claim 1, wherein obtaining the target query statement type corresponding to each of the at least two original semantic retrieval models comprises:
and acquiring the target query statement type corresponding to each of the original semantic retrieval models based on test sets pre-established for various types of query statements.
3. The method of claim 1, wherein obtaining the target query statement type corresponding to each of the at least two original semantic retrieval models comprises:
and acquiring the target query statement type corresponding to each original semantic retrieval model based on the attribute of each original semantic retrieval model.
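Claims 2 and 3 name two ways of determining the target query statement type: evaluating each original model on per-type test sets, or reading it off the model's attributes. A possible reading of the test-set route of claim 2 is sketched below; the test-set format, the top-1 accuracy metric, and the retrieve_ids() method are assumptions made purely for illustration.

```python
def pick_target_query_type(model, test_sets):
    """test_sets: {query_statement_type: [(query, relevant_doc_id), ...]}.
    Returns the query statement type on which this original model is most accurate (claim 2)."""
    best_type, best_accuracy = None, -1.0
    for q_type, pairs in test_sets.items():
        hits = sum(1 for query, gold_id in pairs
                   if gold_id in model.retrieve_ids(query, top_k=1))
        accuracy = hits / max(len(pairs), 1)
        if accuracy > best_accuracy:
            best_type, best_accuracy = q_type, accuracy
    return best_type
```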
4. The method of claim 1, wherein generating the distillation data set based on the target query statements corresponding to each of the original semantic retrieval models and the correspondingly recalled preset number of recalled corpora comprises:
screening, with a pre-trained fine-ranking model, the target query statements corresponding to each of the original semantic retrieval models and the correspondingly recalled preset number of recalled corpora to generate the distillation data set.
5. The method of claim 4, wherein screening, with the pre-trained fine-ranking model, the target query statements corresponding to each of the original semantic retrieval models and the correspondingly recalled preset number of recalled corpora to generate the distillation data set comprises:
screening positive sample data based on the fine-ranking model, the target query statements corresponding to each of the original semantic retrieval models, and the correspondingly recalled preset number of recalled corpora;
screening negative sample data based on the fine-ranking model, the target query statements corresponding to each of the original semantic retrieval models, and the correspondingly recalled preset number of recalled corpora;
and storing the positive sample data and the negative sample data in the distillation data set.
6. The method of claim 5, wherein screening the positive sample data based on the fine-ranking model, the target query statements corresponding to each of the original semantic retrieval models, and the correspondingly recalled preset number of recalled corpora comprises:
for the target query statement corresponding to each of the original semantic retrieval models, calculating, with the fine-ranking model, a relevance score between the target query statement and each recalled corpus in the correspondingly recalled preset number of recalled corpora;
deleting, from the first N of the preset number of recalled corpora, the recalled corpora whose relevance scores are smaller than a preset threshold value, wherein N is a positive integer greater than 1;
and constructing the positive sample data based on the target query statement and the remaining recalled corpora among the first N of the preset number of recalled corpora.
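A compact sketch of the positive-sample screening of claim 6, assuming a cross-encoder style fine-ranking model that exposes a score(query, passage) method; the method name, the threshold value, and the choice of N are illustrative assumptions, not values fixed by the patent.

```python
def screen_positive_samples(fine_ranking_model, query, recalled, n=4, threshold=0.9):
    """recalled: the preset number of recalled corpora for this query, in recall order."""
    top_n = recalled[:n]                                               # first N recalled corpora
    kept = [passage for passage in top_n
            if fine_ranking_model.score(query, passage) >= threshold]  # drop low-relevance entries
    return [(query, passage, 1) for passage in kept]                   # label 1 = positive sample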
7. The method of claim 5, wherein screening the negative sample data based on the fine-ranking model, the target query statements corresponding to each of the original semantic retrieval models, and the correspondingly recalled preset number of recalled corpora comprises:
for the target query statement corresponding to each of the original semantic retrieval models, selecting, from the recalled corpora after the (N+1)-th of the preset number of recalled corpora, the recalled corpora whose relevance scores are smaller than a preset threshold value, wherein N is a positive integer greater than 1;
and constructing the negative sample data based on the target query statement and the recalled corpora whose relevance scores are smaller than the preset threshold value, selected from the recalled corpora after the (N+1)-th of the preset number of recalled corpora.
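Claim 7 mirrors claim 6 but looks past the head of the recall list: negatives come from low-relevance corpora beyond the top N. The sketch below uses the same assumed score() method; the exact boundary index, threshold, and sampling size are assumptions made only for illustration.

```python
import random

def screen_negative_samples(fine_ranking_model, query, recalled, n=4,
                            threshold=0.1, max_negatives=8):
    """Select low-relevance passages from beyond the first N recalled corpora as negatives."""
    tail = recalled[n:]                                    # recalled corpora beyond the first N
    candidates = [passage for passage in tail
                  if fine_ranking_model.score(query, passage) < threshold]
    chosen = random.sample(candidates, min(max_negatives, len(candidates)))
    return [(query, passage, 0) for passage in chosen]     # label 0 = negative sample
```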
8. The method of claim 5, wherein screening the negative sample data based on the fine-ranking model, the target query statements corresponding to each of the original semantic retrieval models, and the correspondingly recalled preset number of recalled corpora comprises:
for the target query statement corresponding to each of the original semantic retrieval models, calculating, with the fine-ranking model, a relevance score between the corresponding target query statement and each recalled corpus in the correspondingly recalled preset number of recalled corpora;
and screening the negative sample data from all the recalled corpora of the at least two original semantic retrieval models in a preset mode according to the relevance scores between the target query statements and the recalled corpora.
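Claim 8 screens negatives globally across the recall results of all original models "in a preset mode". One possible preset mode, shown only as an assumption, is to score every (query, recalled corpus) pair with the fine-ranking model and keep the lowest-scoring pairs as negatives:

```python
def screen_negatives_across_models(fine_ranking_model, grouped_recalls, max_negatives=1000):
    """grouped_recalls: [(query, recalled_corpora), ...] pooled from all original models."""
    scored = []
    for query, recalled in grouped_recalls:
        for passage in recalled:
            scored.append((fine_ranking_model.score(query, passage), query, passage))
    scored.sort(key=lambda item: item[0])                   # lowest relevance first
    return [(query, passage, 0) for _, query, passage in scored[:max_negatives]]
```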
9. A training device for a semantic retrieval model, comprising:
the type acquisition module is used for acquiring a target query statement type corresponding to each original semantic retrieval model in at least two original semantic retrieval models, wherein the target query statement type corresponding to an original semantic retrieval model is the query statement type on which that original semantic retrieval model achieves the highest accuracy among the various types of query statements it processes; the target query statement type comprises an addressing class, a searching answer class or a searching resource class;
the data acquisition module is used for acquiring a distillation data set based on at least two original semantic retrieval models, a target query statement type corresponding to each original semantic retrieval model and a pre-established corpus;
the training module is used for training the target semantic retrieval model based on the distillation data set;
the data acquisition module comprises:
the statement acquisition unit is used for acquiring target query statements corresponding to each of the original semantic retrieval models based on the target query statement type corresponding to each of the original semantic retrieval models;
the corpus acquisition unit is used for recalling a preset number of recalled corpora from the corpus based on each of the original semantic retrieval models and the target query statements corresponding to each of the original semantic retrieval models;
the generation unit is used for generating the distillation data set based on the target query statements corresponding to each of the original semantic retrieval models and the correspondingly recalled preset number of recalled corpora.
10. The apparatus of claim 9, wherein the type acquisition module is configured to:
and acquiring the target query statement type corresponding to each of the original semantic retrieval models based on test sets pre-established for various types of query statements.
11. The apparatus of claim 9, wherein the type acquisition module is configured to:
and acquiring the target query statement type corresponding to each original semantic retrieval model based on the attribute of each original semantic retrieval model.
12. The apparatus of claim 9, wherein the generating unit is configured to:
and screening, with a pre-trained fine-ranking model, the target query statements corresponding to each of the original semantic retrieval models and the correspondingly recalled preset number of recalled corpora to generate the distillation data set.
13. The apparatus of claim 12, wherein the generating unit is configured to:
screening positive sample data based on the fine-ranking model, the target query statements corresponding to each of the original semantic retrieval models, and the correspondingly recalled preset number of recalled corpora;
screening negative sample data based on the fine-ranking model, the target query statements corresponding to each of the original semantic retrieval models, and the correspondingly recalled preset number of recalled corpora;
and storing the positive sample data and the negative sample data in the distillation data set.
14. The apparatus of claim 13, wherein the generating unit is configured to:
for the target query statement corresponding to each of the original semantic retrieval models, calculating, with the fine-ranking model, a relevance score between the target query statement and each recalled corpus in the correspondingly recalled preset number of recalled corpora;
deleting, from the first N of the preset number of recalled corpora, the recalled corpora whose relevance scores are smaller than a preset threshold value, wherein N is a positive integer greater than 1;
and constructing the positive sample data based on the target query statement and the remaining recalled corpora among the first N of the preset number of recalled corpora.
15. The apparatus of claim 13, wherein the generating unit is configured to:
for the target query statement corresponding to each of the original semantic retrieval models, selecting, from the recalled corpora after the (N+1)-th of the preset number of recalled corpora, the recalled corpora whose relevance scores are smaller than a preset threshold value, wherein N is a positive integer greater than 1;
and constructing the negative sample data based on the target query statement and the recalled corpora whose relevance scores are smaller than the preset threshold value, selected from the recalled corpora after the (N+1)-th of the preset number of recalled corpora.
16. The apparatus of claim 13, wherein the generating unit is configured to:
for the target query statement corresponding to each of the original semantic retrieval models, calculating, with the fine-ranking model, a relevance score between the corresponding target query statement and each recalled corpus in the correspondingly recalled preset number of recalled corpora;
and screening the negative sample data from all the recalled corpora of the at least two original semantic retrieval models in a preset mode according to the relevance scores between the target query statements and the recalled corpora.
17. An electronic device, comprising:
At least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-8.
CN202210769033.7A 2022-06-30 2022-06-30 Training method and device for semantic retrieval model, electronic equipment and storage medium Active CN115168537B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210769033.7A CN115168537B (en) 2022-06-30 2022-06-30 Training method and device for semantic retrieval model, electronic equipment and storage medium
JP2023037698A JP2024006944A (en) 2022-06-30 2023-03-10 Semantic retrieval model training method, apparatus, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210769033.7A CN115168537B (en) 2022-06-30 2022-06-30 Training method and device for semantic retrieval model, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115168537A CN115168537A (en) 2022-10-11
CN115168537B true CN115168537B (en) 2023-06-27

Family

ID=83489579

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210769033.7A Active CN115168537B (en) 2022-06-30 2022-06-30 Training method and device for semantic retrieval model, electronic equipment and storage medium

Country Status (2)

Country Link
JP (1) JP2024006944A (en)
CN (1) CN115168537B (en)

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107491518B (en) * 2017-08-15 2020-08-04 北京百度网讯科技有限公司 Search recall method and device, server and storage medium
US10984388B2 (en) * 2018-12-14 2021-04-20 International Business Machines Corporation Identifying complaints from messages
US10970278B2 (en) * 2019-03-29 2021-04-06 Microsoft Technology Licensing, Llc Querying knowledge graph with natural language input
CN110377905A (en) * 2019-06-28 2019-10-25 北京百度网讯科技有限公司 Semantic expressiveness processing method and processing device, computer equipment and the readable medium of sentence
CN111651578B (en) * 2020-06-02 2023-10-03 北京百度网讯科技有限公司 Man-machine conversation method, device and equipment
CN111709252B (en) * 2020-06-17 2023-03-28 北京百度网讯科技有限公司 Model improvement method and device based on pre-trained semantic model
CN112307048B (en) * 2020-10-30 2023-12-05 中国平安财产保险股份有限公司 Semantic matching model training method, matching method, device, equipment and storage medium
CN113220864B (en) * 2021-07-08 2021-10-01 中航信移动科技有限公司 Intelligent question-answering data processing system
CN113988157B (en) * 2021-09-30 2023-10-13 北京百度网讯科技有限公司 Semantic retrieval network training method and device, electronic equipment and storage medium
CN113886531A (en) * 2021-10-28 2022-01-04 中国平安人寿保险股份有限公司 Intelligent question and answer determining method and device, computer equipment and storage medium
CN114238564A (en) * 2021-12-09 2022-03-25 阳光保险集团股份有限公司 Information retrieval method and device, electronic equipment and storage medium
CN114416927B (en) * 2022-01-24 2024-04-02 招商银行股份有限公司 Intelligent question-answering method, device, equipment and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110826337A (en) * 2019-10-08 2020-02-21 西安建筑科技大学 Short text semantic training model obtaining method and similarity matching algorithm
EP3835996A1 (en) * 2019-12-12 2021-06-16 Beijing Baidu Netcom Science And Technology Co., Ltd. Method, apparatus, electronic device and storage medium for processing a semantic representation model
CN112507091A (en) * 2020-12-01 2021-03-16 百度健康(北京)科技有限公司 Method, device, equipment and storage medium for retrieving information
CN113157727A (en) * 2021-05-24 2021-07-23 腾讯音乐娱乐科技(深圳)有限公司 Method, apparatus and storage medium for providing recall result

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Construction of a Question Retrieval Model for Community Question-Answering Systems Based on HNC Theory; Xia Yuanyuan; Wang Yu; Computer Applications and Software (Issue 08); full text *

Also Published As

Publication number Publication date
JP2024006944A (en) 2024-01-17
CN115168537A (en) 2022-10-11

Similar Documents

Publication Publication Date Title
CN111625635B (en) Question-answering processing method, device, equipment and storage medium
CN114549874B (en) Training method of multi-target image-text matching model, image-text retrieval method and device
CN112860866B (en) Semantic retrieval method, device, equipment and storage medium
CN112487173A (en) Man-machine conversation method, device and storage medium
CN113590796B (en) Training method and device for ranking model and electronic equipment
CN113360700B (en) Training of image-text retrieval model, image-text retrieval method, device, equipment and medium
CN114861889B (en) Deep learning model training method, target object detection method and device
CN115359383B (en) Cross-modal feature extraction and retrieval and model training method, device and medium
CN113553412A (en) Question and answer processing method and device, electronic equipment and storage medium
CN116152833B (en) Training method of form restoration model based on image and form restoration method
CN116204672A (en) Image recognition method, image recognition model training method, image recognition device, image recognition model training device, image recognition equipment, image recognition model training equipment and storage medium
CN112506864B (en) File retrieval method, device, electronic equipment and readable storage medium
CN113919424A (en) Training of text processing model, text processing method, device, equipment and medium
CN112528146B (en) Content resource recommendation method and device, electronic equipment and storage medium
CN115248890A (en) User interest portrait generation method and device, electronic equipment and storage medium
CN115168537B (en) Training method and device for semantic retrieval model, electronic equipment and storage medium
CN112784600B (en) Information ordering method, device, electronic equipment and storage medium
CN114841172A (en) Knowledge distillation method, apparatus and program product for text matching double tower model
CN114417862A (en) Text matching method, and training method and device of text matching model
CN113377921B (en) Method, device, electronic equipment and medium for matching information
CN113377922B (en) Method, device, electronic equipment and medium for matching information
CN116383491B (en) Information recommendation method, apparatus, device, storage medium, and program product
CN116069914B (en) Training data generation method, model training method and device
CN116186269A (en) Text representation model pre-training method and device and electronic equipment
CN116226533A (en) News associated recommendation method, device and medium based on association prediction model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant