CN115328434A - Search result sorting method and device and electronic equipment - Google Patents

Search result sorting method and device and electronic equipment

Info

Publication number: CN115328434A
Application number: CN202210774652.5A
Authority: CN (China)
Prior art keywords: sample, negative, search results, model, search
Other languages: Chinese (zh)
Inventors: 陈武亚, 林悦
Current Assignee: Netease Hangzhou Network Co Ltd
Original Assignee: Netease Hangzhou Network Co Ltd
Application filed by Netease Hangzhou Network Co Ltd
Priority to CN202210774652.5A
Publication of CN115328434A
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/06 Arrangements for sorting, selecting, merging, or comparing data on individual record carriers
    • G06F 7/08 Sorting, i.e. grouping record carriers in numerical or other ordered sequence according to the classification of at least some of the information they carry
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/903 Querying
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Sorting Of Articles (AREA)

Abstract

The invention provides a method and an apparatus for ranking search results, and an electronic device. After a plurality of search results corresponding to a target search question are obtained, the target search question and the corresponding search results are input into a pre-trained ranking model, and semantic feature vectors of the target search question and of the search results are extracted by the ranking model; a ranking result of the plurality of search results is then determined based on the semantic feature vectors of the target search question and of the plurality of search results. Because the search results are ranked by a ranking model that is built on an autoencoding language model and trained with positive samples containing the answers to questions and with a plurality of negative samples, the model can output feature vectors that accurately represent the semantic features of the search results, which improves the reasonableness of the ranking of the search results.

Description

Search result sorting method and device and electronic equipment
Technical Field
The invention relates to the technical field of information retrieval, and in particular to a method and an apparatus for ranking search results, and to an electronic device.
Background
In a search system, the ranking of search results is extremely important and largely determines the user's search experience. Traditional ranking algorithms include the Boolean model, the vector space model, BM25, and the like. These models depend heavily on exact keyword matching, and when semantic matching must be considered, the reasonableness of their ranking results is poor. The advent of autoencoding language models has enabled the rapid development of natural language understanding. After an autoencoding language model is trained on sample data from the general domain, the model outputs feature vectors characterizing the similarity of search results, which are then used to rank them; however, when the search content belongs to a specialized professional domain, the reasonableness of the ranking results is still poor.
Disclosure of Invention
In view of this, the present invention provides a method, an apparatus, and an electronic device for ranking search results, so as to improve the reasonableness of the ranking of search results during search.
In a first aspect, an embodiment of the present invention provides a method for ranking search results, the method including: obtaining a plurality of search results corresponding to a target search question; inputting the target search question and the corresponding plurality of search results into a pre-trained ranking model, and extracting semantic feature vectors of the target search question and of the search results through the ranking model; and determining a ranking result of the plurality of search results based on the semantic feature vectors of the target search question and of the plurality of search results. The ranking model is built on an autoencoding language model and is trained with a plurality of training data items, each including a sample question, a positive sample corresponding to the sample question, and a plurality of negative samples; the positive sample contains the answer to the sample question.
The ranking model is trained in the following manner: determine training data from pre-constructed sample data, the training data including a sample question, a positive sample corresponding to the sample question, and a plurality of negative samples, the positive sample containing the answer to the sample question; input the training data into an initial model, and determine the semantic features of the sample question, the positive sample, and the negative samples in the training data through the initial model, the initial model being built on an autoencoding language model; calculate a loss value of the initial model based on the semantic features of the sample question, the positive sample, and the negative samples; update the model parameters of the initial model based on the loss value; and continue to execute the step of determining training data from the sample data until the loss value converges, and determine the initial model whose loss value has converged as the ranking model.
The sample data is constructed in the following manner: search a pre-acquired public data set for document data containing a question and the answer to the question, determine the found document data as the positive sample corresponding to the question, and determine the question as a sample question, the public data set including a plurality of document data items; determine the similarity between the question and the document data in the public data set other than the positive sample; and determine a plurality of document data items in the public data set, other than the positive sample, that meet preset conditions as the plurality of negative samples corresponding to the sample question. The preset conditions include: the document data does not contain the answer to the question, and the similarity of the document data satisfies a set similarity ranking condition.
The step of determining the similarity between the question and the document data in the public data set other than the positive sample includes: encoding the question and the plurality of document data items in the public data set based on a preset coarse-grained semantic understanding model, and determining coarse-grained semantic feature vectors of the question and of the document data; and calculating the similarity between the question and each item of document data in the public data set other than the positive sample based on the coarse-grained semantic feature vector of that document data and the coarse-grained semantic feature vector of the question.
The similarity ranking condition includes: the position of the document data, among the document data in the public data set other than the positive sample, in a ranking result generated from the document-data similarities is less than or equal to a preset target position. Determining the plurality of document data items that meet the preset conditions as the plurality of negative samples corresponding to the question includes the following steps: for each item of document data in the public data set other than the positive sample, judging whether it includes the answer to the question; if not, determining it as a candidate negative sample; ranking the candidate negative samples by similarity from high to low to obtain a ranking result; and determining the candidate negative samples whose position in the ranking result is less than or equal to the preset target position as the negative samples corresponding to the question.
The negative samples include a difficult negative sample and simple negative samples. The difficult negative sample is the negative sample with the highest similarity to the question among the plurality of negative samples; the simple negative samples are the negative samples other than the difficult negative sample. The similarity between a negative sample and the positive sample is calculated from the coarse-grained semantic feature vector of the negative sample and the coarse-grained semantic feature vector of the positive sample. Calculating the loss value of the initial model based on the semantic feature vectors of the sample question, the positive sample, and the negative samples includes: calculating a first loss value based on the semantic feature vectors of the sample question, the positive sample, and the difficult negative sample; calculating a second loss value based on the semantic feature vectors of the sample question, the positive sample, and the simple negative samples; and calculating the loss value of the initial model from the first loss value with its corresponding first preset weight and the second loss value with its corresponding second preset weight.
The step of calculating the first loss value includes: computing the semantic feature vectors of the sample question, the positive sample, and the difficult negative sample with the loss function of a pairwise algorithm to obtain the first loss value.
The step of calculating the second loss value includes: computing the semantic feature vectors of the sample question, the positive sample, and the simple negative samples with the loss function of a contrastive learning algorithm to obtain the second loss value.
The step of determining the ranking result of the plurality of search results based on the semantic feature vectors of the target search question and of the plurality of search results includes: for each search result, calculating the semantic relevance between the target search question and that search result based on their semantic feature vectors; and ranking the plurality of search results by semantic relevance from high to low to obtain the ranking result of the plurality of search results.
In a second aspect, an embodiment of the present invention provides an apparatus for ranking search results, the apparatus including: a search result acquisition module, configured to obtain a plurality of search results corresponding to a target search question; a feature extraction module, configured to input the target search question and the corresponding search results into a pre-trained ranking model and to extract semantic feature vectors of the target search question and of the search results through the ranking model; and a ranking result determination module, configured to determine a ranking result of the plurality of search results based on the semantic feature vectors of the target search question and of the plurality of search results. The ranking model is built on an autoencoding language model and is trained with a plurality of training data items, each including a sample question, a positive sample corresponding to the sample question, and a plurality of negative samples; the positive sample contains the answer to the sample question.
In a third aspect, an embodiment of the present invention provides an electronic device, including a processor and a memory, where the memory stores machine-executable instructions capable of being executed by the processor, and the processor executes the machine-executable instructions to implement the above method for ranking search results.
In a fourth aspect, embodiments of the present invention provide a machine-readable storage medium storing machine-executable instructions that, when invoked and executed by a processor, cause the processor to implement the above method for ranking search results.
The embodiments of the present invention have the following beneficial effects:
After a plurality of search results corresponding to a target search question are obtained, the target search question and the corresponding search results are input into a pre-trained ranking model, and semantic feature vectors of the target search question and of the search results are extracted through the ranking model; a ranking result of the plurality of search results is then determined based on these semantic feature vectors. Because the search results are ranked by a ranking model built on an autoencoding language model and trained with positive samples containing the answers to questions and with a plurality of negative samples, the model can output feature vectors that accurately represent the semantic features of the search results, which improves the reasonableness of the ranking of the search results.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is apparent that the drawings in the following description show some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for ranking search results according to an embodiment of the present invention;
Fig. 2 is a flowchart of the ranking model training process according to an embodiment of the present invention;
Fig. 3 is a flowchart of the training sample construction process provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of the training process provided by an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an apparatus for ranking search results according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
The ranking module is an extremely important module in a search system and greatly influences the user's search experience. A ranking module is typically built on a ranking algorithm. Traditional ranking algorithms include the Boolean model, the vector space model, BM25, and the like. These models depend heavily on exact keyword (term) matching, and when semantic matching must be considered, their effect is often poor; for example, synonyms cannot be matched.
BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language representation model, also known as an autoencoding language model. The advent of BERT has enabled the rapid development of natural language understanding and has influenced many areas. How to introduce frontier techniques such as BERT into the ranking module to greatly improve ranking performance has received much attention. If the similarity of search results (for example, cosine similarity) is computed directly from the vectors produced by BERT and used for ranking, the effect in the target domain is often poor; fine-tuning in the target domain is needed, but labeled data in the target domain is often difficult to obtain. The target domain usually refers to a specific task or a specific form of data; for example, if a search is made within a code base, the code base is the target domain. In addition, BERT is generally pre-trained on corpus data from the general domain through the Masked Language Model (MLM) and Next Sentence Prediction (NSP) tasks, and a BERT model trained in this way is not directly suitable for ranking.
When there is no labeled data in the target domain, there are two general ways to introduce a deep learning model such as BERT into content ranking: (1) compute representations of target-domain text with BERT, and then compute vector similarities for ranking; (2) perform unsupervised learning in the target domain to obtain a corresponding deep learning model, then compute text vector representations for the target domain and rank according to vector similarity; a more effective method of this kind is SimCSE, which works well on many benchmark datasets and is more than 10 percentage points better than method (1). The first way yields a poor ranking; the second way requires training a deep learning model on target-domain data, but such data is difficult to obtain, so the training efficiency of the model is low. On this basis, the search result ranking method and apparatus and the electronic device provided by the embodiments of the present invention can be applied to the ranking of search results in various search scenarios.
An embodiment of the present invention provides a method for ranking search results, which, as shown in Fig. 1, includes the following steps:
Step S102: obtain a plurality of search results corresponding to a target search question.
The target search question may belong to the general domain or to a target domain. The general domain generally corresponds to general text search, such as searching through the Google browser, where the target search question may be "what is the weather in Beijing today". The target domain may correspond to a specific task or a specific form of data, such as a search within a code base, where the target search question may consist of the name and parameters of a function to be searched.
The search results corresponding to the target search question can be obtained by a search engine searching a preset database through a preset search algorithm. The search algorithm may include sequential search, binary search, a semantic recall algorithm, and the like, and can be chosen according to the target search question and the data characteristics of the corresponding domain.
When only one search result is obtained, no ranking is required. When a plurality of search results are obtained, the search results most relevant to the target search question need to be displayed in a conspicuous page position that the user will see. For example, after the user searches for the target question through the Google browser, the search results are displayed on the results page in order of relevance to the target search question from high to low, which requires the search results to be ranked.
Step S104: input the target search question and the corresponding plurality of search results into a pre-trained ranking model, and extract semantic feature vectors of the target search question and of the plurality of search results through the ranking model. The ranking model is built on an autoencoding language model and is trained with a plurality of training data items, each including a sample question, a positive sample corresponding to the sample question, and a plurality of negative samples; the positive sample contains the answer to the sample question.
Because the ranking model is built on an autoencoding language model, which is usually used for the next-sentence-prediction task, training data and a loss function corresponding to the task of outputting accurate and distinguishable semantic features for positive and negative samples must be used so that the ranking model can achieve this task goal.
To improve the accuracy of the semantic features output by the ranking model, a large amount of training data needs to be constructed. So that the similarity computed from the semantic feature vectors of the positive sample and the question indicates a high degree of fit between the positive sample and the sample question, the positive sample in the training data must contain the answer to the sample question. For the ranking model to learn the difference between positive and negative samples, a plurality of negative samples corresponding to the sample question need to be constructed, so that the ranking model can determine the differences between positive and negative samples of varying degrees more accurately.
When a public data set with a large amount of data is selected to construct the training data, a plurality of document data items highly correlated with a sample question can be found in the public data set by a search algorithm and used as samples: the sample that includes the answer to the sample question is the positive sample, and the negative samples do not include the answer. The negative sample with the highest semantic similarity to the sample question can be used as the difficult negative sample, so that the ranking model learns the nuance between the positive sample and the difficult negative sample more accurately.
During training, the loss value derived from the difference between the positive sample and the difficult negative sample can be given a larger weight, while the loss value derived from the difference between the positive sample and the remaining negative samples can be given a smaller weight. Adjusting the parameters of the initial model through the loss value enlarges the distance between the semantic feature vectors of the positive and negative samples output by the initial model. After the parameters have been adjusted many times, the loss value converges, and the trained initial model can then be determined as the ranking model.
During search, the plurality of search results can be ranked by the ranking model. After the target search question and the corresponding search results are input into the trained ranking model, semantic feature vectors of the target search question and of the search results are extracted through the model. Among the search results, the higher the relevance to the target search question, the higher the correlation between the semantic feature vector of that search result and the semantic feature vector of the target search question, so ranking can be performed by this correlation.
Step S106: determine the ranking result of the plurality of search results based on the semantic feature vectors of the target search question and of the plurality of search results.
Specifically, the distance between the semantic feature vector of the target search question and the semantic feature vector of each search result may be used as the relevance parameter between the target search question and that search result; the smaller the distance, the higher the relevance. The angle between the feature vectors, a transformation matrix between them, or similar inter-vector quantities can also be computed, and the relevance parameter can be derived from such quantities according to some rule. In the ranking, search results whose relevance parameter indicates high relevance to the target search question are placed first, thereby generating the ranking result of the plurality of search results.
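As a minimal illustration of this step (the patent itself provides no code; the function and variable names below are hypothetical), the following Python sketch ranks search results by the cosine similarity between the question vector and each result vector:

```python
import numpy as np

def rank_results(question_vec: np.ndarray, result_vecs: list[np.ndarray]) -> list[int]:
    """Return the indices of the search results, ordered by descending
    cosine similarity to the target search question."""
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    scores = [cosine(question_vec, v) for v in result_vecs]
    # Higher semantic relevance (larger cosine similarity) is ranked first.
    return sorted(range(len(result_vecs)), key=lambda i: scores[i], reverse=True)
```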
After a plurality of search results corresponding to a target search question are obtained, the target search question and the corresponding search results are input into the pre-trained ranking model, and semantic feature vectors of the target search question and of the search results are extracted through the ranking model; the ranking result of the plurality of search results is then determined based on these semantic feature vectors. Because the search results are ranked by a ranking model built on an autoencoding language model and trained with positive samples containing the answers to questions and with a plurality of negative samples, the model can output feature vectors that accurately represent the semantic features of the search results, which improves the reasonableness of the ranking.
The following embodiments provide an implementation of training the ranking model.
The ranking model is generally obtained by training an initial model built on an autoencoding language model. As shown in Fig. 2, the ranking model can be trained in the following manner:
Step S202: determine training data from pre-constructed sample data. The training data includes a sample question, a positive sample corresponding to the sample question, and a plurality of negative samples; the positive sample contains the answer to the sample question.
The above sample data may be constructed in the following manner:
(1) Search a pre-acquired public data set for document data containing a question and the answer to the question, determine the found document data as the positive sample corresponding to the question, and determine the question as a sample question; the public data set includes a plurality of document data items.
Specifically, a public reading comprehension data set, or other data containing document data with questions and answers, may be selected. The amount of data in these data sets needs to be sufficiently large, the question types rich, and the degree of match between questions and answers high.
(2) Determine the similarity between the question and the document data in the public data set other than the positive sample.
Specifically, a coarse-grained semantic understanding model obtained by fine-tuning BERT may be used to encode the question and the plurality of document data items in the public data set and to determine their coarse-grained semantic feature vectors; the similarity between the question and each item of document data other than the positive sample is then calculated from the coarse-grained semantic feature vector of that document data and the coarse-grained semantic feature vector of the question. The higher the similarity, the more relevant the document data is to the question and the closer it is to the positive sample.
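A sketch of this encoding and similarity computation is given below, using the Hugging Face transformers library with masked mean pooling; the checkpoint name is only a placeholder for whatever BERT model has been fine-tuned as the coarse-grained semantic understanding model:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# "bert-base-chinese" is a placeholder; in practice this would be the
# BERT checkpoint fine-tuned as the coarse-grained semantic understanding model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
encoder = AutoModel.from_pretrained("bert-base-chinese")

@torch.no_grad()
def encode(texts: list[str]) -> torch.Tensor:
    """Encode texts into coarse-grained semantic feature vectors (masked mean pooling)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state        # (B, L, H)
    mask = batch["attention_mask"].unsqueeze(-1)       # (B, L, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

question_vec = encode(["sample question"])                 # (1, H)
doc_vecs = encode(["document 1 text", "document 2 text"])  # (2, H)
# Similarity between the question and each document, as in step (2).
sims = torch.nn.functional.cosine_similarity(question_vec, doc_vecs)
```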
(3) Determine a plurality of document data items in the public data set, other than the positive sample, that meet preset conditions as the plurality of negative samples corresponding to the sample question. The preset conditions include: the document data does not contain the answer to the question, and the similarity of the document data satisfies the set similarity ranking condition.
Specifically, for each item of document data in the public data set other than the positive sample, first judge whether it includes the answer to the question; if not, determine it as a candidate negative sample. Then rank the candidate negative samples by similarity from high to low to obtain a ranking result, and determine the candidate negative samples whose position in the ranking result is at or before a preset target position as the negative samples corresponding to the question. Correspondingly, the similarity ranking condition is that the position of the document data in the ranking result generated from the document-data similarities is less than or equal to the preset target position. The target position may be N, and this selection principle is also referred to as TopN.
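A minimal sketch of this filter-then-TopN selection follows (hypothetical names; the answer-containment check is simplified to a substring test, which is an assumption rather than the patent's stated method):

```python
def select_negatives(answer: str, candidates: list[str],
                     sims: list[float], top_n: int) -> list[str]:
    """Keep candidate documents that do NOT contain the answer, then take
    the top_n candidates with the highest similarity (the TopN principle)."""
    # Substring containment is a simplification of "judging whether the
    # document data includes the answer to the question".
    pool = [(doc, s) for doc, s in zip(candidates, sims) if answer not in doc]
    pool.sort(key=lambda pair: pair[1], reverse=True)  # similarity, high to low
    return [doc for doc, _ in pool[:top_n]]
```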
The negative samples can be divided into difficult negative samples and simple negative samples. The difficult negative sample is the negative sample with the highest similarity to the question among the plurality of negative samples; the simple negative samples are the negative samples other than the difficult negative sample.
Step S204: input the training data into the initial model, and determine the semantic features of the sample question, the positive sample, and the negative samples in the training data through the initial model.
Step S206: calculate the loss value of the initial model based on the semantic feature vectors of the sample question, the positive sample, and the negative samples.
For the positive and negative samples, the similarity between the positive sample and the question and the similarity between each negative sample and the question can be calculated in turn, and the loss value of the initial model determined from these similarities and from weights set according to the positions of the negative samples in the ranking result. Alternatively, two loss values can be calculated from the difficult and simple negative samples into which the negative samples are divided: a first loss value can be calculated from the semantic feature vectors of the sample question, the positive sample, and the difficult negative sample, for which the loss function of a pairwise algorithm can be chosen; a second loss value can be calculated from the semantic feature vectors of the sample question, the positive sample, and the simple negative samples, for which the loss function of a contrastive learning algorithm can be chosen; finally, the loss value of the initial model is calculated from the first loss value with its corresponding first preset weight and the second loss value with its corresponding second preset weight. The first and second preset weights may be set according to experiments or historical experience.
Step S208: update the model parameters of the initial model based on the loss value.
Step S210: judge whether the loss value has converged; if not, execute step S202; if so, execute step S212.
Step S212: determine the initial model after loss convergence as the ranking model.
By training the initial model with training data constructed in the above manner and adjusting its parameters through the two loss values, and then ranking search results with the resulting ranking model, a more reasonable ranking result can be obtained.
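The loop of steps S202–S212 can be sketched as follows (a minimal sketch, assuming a PyTorch model and a `compute_loss` callback implementing steps S204–S206; all names and hyperparameters are assumptions):

```python
import torch

def train_ranking_model(model, data_loader, compute_loss,
                        lr=2e-5, max_epochs=10, tol=1e-4):
    """Train until the loss converges (Fig. 2, steps S202-S212)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    prev_total = float("inf")
    for _ in range(max_epochs):
        total = 0.0
        for question, positive, negatives in data_loader:              # step S202
            loss = compute_loss(model, question, positive, negatives)  # S204-S206
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                                           # step S208
            total += loss.item()
        if abs(prev_total - total) < tol:                              # step S210
            break                                                      # converged
        prev_total = total
    return model                                                       # step S212
```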
The following embodiments provide an implementation for determining the ranking result of a plurality of search results based on the semantic feature vectors of the target search question and of the plurality of search results.
After the search results are obtained, it is generally desirable that results more relevant to the target search question be displayed first, i.e., earlier in the ranking result. Specifically, for each search result, the semantic relevance between the target search question and that result is calculated from their semantic feature vectors; the semantic relevance can be one of the angle between the two vectors, the vector distance, or a transformation matrix between the vectors, or a combination of several of these according to set weights. The search results are then ranked by semantic relevance from high to low to obtain the ranking result, in which results more relevant to the target search question are placed earlier.
An embodiment of the present invention further provides another search result ranking method, implemented on the basis of the method shown in Fig. 1. The idea is to construct a large number of high-quality training samples in the general domain and, on this basis, to design a training objective more consistent with ranking, so that the ranking model trained in this way can be migrated to the target domain, achieves better performance there than other strong baseline models, and obtains better test performance. The method greatly improves search performance and saves manpower and material resources.
The technical problems solved by the method can be summarized as follows:
(1) Selection of general-domain data: what data from the general domain should be selected to achieve better migration to the target domain?
(2) Construction of training samples (equivalent to the "sample data" above): how can high-quality training samples be constructed so that the model learns more knowledge and is strongly robust?
(3) Training objective: what training objective should be designed for the content ranking task to achieve better ranking performance?
In view of these technical problems, the main work and innovations of the invention are as follows:
(1) General-domain data selection: to improve the migration capability of the model, the open reading comprehension data set DuReader is selected. It has three main characteristics: a. the data comes from real scenarios, the answers are manually produced, and the quality is high; b. the question types are rich; c. the data size is huge.
(2) Construction of training samples: to construct high-quality positive and negative samples, the document containing the answer to a question is selected as the positive sample. To make the negative samples more difficult, candidate documents are first constructed by combining semantic recall and keyword recall, and high-quality negative samples are then screened out through a threshold and a judgment of whether they include the answer; the specific flow is shown in Fig. 3.
(3) Training objective: as shown in Fig. 4, in view of the characteristics of ranking, the model is trained with a pairwise (pair-based) objective, so that it can capture the subtle difference between the positive sample and the difficult negative sample; in addition, to take the simple negative samples into account, contrastive learning is also introduced when the loss value is computed. By combining the two training objectives, the method provided by the invention achieves good ranking performance.
The selection of general-domain data and the construction of training samples may be collectively referred to as data construction. The emphasis of model training is the design of the training objective, i.e., the pairwise loss and the contrastive loss are introduced to improve the ranking capability of the model. Specific embodiments of these key techniques and the resulting technical effects are described below.
(1) Selection of general-domain data:
Because the target domain has no labeled data set, a high-quality public data set that is large in scale, rich in question types, manually labeled, and close to real scenarios can be selected as the data source for constructing training samples. The benefits of this approach are: a. general-domain data contains broad knowledge, which aids migration; b. manual labeling facilitates the construction of high-quality training samples and reduces the noise introduced into training.
(2) Construction of training samples:
based on the characteristics of the ranking task, because no document ranking data set exists, positive and negative sample pairs can be selected to be constructed, and the partial ranking relation of the documents can be learned in the form of pairwise. Where the quality of the positive and negative samples will largely determine the final performance of the model. For this reason, the document containing the answer to the question is selected as a positive sample because the document containing the answer is basically strongly correlated with the question. The negative sample construction process comprises the following steps: a recalling the candidate document through semantics and keywords, wherein the semantic recalling mode is as follows: the method comprises the steps of (1) finely adjusting a coarse-grained semantic understanding model of BERT, then coding a problem and a document library, and finally recalling based on vector similarity; and B, sorting the candidate documents according to the similarity threshold of the candidate documents and the positive sample and whether the documents contain answers, and selecting top documents as negative samples, namely the documents ranked at the first position in the sorting result. By the above-mentioned procedure, it is possible to obtain negative samples which are difficult, i.e. the samples and the problems are also relatively correlated but the correlation is weaker than the positive samples. Finally, about 40 million high quality positive and negative sample pairs can be obtained.
(3) Model training:
As shown in Fig. 4, the method trains with an objective combining the pairwise loss and the contrastive loss. To fit the characteristics of the ranking task, a loss dominated by the pairwise term can be chosen, specifically: $Loss_p = \max(0,\ -[\cos(q, doc^+) - \cos(q, doc^-)] + margin)$, where margin is 0.4, and q, doc+ and doc- denote the question, the positive sample, and the difficult negative sample, respectively. In addition, to take simple negative samples into account, contrastive learning is introduced: all documents other than doc+ in the same batch of samples are treated as negative samples and denoted D-. Specifically, the contrastive-learning loss value is calculated with the following formula:
$$Loss_c = -\log \frac{e^{\cos(q,\,doc^+)/t}}{e^{\cos(q,\,doc^+)/t} + \sum_{doc^- \in D^-} e^{\cos(q,\,doc^-)/t}}$$
where t is a temperature parameter whose purpose is to enlarge the difference between positive and negative samples so that the model converges faster. Finally, the invention weights the two losses to obtain the final training objective.
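A sketch of the combined training objective follows, implementing the two formulas above in PyTorch; the weights w1 and w2 (the "first" and "second" preset weights) and the temperature value are assumptions, since the patent only fixes margin = 0.4:

```python
import torch
import torch.nn.functional as F

def combined_loss(q: torch.Tensor, doc_pos: torch.Tensor, doc_neg: torch.Tensor,
                  batch_docs: torch.Tensor, margin: float = 0.4, t: float = 0.05,
                  w1: float = 1.0, w2: float = 0.5) -> torch.Tensor:
    """Weighted sum of the pairwise loss and the contrastive loss described above.

    q, doc_pos, doc_neg: (H,) vectors of the question, positive sample, and
    difficult negative sample; batch_docs: (N, H) vectors of all documents in
    the batch, with doc_pos at index 0 and the in-batch negatives D- after it.
    """
    # Pairwise loss: Loss_p = max(0, -[cos(q, doc+) - cos(q, doc-)] + margin)
    cos_pos = F.cosine_similarity(q, doc_pos, dim=0)
    cos_neg = F.cosine_similarity(q, doc_neg, dim=0)
    loss_p = torch.clamp(-(cos_pos - cos_neg) + margin, min=0.0)

    # Contrastive loss: -log softmax of cos(q, doc+)/t over all batch documents.
    logits = F.cosine_similarity(q.unsqueeze(0), batch_docs, dim=1) / t  # (N,)
    loss_c = F.cross_entropy(logits.unsqueeze(0), torch.tensor([0]))

    return w1 * loss_p + w2 * loss_c
```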
Once trained, the model can be used for ranking in the target domain.
The method constructs training samples from general-domain data for training and thereby achieves good performance in the target domain. This not only improves the search experience but also reduces the workload of labeling data, saving a large amount of manpower and material resources. The method has not only been deployed successfully but also performs better than a strong baseline; the specific comparison is as follows:
model (model) Effect
bert_pairwise(ours) top1=0.74top2=0.87top3=0.9
rbt_pairwise(ours) top1=0.68top2=0.82top3=0.87
rbt_simcse(baseline) top1=0.48top2=0.62top3=0.70
The evaluation test set is manually labeled, the metric is TopN, and rbt denotes a lightweight BERT. Comparing the last two rows of the table shows that the method provided by the embodiment of the present invention performs much better than the strong baseline; moreover, the method does not use target-domain data, whereas the strong baseline model was trained unsupervised on target-domain data, which further reflects the strong migration capability and robustness of the method.
Corresponding to the above method embodiment, refer to the search result ranking apparatus shown in Fig. 5. The apparatus includes:
a search result acquisition module 502, configured to obtain a plurality of search results corresponding to a target search question;
a feature extraction module 504, configured to input the target search question and the corresponding plurality of search results into a pre-trained ranking model, and to extract semantic feature vectors of the target search question and of the plurality of search results through the ranking model;
a ranking result determination module 506, configured to determine a ranking result of the plurality of search results based on the semantic feature vectors of the target search question and of the plurality of search results. The ranking model is built on an autoencoding language model and is trained with a plurality of training data items, each including a sample question, a positive sample corresponding to the sample question, and a plurality of negative samples; the positive sample contains the answer to the sample question.
After a plurality of search results corresponding to a target search question are obtained, the target search question and the corresponding search results are input into the pre-trained ranking model, and semantic feature vectors of the target search question and of the search results are extracted through the ranking model; the ranking result of the plurality of search results is then determined based on these semantic feature vectors. Because the search results are ranked by a ranking model built on an autoencoding language model and trained with positive samples containing the answers to questions and with a plurality of negative samples, the model can output feature vectors that accurately represent the semantic features of the search results, which improves the reasonableness of the ranking.
The apparatus further includes a model training module, configured to: determine training data from pre-constructed sample data, the training data including a sample question, a positive sample corresponding to the sample question, and a plurality of negative samples, the positive sample containing the answer to the sample question; input the training data into an initial model built on an autoencoding language model, and determine the semantic features of the sample question, the positive sample, and the negative samples in the training data through the initial model; calculate a loss value of the initial model based on these semantic features; update the model parameters of the initial model based on the loss value; and continue to execute the step of determining training data from the sample data until the loss value converges, determining the initial model after loss convergence as the ranking model.
The apparatus further includes a sample construction module, configured to: search a pre-acquired public data set for document data containing a question and the answer to the question, determine the found document data as the positive sample corresponding to the question, and determine the question as a sample question, the public data set including a plurality of document data items; determine the similarity between the question and the document data in the public data set other than the positive sample; and determine a plurality of document data items, other than the positive sample, that meet preset conditions as the plurality of negative samples corresponding to the sample question. The preset conditions include: the document data does not contain the answer to the question, and the similarity of the document data satisfies the set similarity ranking condition.
The sample construction module is further configured to: encode the question and the plurality of document data items in the public data set based on a preset coarse-grained semantic understanding model, and determine coarse-grained semantic feature vectors of the question and of the document data; and calculate the similarity between the question and each item of document data other than the positive sample from the coarse-grained semantic feature vector of that document data and the coarse-grained semantic feature vector of the question.
The similarity ranking condition includes: the position of the document data, among the document data in the public data set other than the positive sample, in a ranking result generated from the document-data similarities is less than or equal to a preset target position. The sample construction module is further configured to: for each item of document data in the public data set other than the positive sample, judge whether it includes the answer to the question; if not, determine it as a candidate negative sample; rank the candidate negative samples by similarity from high to low to obtain a ranking result; and determine the candidate negative samples whose position in the ranking result is less than or equal to the preset target position as the negative samples corresponding to the question.
The negative samples include a difficult negative sample and simple negative samples. The difficult negative sample is the negative sample with the highest similarity to the question among the plurality of negative samples; the simple negative samples are the negative samples other than the difficult negative sample. The similarity between a negative sample and the positive sample is calculated from the coarse-grained semantic feature vector of the negative sample and the coarse-grained semantic feature vector of the positive sample. The model training module is further configured to: calculate a first loss value based on the semantic feature vectors of the sample question, the positive sample, and the difficult negative sample; calculate a second loss value based on the semantic feature vectors of the sample question, the positive sample, and the simple negative samples; and calculate the loss value of the initial model from the first loss value with its corresponding first preset weight and the second loss value with its corresponding second preset weight.
The model training module is further configured to compute the semantic feature vectors of the sample question, the positive sample, and the difficult negative sample with the loss function of a pairwise algorithm to obtain the first loss value.
The model training module is further configured to compute the semantic feature vectors of the sample question, the positive sample, and the simple negative samples with the loss function of a contrastive learning algorithm to obtain the second loss value.
The ranking result determination module is further configured to: for each search result, calculate the semantic relevance between the target search question and that search result based on their semantic feature vectors; and rank the plurality of search results by semantic relevance from high to low to obtain the ranking result of the plurality of search results.
This embodiment further provides an electronic device, including a processor and a memory, where the memory stores machine-executable instructions capable of being executed by the processor, and the processor executes the machine-executable instructions to implement the above method for ranking search results, for example:
obtain a plurality of search results corresponding to a target search question; input the target search question and the corresponding plurality of search results into a pre-trained ranking model, and extract semantic feature vectors of the target search question and of the search results through the ranking model; and determine a ranking result of the plurality of search results based on the semantic feature vectors of the target search question and of the plurality of search results. The ranking model is built on an autoencoding language model and is trained with a plurality of training data items, each including a sample question, a positive sample corresponding to the sample question, and a plurality of negative samples; the positive sample contains the answer to the sample question.
Because the search results are ranked by a ranking model built on an autoencoding language model and trained with positive samples containing the answers to questions and with a plurality of negative samples, the model can output feature vectors that accurately represent the semantic features of the search results, which improves the reasonableness of the ranking.
Optionally, the ranking model is trained in the following manner; determining training data from pre-constructed sample data; the training data comprises a sample problem, a positive sample corresponding to the sample problem and a plurality of negative samples; the positive sample includes answers to the sample questions; inputting training data into an initial model, and determining semantic features of sample problems, positive samples and negative samples in the training data through the initial model; establishing an initial model based on a self-coding language model; calculating a loss value of the initial model based on the sample problem and semantic features of the positive sample and the negative sample; updating model parameters of the initial model based on the loss values; and continuing to execute the step of determining the training data from the sample data until the loss value is converged, and determining the initial model after the loss value is converged as a sequencing model.
Optionally, the sample data is constructed in the following manner, including: searching document data containing the problems and answers of the problems from a pre-acquired public data set, determining the searched document data as a positive sample corresponding to the problems, and determining the problems as sample problems; the public data set includes a plurality of document data; determining the similarity between the positive sample and the document data except the positive sample in the public data set; determining a plurality of pieces of document data meeting preset conditions in the document data of the positive samples in the public data set as a plurality of negative samples corresponding to the sample problem; the preset conditions include: the answer to the question is not included in the document data, and the similarity between the positive sample and the document data satisfies the set similarity sorting condition.
Optionally, the step of determining the similarity between the positive sample and the document data of the public data set except the positive sample includes: coding a plurality of document data in the problem and public data set based on a preset coarse-grained semantic understanding model, and determining coarse-grained semantic feature vectors of the problem and the document data; and calculating the similarity between the positive sample and the document data of the positive sample in the public data set based on the coarse-grained semantic feature vector of the document data of the positive sample in the public data set and the coarse-grained semantic feature vector of the problem.
Optionally, the similarity ranking condition includes: the sorting position of the document data of the positive-removing sample in the public data set in a sorting result generated based on the similarity of the document data is less than or equal to a preset target position; the method for determining the plurality of document data meeting the preset conditions in the document data of the positive samples in the public data set as a plurality of negative samples corresponding to the problem comprises the following steps: judging whether the document data comprises answers of the questions or not aiming at each document data of the positive removing sample in the public data set; if not, determining the document data as a negative sample to be selected; sequencing the negative samples to be selected according to a sequencing mode of similarity from high to low to obtain a sequencing result; and determining the negative sample to be selected with the sorting position in the sorting result smaller than or equal to the preset target position as the negative sample corresponding to the problem.
Optionally, the negative samples include a difficult negative sample and a simple negative sample; the difficult negative sample is the negative sample with the highest similarity to the problem in the plurality of negative samples; the simple negative samples comprise negative samples of the plurality of negative samples except the difficult negative samples; the similarity between the negative sample and the positive sample is calculated based on the coarse-grained semantic feature vector of the negative sample and the coarse-grained semantic feature vector of the positive sample; calculating a loss value of the initial model based on the semantic feature vectors of the sample problem, the positive sample and the negative sample, comprising: calculating a first loss value based on the semantic feature vectors of the sample problem, the positive sample and the difficult negative sample; calculating a second loss value based on the semantic feature vectors of the sample problem, the positive sample and the simple negative sample; and calculating the loss value of the initial model based on the first loss value and the corresponding first preset weight, the second loss value and the corresponding second preset weight.
Optionally, the step of calculating the first loss value based on the semantic feature vectors of the sample question, the positive sample, and the hard negative sample includes: applying a pairwise ranking loss function to the semantic feature vectors of the sample question, the positive sample, and the hard negative sample to obtain the first loss value.
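The patent does not fix a particular pairwise loss, so the sketch below uses a common margin-based (triplet-style) pairwise ranking loss as one plausible instantiation; the margin value is a placeholder.

import torch
import torch.nn.functional as F

def pairwise_loss(q_vec, pos_vec, hard_neg_vec, margin=1.0):
    # Require the question-positive similarity to exceed the
    # question-hard-negative similarity by at least the margin.
    s_pos = F.cosine_similarity(q_vec, pos_vec, dim=-1)
    s_neg = F.cosine_similarity(q_vec, hard_neg_vec, dim=-1)
    return torch.clamp(margin - (s_pos - s_neg), min=0).mean()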
Optionally, the step of calculating the second loss value based on the semantic feature vectors of the sample question, the positive sample, and the easy negative samples includes: applying a contrastive learning loss function to the semantic feature vectors of the sample question, the positive sample, and the easy negative samples to obtain the second loss value.
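Again, no specific contrastive loss is named; an InfoNCE-style objective is a common choice and is sketched below, together with the weighted combination described earlier (pairwise_loss is reused from the previous sketch; the weights and temperature are placeholders).

import torch
import torch.nn.functional as F

def contrastive_loss(q_vec, pos_vec, easy_neg_vecs, temperature=0.05):
    # The positive must win a softmax over the positive plus all easy negatives.
    sims = torch.cat([
        F.cosine_similarity(q_vec, pos_vec, dim=-1).reshape(1),
        F.cosine_similarity(q_vec.unsqueeze(0), easy_neg_vecs, dim=-1),
    ]) / temperature
    return -F.log_softmax(sims, dim=0)[0]

def total_loss(q_vec, pos_vec, hard_neg_vec, easy_neg_vecs, w1=0.5, w2=0.5):
    # Weighted sum of the two loss terms with the preset weights w1 and w2.
    return (w1 * pairwise_loss(q_vec, pos_vec, hard_neg_vec)
            + w2 * contrastive_loss(q_vec, pos_vec, easy_neg_vecs))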
Optionally, the step of determining the ranking result of the plurality of search results based on the target search question and the semantic feature vectors of the plurality of search results includes: for each search result, calculating the semantic relevance between the target search question and the search result based on their semantic feature vectors; and ranking the plurality of search results by semantic relevance from high to low to obtain the ranking result of the plurality of search results.
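An illustrative end-to-end ranking step is sketched below; encode stands in for the trained ranking model's feature extractor and is an assumption of the example.

import numpy as np

def rank_results(question, results, encode):
    q_vec = encode(question)
    def relevance(result):
        r_vec = encode(result)
        # Cosine similarity as the semantic relevance score.
        return float(np.dot(q_vec, r_vec)
                     / (np.linalg.norm(q_vec) * np.linalg.norm(r_vec)))
    # Rank search results by semantic relevance, highest first.
    return sorted(results, key=relevance, reverse=True)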
Referring to FIG. 6, the electronic device includes a processor 100 and a memory 101; the memory 101 stores machine-executable instructions that can be executed by the processor 100, and the processor 100 executes those instructions to implement the search result ranking method described above.
The electronic device shown in FIG. 6 further includes a bus 102 and a communication interface 103; the processor 100, the communication interface 103, and the memory 101 are connected via the bus 102.
The memory 101 may include a high-speed random access memory (RAM), and may also include a non-volatile memory, such as at least one magnetic-disk memory. A communication connection between this network element and at least one other network element is implemented through at least one communication interface 103 (wired or wireless), over the Internet, a wide area network, a local area network, a metropolitan area network, or the like. The bus 102 may be an ISA bus, a PCI bus, an EISA bus, or the like, and may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one double-headed arrow is shown in FIG. 6, but this does not mean that there is only one bus or one type of bus.
The processor 100 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 100 or by instructions in the form of software. The processor 100 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and can implement or execute the methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the embodiments of the present invention may be executed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM or EPROM, or a register. The storage medium is located in the memory 101, and the processor 100 reads the information in the memory 101 and completes the steps of the method of the foregoing embodiments in combination with its hardware.
Embodiments of the present invention also provide a machine-readable storage medium storing machine-executable instructions that, when invoked and executed by a processor, cause the processor to implement the above search result ranking method.
The computer program product of the search result ranking method, apparatus, and electronic device provided in the embodiments of the present invention includes a computer-readable storage medium storing program code, and the instructions included in the program code can be used to execute the method described in the foregoing method embodiments, for example:
obtaining a plurality of search results corresponding to a target search question; inputting the target search question and the corresponding search results into a pre-trained ranking model, and extracting semantic feature vectors of the target search question and of each search result through the ranking model; and determining a ranking result of the plurality of search results based on the target search question and the semantic feature vectors of the search results. The ranking model is built on a self-coding (auto-encoding) language model and is trained with a plurality of pieces of training data, each including a sample question, a positive sample corresponding to the sample question, and a plurality of negative samples; the positive sample contains the answer to the sample question.
Because the search results are ranked by a ranking model that is built on a self-coding language model and trained with positive samples and multiple negative samples of question answers, the model can output feature vectors that accurately represent the semantics of the search results, which improves the reasonableness of the ranking.
It is clear to those skilled in the art that, for convenience and brevity of description, for the specific working processes of the system and the apparatus described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.
In addition, in the description of the embodiments of the present invention, unless otherwise explicitly specified or limited, the terms "mounted", "connected", and "coupled" are to be construed broadly: a connection may, for example, be fixed, detachable, or integral; it may be mechanical or electrical; and it may be direct, indirect through an intermediate medium, or internal communication between two elements. For those skilled in the art, the specific meanings of the above terms in the present invention can be understood according to the specific situation.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such understanding, the part of the technical solution of the present invention that in essence contributes beyond the prior art may be embodied in the form of a software product; the software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
In the description of the present invention, it should be noted that terms such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", and "outer" indicate orientations or positional relationships based on those shown in the drawings; they are used only for convenience and simplicity of description, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation, so they should not be construed as limiting the present invention. Furthermore, the terms "first", "second", and "third" are used for descriptive purposes only and should not be construed as indicating or implying relative importance.
Finally, it should be noted that the foregoing embodiments are merely specific implementations used to illustrate, rather than limit, the technical solutions of the present invention, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone familiar with the art may still modify the technical solutions described in the foregoing embodiments, readily conceive of changes, or make equivalent substitutions of some technical features within the technical scope of the present disclosure; such modifications, changes, or substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present invention and shall all be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (12)

1. A method for ranking search results, the method comprising:
obtaining a plurality of search results corresponding to a target search question;
inputting the target search question and the corresponding plurality of search results into a pre-trained ranking model, and extracting semantic feature vectors of the target search question and the search results through the ranking model; wherein the ranking model is built based on a self-coding language model; the ranking model is obtained by training with a plurality of pieces of training data comprising a sample question, a positive sample corresponding to the sample question, and a plurality of negative samples; and the positive sample comprises an answer to the sample question; and
determining a ranking result of the plurality of search results based on the target search question and semantic feature vectors of the plurality of search results.
2. The method of claim 1, wherein the ranking model is trained by:
determining training data from pre-constructed sample data; wherein the training data comprises a sample question, a positive sample corresponding to the sample question, and a plurality of negative samples; and the positive sample comprises an answer to the sample question;
inputting the training data into an initial model, and determining semantic features of the sample question, the positive sample, and the negative samples in the training data through the initial model; wherein the initial model is built based on a self-coding language model;
calculating a loss value of the initial model based on the semantic features of the sample question, the positive sample, and the negative samples; and
updating model parameters of the initial model based on the loss value, and continuing to execute the step of determining training data from the sample data until the loss value converges, and determining the initial model whose loss value has converged as the ranking model.
3. The method of claim 2, wherein the sample data is constructed by:
searching a pre-acquired public data set for document data containing a question and an answer to the question, determining the found document data as a positive sample corresponding to the question, and determining the question as a sample question; wherein the public data set comprises a plurality of pieces of document data;
determining the similarity between the positive sample and the document data in the public data set other than the positive sample; and
determining a plurality of pieces of document data that meet preset conditions, among the document data in the public data set other than the positive sample, as a plurality of negative samples corresponding to the sample question; wherein the preset conditions comprise: the document data does not contain the answer to the question, and the similarity between the positive sample and the document data satisfies a set similarity ranking condition.
4. The method according to claim 3, wherein the step of determining the similarity between the positive sample and the document data in the public data set other than the positive sample comprises:
encoding the question and the plurality of pieces of document data in the public data set based on a preset coarse-grained semantic understanding model, and determining coarse-grained semantic feature vectors of the question and of the document data; and
calculating the similarity between the positive sample and the document data in the public data set other than the positive sample, based on the coarse-grained semantic feature vectors of that document data and the coarse-grained semantic feature vector of the question.
5. The method of claim 3, wherein the similarity ranking condition comprises: the position of the document data other than the positive sample, in a ranking result generated based on the similarities of the document data, is at or before a preset target position; and
the step of determining the plurality of pieces of document data that meet the preset conditions, among the document data in the public data set other than the positive sample, as the plurality of negative samples corresponding to the question comprises:
judging, for each piece of document data in the public data set other than the positive sample, whether the document data contains the answer to the question;
if not, determining the document data as a candidate negative sample;
ranking the candidate negative samples by similarity from high to low to obtain a ranking result; and
determining the candidate negative samples whose positions in the ranking result are at or before the preset target position as the negative samples corresponding to the question.
6. The method of claim 2, wherein the plurality of negative samples comprise a hard negative sample and easy negative samples; the hard negative sample is the negative sample with the highest similarity to the positive sample among the plurality of negative samples; the easy negative samples comprise the negative samples other than the hard negative sample; and the similarity between a negative sample and the positive sample is calculated based on the coarse-grained semantic feature vector of the negative sample and the coarse-grained semantic feature vector of the positive sample;
calculating the loss value of the initial model based on the semantic feature vectors of the sample question, the positive sample, and the negative samples comprises:
calculating a first loss value based on the semantic feature vectors of the sample question, the positive sample, and the hard negative sample;
calculating a second loss value based on the semantic feature vectors of the sample question, the positive sample, and the easy negative samples; and
calculating the loss value of the initial model based on the first loss value and a corresponding first preset weight, and the second loss value and a corresponding second preset weight.
7. The method of claim 6, wherein the step of calculating the first loss value based on the semantic feature vectors of the sample question, the positive sample, and the hard negative sample comprises:
applying a pairwise ranking loss function to the semantic feature vectors of the sample question, the positive sample, and the hard negative sample to obtain the first loss value.
8. The method of claim 6, wherein the step of calculating the second loss value based on the semantic feature vectors of the sample question, the positive sample, and the easy negative samples comprises:
applying a contrastive learning loss function to the semantic feature vectors of the sample question, the positive sample, and the easy negative samples to obtain the second loss value.
9. The method of claim 1, wherein determining a ranking result of the plurality of search results based on the target search question and semantic feature vectors of the plurality of search results comprises:
for each search result, calculating the semantic relevance between the target search question and the search result based on the semantic feature vectors of the target search question and of the search result; and
ranking the plurality of search results by semantic relevance from high to low to obtain the ranking result of the plurality of search results.
10. An apparatus for ranking search results, the apparatus comprising:
a search result obtaining module, configured to obtain a plurality of search results corresponding to a target search question;
a feature extraction module, configured to input the target search question and the corresponding plurality of search results into a pre-trained ranking model, and extract semantic feature vectors of the target search question and the search results through the ranking model; wherein the ranking model is built based on a self-coding language model; the ranking model is obtained by training with a plurality of pieces of training data comprising a sample question, a positive sample corresponding to the sample question, and a plurality of negative samples; and the positive sample comprises an answer to the sample question; and
a ranking result determining module, configured to determine a ranking result of the plurality of search results based on the target search question and the semantic feature vectors of the plurality of search results.
11. An electronic device comprising a processor and a memory, the memory storing machine executable instructions executable by the processor, the processor executing the machine executable instructions to implement the method of ranking search results of any of claims 1-9.
12. A machine-readable storage medium having stored thereon machine-executable instructions which, when invoked and executed by a processor, cause the processor to implement the method of ranking search results of any of claims 1-9.
CN202210774652.5A 2022-07-01 2022-07-01 Search result sorting method and device and electronic equipment Pending CN115328434A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210774652.5A CN115328434A (en) 2022-07-01 2022-07-01 Search result sorting method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210774652.5A CN115328434A (en) 2022-07-01 2022-07-01 Search result sorting method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN115328434A true CN115328434A (en) 2022-11-11

Family

ID=83917163

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210774652.5A Pending CN115328434A (en) 2022-07-01 2022-07-01 Search result sorting method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN115328434A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116842032A (en) * 2023-09-01 2023-10-03 北京永辉科技有限公司 Commodity vector retrieval method and system based on contrast learning and commodity entity
CN117149989A (en) * 2023-11-01 2023-12-01 腾讯科技(深圳)有限公司 Training method for large language model, text processing method and device
CN117149989B (en) * 2023-11-01 2024-02-09 腾讯科技(深圳)有限公司 Training method for large language model, text processing method and device

Similar Documents

Publication Publication Date Title
US10482136B2 (en) Method and apparatus for extracting topic sentences of webpages
CN115328434A (en) Search result sorting method and device and electronic equipment
CN109960724B (en) Text summarization method based on TF-IDF
CN105045875B (en) Personalized search and device
CN111105209B (en) Job resume matching method and device suitable for person post matching recommendation system
WO2016180270A1 (en) Webpage classification method and apparatus, calculation device and machine readable storage medium
CN111368042A (en) Intelligent question and answer method and device, computer equipment and computer storage medium
CN104765769A (en) Short text query expansion and indexing method based on word vector
CN102411563A (en) Method, device and system for identifying target words
CN112100326B (en) Anti-interference question and answer method and system integrating retrieval and machine reading understanding
CN111159361B (en) Method and device for acquiring article and electronic equipment
CN109948154B (en) Character acquisition and relationship recommendation system and method based on mailbox names
CN114036930A (en) Text error correction method, device, equipment and computer readable medium
CN116050397B (en) Method, system, equipment and storage medium for generating long text abstract
CN106407316B (en) Software question and answer recommendation method and device based on topic model
CN112527977B (en) Concept extraction method, concept extraction device, electronic equipment and storage medium
CN112417848A (en) Corpus generation method and device and computer equipment
CN111125295A (en) Method and system for obtaining food safety question answers based on LSTM
CN111159404A (en) Text classification method and device
CN112307182A (en) Question-answering system-based pseudo-correlation feedback extended query method
CN117271792A (en) Method for constructing enterprise domain knowledge base based on large model
CN109815337B (en) Method and device for determining article categories
CN111241271A (en) Text emotion classification method and device and electronic equipment
CN113486670A (en) Text classification method, device and equipment based on target semantics and storage medium
US20050010390A1 (en) Translated expression extraction apparatus, translated expression extraction method and translated expression extraction program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination