CN116501950A - Recall model optimization method, apparatus, device, and storage medium - Google Patents

Info

Publication number
CN116501950A
CN116501950A (application CN202210057510.7A)
Authority
CN
China
Prior art keywords
current
sample
determining
recall model
recall
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210057510.7A
Other languages
Chinese (zh)
Inventor
纪兴光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd filed Critical Beijing Qihoo Technology Co Ltd
Priority to CN202210057510.7A priority Critical patent/CN116501950A/en
Publication of CN116501950A publication Critical patent/CN116501950A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G06F16/9538 Presentation of query results
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention belongs to the technical field of artificial intelligence and discloses a recall model optimization method, apparatus, device, and storage medium. The method comprises the following steps: obtaining a current ranking result of a first preset model; performing relevance scoring on the current ranking result to obtain label information corresponding to the current ranking result; generating optimized sample data according to the current ranking result and the corresponding label information; and completing optimization of the recall model to be optimized according to the optimized sample data. In this way, an existing recall model is optimized: because ranking results produced by other models already exhibit a certain degree of similarity, re-evaluating relevance on top of those results re-labels them and further optimizes the existing recall model, improving both its discrimination and its recall rate.

Description

Recall model optimization method, apparatus, device, and storage medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a recall model optimization method, apparatus, device, and storage medium.
Background
The whole search process can be roughly divided into three parts: recall, coarse ranking, and fine ranking. A semantic recall model computes semantic similarity scores between a query (Query) and candidate titles (Title) and selects the highest-scoring results, enabling more efficient retrieval.
However, the training data of existing semantic recall models consists of raw data mined directly from web pages. Because the collection span of this sample data is extremely wide, the differences in similarity and semantics between positive and negative samples are large, so a recall model trained on such raw data discriminates poorly between texts.
The foregoing is provided merely for the purpose of facilitating understanding of the technical solutions of the present invention and is not intended to represent an admission that the foregoing is prior art.
Disclosure of Invention
The invention mainly aims to provide a recall model optimization method, apparatus, device, and storage medium, in order to solve the technical problem that recall models in the prior art discriminate poorly between texts.
In order to achieve the above object, the present invention provides a recall model optimization method comprising the following steps:
obtaining a current ranking result of a first preset model;
performing relevance scoring on the current ranking result to obtain label information corresponding to the current ranking result;
generating optimized sample data according to the current ranking result and the corresponding label information;
and completing optimization of the recall model to be optimized according to the optimized sample data.
Optionally, determining a search engine address according to a preset address table, where the preset address table records search engine addresses searched using the first preset model;
crawling fine-ranking data from the search engine address with a web crawler;
and filtering the fine-ranking data to obtain the current ranking result.
Optionally, performing relevance scoring on the recommended questions in the current ranking result to obtain relevance scores, where the current ranking result comprises a plurality of recommended questions;
labeling the corresponding recommended question with a positive label when its relevance score is greater than or equal to a preset score;
labeling the corresponding recommended question with a negative label when its relevance score is smaller than the preset score;
and determining the label information corresponding to the current ranking result according to each positive label and each negative label.
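The thresholded labeling described above can be sketched as follows; the function name, data layout, and default threshold are illustrative assumptions, not values fixed by the patent.

```python
def label_ranking_result(recommended, threshold=0.5):
    """Attach a positive/negative label to each recommended question.

    `recommended` is a list of (question, relevance_score) pairs and
    `threshold` is the preset score -- both are illustrative assumptions.
    The collected per-question labels form the ranking result's label info.
    """
    labels = []
    for question, score in recommended:
        # score >= preset score -> positive label; otherwise negative label
        labels.append((question, "positive" if score >= threshold else "negative"))
    return labels
```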
Optionally, obtaining the query question corresponding to the recommended question;
determining a corresponding first target semantic vector according to the query question;
determining a corresponding second target semantic vector according to the recommended question;
calculating the similarity between the first target semantic vector and the second target semantic vector;
and determining the relevance score according to the similarity.
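The similarity step above can be illustrated with cosine similarity between the two semantic vectors; the patent does not fix a particular similarity measure, so cosine here is an assumption.

```python
import math

def cosine_similarity(vec_a, vec_b):
    # Similarity between the query's semantic vector (first target) and the
    # recommended question's semantic vector (second target). Cosine is one
    # common choice; the relevance score can then be derived from it.
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b)
```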
Optionally, obtaining user log information corresponding to the current ranking result;
determining the number of impressions of each recommended question according to the user log information;
and determining the relevance score according to the impression count of each recommended question.
Optionally, determining the total impression count of the current ranking result according to the impression counts;
determining a target weighting coefficient according to the total impression count;
and determining the relevance score of each recommended question according to the target weighting coefficient and its impression count.
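One minimal way to realise the three steps above is to take the weighting coefficient as the reciprocal of the total impression count, so each question's score is its impression share; the patent does not specify the formula, so this form is an assumption.

```python
def relevance_scores_from_impressions(impressions):
    """Derive a relevance score per recommended question from impression counts.

    `impressions` maps recommended question -> number of times it was shown
    (from user logs). Normalising by the total impression count is an
    illustrative choice of "target weighting coefficient".
    """
    total = sum(impressions.values())  # total impression count of the ranking result
    if total == 0:
        return {q: 0.0 for q in impressions}
    weight = 1.0 / total  # target weighting coefficient (assumed form)
    return {q: weight * count for q, count in impressions.items()}
```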
Optionally, determining a query sample according to the current ranking result, where the current ranking result comprises a plurality of recommended questions;
determining a first positive sample and a first negative sample according to each recommended question and the label information;
and generating the optimized sample data according to the query sample, the first positive sample, and the first negative sample.
Optionally, obtaining a second positive sample and a second negative sample, where the second positive sample and the second negative sample are positive and negative samples corresponding to ranking results from batches other than the current ranking result;
obtaining a third negative sample according to the second positive sample and the second negative sample;
and generating the optimized sample data according to the query sample, the first positive sample, the first negative sample, and the third negative sample.
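The sample-assembly steps above can be sketched as building (Query, Title+, Title-) rows, with negatives from the current batch supplemented by titles from other batches; the record structure and names are an illustrative reading of the description, not the patent's exact format.

```python
def build_optimized_samples(query, labeled_titles, other_batch_titles):
    """Assemble (query, positive, negative) training rows.

    `labeled_titles`: list of (title, label) for the current ranking result,
    with label in {"positive", "negative"} (first positive/negative samples).
    `other_batch_titles`: titles taken from other batches' ranking results,
    used as extra negatives (standing in for the "third negative sample").
    """
    positives = [t for t, lab in labeled_titles if lab == "positive"]
    negatives = [t for t, lab in labeled_titles if lab == "negative"]
    negatives = negatives + list(other_batch_titles)  # supplement the negatives
    return [(query, pos, neg) for pos in positives for neg in negatives]
```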
Optionally, determining a training sample and a test sample according to the optimized sample data;
training the recall model to be optimized according to the training sample to obtain a recall model to be tested;
testing the recall model to be tested according to the test sample to obtain a test result;
and when the test result indicates success, completing the optimization of the recall model to be optimized.
Optionally, inputting the training sample into the recall model to be optimized to determine model characterization parameters;
calculating a loss value according to the model characterization parameters;
and adjusting the recall model to be optimized according to the loss value until the model converges, obtaining the recall model to be tested.
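The "adjust until the model converges" loop above can be sketched generically as iterating until the loss stops changing; the helper names, tolerance, and convergence test are all illustrative assumptions.

```python
def train_until_converged(loss_fn, params, step_fn, tol=1e-4, max_epochs=100):
    """Generic sketch of adjusting a model according to its loss until convergence.

    `loss_fn(params)` returns the current loss value; `step_fn(params)` returns
    adjusted parameters (one optimization step). Convergence is declared when
    the loss change falls below `tol` -- an assumed, illustrative criterion.
    """
    prev = loss_fn(params)
    for _ in range(max_epochs):
        params = step_fn(params)
        cur = loss_fn(params)
        if abs(prev - cur) < tol:  # loss no longer changing: model has converged
            break
        prev = cur
    return params
```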
In addition, in order to achieve the above object, the present invention also provides a recall model optimization apparatus, comprising:
an acquisition module, configured to obtain the current ranking result of the first preset model;
a processing module, configured to perform relevance scoring on the current ranking result to obtain label information corresponding to the current ranking result;
the processing module being further configured to generate optimized sample data according to the current ranking result and the corresponding label information;
and the processing module being further configured to complete the optimization of the recall model to be optimized according to the optimized sample data.
Optionally, the acquisition module is further configured to determine a search engine address according to a preset address table, where the preset address table records search engine addresses searched using the first preset model;
crawl fine-ranking data from the search engine address with a web crawler;
and filter the fine-ranking data to obtain the current ranking result.
Optionally, the processing module is further configured to perform relevance scoring on the recommended questions in the current ranking result to obtain relevance scores, where the current ranking result comprises a plurality of recommended questions;
label the corresponding recommended question with a positive label when its relevance score is greater than or equal to a preset score;
label the corresponding recommended question with a negative label when its relevance score is smaller than the preset score;
and determine the label information corresponding to the current ranking result according to each positive label and each negative label.
Optionally, the processing module is further configured to obtain the query question corresponding to the recommended question;
determine a corresponding first target semantic vector according to the query question;
determine a corresponding second target semantic vector according to the recommended question;
calculate the similarity between the first target semantic vector and the second target semantic vector;
and determine the relevance score according to the similarity.
Optionally, the processing module is further configured to obtain user log information corresponding to the current ranking result;
determine the number of impressions of each recommended question according to the user log information;
and determine the relevance score according to the impression count of each recommended question.
Optionally, the processing module is further configured to determine the total impression count of the current ranking result according to the impression counts;
determine a target weighting coefficient according to the total impression count;
and determine the relevance score of each recommended question according to the target weighting coefficient and its impression count.
Optionally, the processing module is further configured to determine a query sample according to the current ranking result, where the current ranking result comprises a plurality of recommended questions;
determine a first positive sample and a first negative sample according to each recommended question and the label information;
and generate the optimized sample data according to the query sample, the first positive sample, and the first negative sample.
Optionally, the processing module is further configured to obtain a second positive sample and a second negative sample, where the second positive sample and the second negative sample are positive and negative samples corresponding to ranking results from batches other than the current ranking result;
obtain a third negative sample according to the second positive sample and the second negative sample;
and generate the optimized sample data according to the query sample, the first positive sample, the first negative sample, and the third negative sample.
In addition, in order to achieve the above object, the present invention also provides a recall model optimization device, comprising: a memory, a processor, and a recall model optimization program stored in the memory and executable on the processor, the recall model optimization program being configured to implement the steps of the recall model optimization method described above.
In addition, in order to achieve the above object, the present invention also provides a storage medium having a recall model optimization program stored thereon which, when executed by a processor, implements the steps of the recall model optimization method described above.
The method obtains a current ranking result of a first preset model; performs relevance scoring on the current ranking result to obtain label information corresponding to the current ranking result; generates optimized sample data according to the current ranking result and the corresponding label information; and completes optimization of the recall model to be optimized according to the optimized sample data. In this way, an existing recall model is optimized: because ranking results produced by other models already exhibit a certain degree of similarity, re-evaluating relevance on top of those results re-labels them and further optimizes the existing recall model, improving both its discrimination and its recall rate.
Drawings
FIG. 1 is a schematic structural diagram of a recall model optimization device for the hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flow chart of a first embodiment of the recall model optimization method of the present invention;
FIG. 3 is a flow chart of a second embodiment of the recall model optimization method of the present invention;
FIG. 4 is a block diagram of a first embodiment of the recall model optimization apparatus according to the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a recall model optimization device for the hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the recall model optimization device may include: a processor 1001, such as a central processing unit (Central Processing Unit, CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable communication between these components. The user interface 1003 may include a display (Display) and an input unit such as a keyboard (Keyboard), and may optionally further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless Fidelity (Wi-Fi) interface). The memory 1005 may be a high-speed random access memory (Random Access Memory, RAM) or a stable non-volatile memory (Non-Volatile Memory, NVM), such as a disk memory. The memory 1005 may optionally also be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the structure shown in FIG. 1 does not constitute a limitation of the recall model optimization device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
As shown in fig. 1, the memory 1005, as a storage medium, may contain an operating system, a network communication module, a user interface module, and a recall model optimization program.
In the recall model optimization device shown in fig. 1, the network interface 1004 is mainly used for data communication with a network server, and the user interface 1003 is mainly used for data interaction with a user. In the recall model optimization device of the present invention, the device calls, through the processor 1001, the recall model optimization program stored in the memory 1005 and executes the recall model optimization method provided by the embodiments of the present invention.
The embodiment of the invention provides a recall model optimization method, and referring to fig. 2, fig. 2 is a flow diagram of a first embodiment of the recall model optimization method.
In this embodiment, the recall model optimization method includes the following steps:
Step S10: obtaining the current ranking result of the first preset model.
It should be noted that the execution body of this embodiment is a model optimization system, which may be deployed on a server or on other devices with the same or similar functions as a server; this embodiment does not limit this.
It should be noted that, in model optimization, data with a certain degree of similarity must first be mined so that the model can be optimized. The training data for optimization is therefore extracted from the current ranking result of the first preset model, and that ranking result comes from a fine-ranking scenario. For example, the top-10 results of an existing search engine are selected for labeling; the distribution of this data differs greatly from that of the coarse-ranking and recall scenarios, so training on it increases the model's semantic discrimination.
It can be understood that the whole search flow can be roughly divided into three parts, "recall -> coarse ranking -> fine ranking". The first preset model is the model that ranks during the fine-ranking stage, so the current ranking result is in fact a fine-ranking result. Saying that it is obtained from the first preset model only indicates its origin; it need not be taken directly from the first preset model, and may instead be a ranking result obtained by searching directly on a search website that has ultimately passed through the fine-ranking stage.
In this embodiment, a search engine address is determined according to a preset address table, where the preset address table records search engine addresses searched using the first preset model; fine-ranking data is crawled from the search engine address with a web crawler; and the fine-ranking data is filtered to obtain the current ranking result.
In a specific implementation, this embodiment proposes a preferred way of obtaining the ranking result: first create a preset address table so that a web crawler or a person can capture data purposefully. When the query question is the same, different search engines may return different recommended questions, and the similarity between the query question and the recommended questions varies in quality. If the similarity is too high, the optimization process for improving the ranking result cannot be carried out effectively; if the similarity is too low, the recommended questions may all be truly related, so that too few negative samples meet the requirement and the optimization cannot proceed. It is therefore necessary to determine a preset address table in advance, so that the ranking results returned by the corresponding search engines are obtained from addresses that meet the requirement.
It should be noted that the relationship between a query question (Query) and a recommended question (Title) is that of a question and its similar questions. For example, for the query "the display light is flashing but the screen stays dark?", a search engine may feed back multiple options, i.e., recommended questions, such as "computer screen stays dark while the monitor indicator light is flashing", "Xiaomi tablet screen will not turn on but the indicator keeps flashing", "how to handle an electric water heater whose display keeps flashing", or "Bear yogurt maker display light keeps flashing and beeping without working". Each Title has a certain degree of relevance to the Query entered by the user, and the user can select the question they expect in order to find the corresponding answer.
The preset address table may be generated as follows: obtaining a target address set; sampling the ranking results of each target address in the target address set to obtain corresponding sampled ranking results; and calculating a relevance index for each sampled ranking result, then determining the preset address table according to each relevance index.
It can be understood that the target address set, i.e., the pre-collected addresses of arbitrary search engines, is screened, and suitable addresses are selected and put into the preset address table. The screening judges the correlation between the ranking results at each address and the query question; the average similarity of the recommended questions in a ranking result may be computed, or the similarity of the least similar question, and this embodiment is not limited in this respect. The sampling of ranking results may be random: a query question is entered at random into the selected search engine and the ranking result is taken as a sample, and repeating this process accumulates a sufficient number of samples.
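The screening just described can be sketched as follows: sample ranking results from each candidate address, compute the mean query/title similarity, and keep addresses whose results are neither near-duplicates nor mostly irrelevant. The acceptance band and the helper callbacks are illustrative assumptions.

```python
def build_address_table(candidate_addresses, sample_fn, score_fn,
                        low=0.3, high=0.9, n_samples=20):
    """Screen candidate search-engine addresses into the preset address table.

    `sample_fn(address, n)` returns n sampled (query, title) pairs from that
    address; `score_fn(query, title)` returns their similarity. The band
    [low, high] and both helpers are assumed, illustrative interfaces.
    """
    table = []
    for addr in candidate_addresses:
        pairs = sample_fn(addr, n_samples)
        sims = [score_fn(q, t) for q, t in pairs]
        avg = sum(sims) / len(sims) if sims else 0.0
        # Reject addresses whose results are too similar (no room to optimize)
        # or too dissimilar (relevance index out of the useful range).
        if low <= avg <= high:
            table.append(addr)
    return table
```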
Step S20: performing relevance scoring on the current ranking result to obtain label information corresponding to the current ranking result.
It should be noted that relevance scoring is the process of labeling the ranking result; the labels tell the recall model to be optimized whether each recommended question in the ranking result is a positive or a negative sample. For example, the model's training data can first be designed in the form "Query (query question), Title+ (recommended question serving as a positive sample), Title- (recommended question serving as a negative sample)". If the labels are divided into 3, 2, 1, and 0, the positive samples are the Titles whose Label1 is 3 or 2, the negative samples are the Titles whose Label1 is 1 or 0, and this data distribution is thereby introduced into the training data.
Each recommended question in the ranking result is treated as a Title, and each Title is labeled according to its relevance score; the label serves as a reference for generating positive or negative samples in the subsequent process.
In a specific implementation, the ranking result can be labeled manually. Although manual labeling is very time- and labor-intensive, the labeled data is of high quality, so manual labeling is quite accurate for bringing the model closer to human discrimination ability. To reduce the high time cost of manual labeling, automatic labeling can be combined with manual review: Titles with small distinctions are labeled manually, while data with large distinctions are labeled automatically as a supplement to the samples.
Step S30: generating optimized sample data according to the current ranking result and the corresponding label information.
It should be noted that the sample data required for optimization can be obtained from the current ranking result and the assigned label information.
In this embodiment, a query sample is determined according to the current ranking result, where the current ranking result comprises a plurality of recommended questions; a first positive sample and a first negative sample are determined according to each recommended question and the label information; and the optimized sample data is generated according to the query sample, the first positive sample, and the first negative sample.
It can be understood that determining the query sample from the current ranking result means determining which query question (Query) the ranking result corresponds to and taking that query question as the query sample; positive and negative samples are then distinguished according to the label information to determine the first positive sample and the first negative sample, and finally the optimized sample data is generated from the query sample, the first positive sample, and the first negative sample. The first positive and negative samples are the sample data obtained directly from the current ranking result according to the label information. For example, the recommended questions returned for the current query question form one batch of data, each recommended question in the batch carrying its label information; the positive and negative samples generated for the current query question from this batch's labels are the first positive and first negative samples. However, the samples corresponding to a query question need not all come from the recommended questions returned for that query question; they may be supplemented with recommended questions from other batches. Hence, only the samples obtained from recommended questions in the same batch as the query question are first positive or first negative samples.
In this embodiment, a second positive sample and a second negative sample are obtained, where the second positive sample and the second negative sample are positive samples and negative samples that do not correspond to the sorting result of the same batch as the current sorting result; obtaining a third negative sample according to the second positive sample and the second negative sample; and generating optimized sample data according to the query sample, the first positive sample, the first negative sample and the third negative sample.
It can be understood that, among the sample data corresponding to a query question, not all samples are necessarily from the same batch as that query question; samples generated from other batches' recommended questions are the second positive and second negative samples, and the third negative sample is a mixture of the first and second negative samples. A portion of the Titles therefore needs to be extracted as a supplement to the negative samples, and introducing massively weighted-sampled negatives brings the data distribution of the coarse-ranking and recall scenarios into the training data; the model's loss function is designed accordingly. The common Triplet Loss and MSE Loss can only optimize the model by comparing the relative quality of a few samples, whereas a cross-entropy loss function can introduce many more negative examples: during training, the positive or negative samples of all other query questions (Queries) in the same training batch serve as negative samples for the current query question. This lets the current Query interact fully with a large number of negative samples, keeps the training data distribution as consistent as possible with the coarse-ranking and recall scenarios, and prevents the model from over-focusing on the semantic relationships of, say, the top-ten results when only the top-ten recommended questions are selected from the ranking.
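The in-batch-negatives cross-entropy described above can be sketched as a softmax over each query's similarities to every title in the batch, where the matching title is the positive and all other titles act as negatives. This is a generic sketch of the technique, not the patent's exact loss.

```python
import math

def in_batch_cross_entropy(sim_matrix):
    """Cross-entropy loss with in-batch negatives.

    sim_matrix[i][j] is the similarity between query i and title j in the
    batch; the matching (positive) title for query i is title i, and every
    other title in the row serves as a negative for that query.
    """
    n = len(sim_matrix)
    total = 0.0
    for i in range(n):
        exps = [math.exp(s) for s in sim_matrix[i]]
        # Softmax over the whole row: the current Query interacts with
        # every other title in the batch as a negative sample.
        total += -math.log(exps[i] / sum(exps))
    return total / n
```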
Step S40: completing the optimization of the recall model to be optimized according to the optimized sample data.
It should be noted that optimizing the recall model to be optimized is essentially retraining it on the optimized sample data. The recall model to be optimized may be a model trained on ordinary data which, because of the nature of that training data, does not discriminate strongly between recommended questions (Titles); retraining it on the optimized sample data therefore achieves the optimization.
In this embodiment, a training sample and a test sample are determined from the optimized sample data; the recall model to be optimized is trained on the training sample to obtain a recall model to be tested; the recall model to be tested is tested on the test sample to obtain a test result; and when the test result indicates success, the optimization of the recall model to be optimized is complete. For example, about 100,000 optimized samples are held out for testing, and the rest are used for model training.
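The held-out split just mentioned can be sketched as shuffling the optimized samples and reserving a fixed-size test set; the shuffle, seed, and default size follow the ~100,000-sample example but are otherwise illustrative.

```python
import random

def split_samples(samples, test_size=100_000, seed=0):
    """Split optimized sample data into (training, test) portions.

    Shuffles a copy of `samples` with a fixed seed for reproducibility,
    then holds out the first `test_size` rows for testing and returns the
    rest for training. The seed and shuffling are illustrative choices.
    """
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]  # (train, test)
```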
In the embodiment, the training sample is input into a recall model to be optimized to determine model characterization parameters; calculating a loss value according to the model characterization parameters; and adjusting the recall model to be optimized according to the loss value until the model converges to obtain the recall model to be tested.
In a specific implementation, the model characterization parameters are the training parameters required for calculating the loss value, and different loss functions require different model characterization parameters, so the type of loss function is not limited in this embodiment. One implementation is as follows: the scheme designs the loss function of the recall model to be optimized. The common Triplet Loss and MSE Loss can only optimize the model by comparing the relative quality of a few samples, whereas the cross entropy loss function can introduce many more negative examples: during training, the Titles of all other samples in the same batch serve as negative examples, so that the current Query interacts fully with a large number of negative samples. The loss value can therefore be calculated with the cross entropy loss function, and the recall model to be optimized is adjusted according to the loss value until the model converges, yielding the recall model to be tested.
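A cross entropy loss with in-batch negatives, in the spirit described above, can be sketched as follows (this is a generic contrastive formulation, not the patent's exact implementation; the temperature value is an assumption): query i's paired Title is its positive, and every other Title in the batch acts as a negative.

```python
import numpy as np

def in_batch_cross_entropy(query_vecs, title_vecs, temperature=0.05):
    """Cross entropy over a batch similarity matrix (illustrative sketch).

    Row i's positive is title i; all other titles in the batch are that
    query's negatives, giving full interaction with many negatives."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    t = title_vecs / np.linalg.norm(title_vecs, axis=1, keepdims=True)
    logits = q @ t.T / temperature               # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # NLL of the matched pairs
```

A lower loss for correctly matched query/Title pairs than for shuffled pairs is what drives the model to separate relevant from irrelevant Titles.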
This embodiment obtains a current sorting result of a first preset model; performs relevance scoring on the current sorting result to obtain label information corresponding to the current sorting result; generates optimized sample data according to the current sorting result and the corresponding label information; and completes the optimization of the recall model to be optimized according to the optimized sample data. In this way the existing recall model is optimized: since the sorting results obtained through other models already have a certain similarity, re-evaluating relevance on the basis of those sorting results further optimizes the existing recall model and improves its discrimination and recall rate.
Referring to fig. 3, fig. 3 is a flowchart of a second embodiment of a recall model optimization method according to the present invention.
Based on the above first embodiment, the recall model optimizing method of the present embodiment further includes, in the step S20:
step S21: and carrying out relevance scoring on the recommended questions in the current sorting result to obtain relevance scores, wherein the current sorting result comprises a plurality of recommended questions.
It should be noted that the current sorting result is the sorting result currently being processed. Each recommended question in the current sorting result can be scored manually; manually labeled data is time-consuming and labor-intensive to produce but of high quality, and manually scored data generally makes the best sample data. The score can also be obtained from impression counts, click-through rate, and other similarity measures, which is not limited in this embodiment.
In this embodiment, a query question corresponding to the recommended question is obtained; determining a corresponding first target semantic vector according to the query problem; determining a corresponding second target semantic vector according to the recommendation problem; calculating the similarity of the first target semantic vector and the second target semantic vector; and determining a relevance score according to the similarity.
It may be appreciated that the semantic vectors of the query question and the recommended question may be extracted by a semantic vector model: the first target semantic vector is the semantic vector of the query question, the second target semantic vector is the semantic vector of the recommended question, and the cosine of the angle between the two vectors gives their similarity, from which the relevance is scored. For example: when the similarity is greater than a preset similarity threshold, the relevance is 1; otherwise, the relevance is 0.
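The cosine-similarity scoring just described can be sketched as follows (the 0.8 threshold is illustrative; the patent only requires some preset similarity threshold):

```python
import numpy as np

def relevance_score(query_vec, title_vec, threshold=0.8):
    """Binary relevance from semantic-vector cosine similarity (sketch).

    Returns 1 when the cosine of the angle between the two vectors exceeds
    the preset threshold, otherwise 0."""
    cos = np.dot(query_vec, title_vec) / (
        np.linalg.norm(query_vec) * np.linalg.norm(title_vec))
    return 1 if cos > threshold else 0
```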
In this embodiment, user log information corresponding to the current sorting result is obtained; determining the showing times of each recommended problem according to the user log information; and determining a relevance score according to the display times corresponding to each recommendation problem.
It should be noted that the relevance may also be determined from the number of times each recommended question is displayed, i.e., how often users select and are shown the recommended question when searching, which can be obtained from the user log information. The more often a recommended question is displayed, the more relevant users consider it to the query question; hence the relevance of a recommended question can be determined from its display count, and recommended questions with low relevance can be added directly to the optimized sample data as negative samples.
In this embodiment, determining a total display number of the current sorting result according to the display number; determining a target weighting coefficient according to the total display times; and determining the relevance scores corresponding to the recommendation questions according to the target weighting coefficients and the display times.
It can be understood that for a hot question the total display volume is several orders of magnitude larger than for a cold question, so scoring the display counts of recommended questions belonging to different query questions against a fixed count would be distorted by question popularity and become quite inaccurate. A weighting coefficient should therefore be determined from the total display count of the query question, the display count of each recommended question scaled proportionally, and recommended questions whose scaled count falls below the threshold taken as negative samples. The total display count of the current sorting result is the sum of the display counts of all recommended questions in it; for example, if the current sorting result is the top-10 recommended questions, the display counts of those 10 recommended questions are summed to obtain the total display count.
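One plausible reading of the weighting step is to normalize each display count by the query's total (the patent does not give the exact formula, so this particular coefficient is an assumption):

```python
def impression_relevance_scores(impressions):
    """Scale each recommended question's display count by a coefficient
    derived from the query's total display count, so hot and cold queries
    become comparable (illustrative sketch; exact weighting unspecified)."""
    total = sum(impressions)
    if total == 0:
        return [0.0] * len(impressions)
    weight = 1.0 / total              # target weighting coefficient
    return [count * weight for count in impressions]
```

With this choice, a hot query and a cold query whose recommended questions have the same display proportions receive the same relevance scores.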
Step S22: and labeling the corresponding recommendation problem with a positive label when the relevance score is larger than or equal to a preset score.
It should be noted that when the relevance score is greater than or equal to the preset score, the query question and the recommended question are highly relevant, so the corresponding recommended question should be marked with a positive label, whereby it is determined to serve as a positive sample for model optimization.
Step S23: and labeling the corresponding recommended problems with negative labels when the relevance score is smaller than a preset score.
It should be noted that when the relevance score is smaller than the preset score, the query question and the recommended question have low relevance, so the corresponding recommended question should be marked with a negative label, whereby it is determined to serve as a negative sample for model optimization.
Step S24: and determining label information corresponding to the current sequencing result according to each positive label and each negative label.
It can be understood that the label information corresponding to the current sorting result is determined according to the positive labels and the negative labels. The sorting result recommended for one query question contains a plurality of recommended questions, each corresponding to one piece of label information; according to that label information, when the optimized sample data are generated, it can be determined whether a recommended question is assigned as a positive sample or a negative sample of that query question.
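Steps S22–S24 can be sketched as a simple thresholding pass over the relevance scores (the function name and the preset score are illustrative):

```python
def label_ranking_result(scores, preset_score=0.5):
    """Attach a positive or negative label to each recommended question
    based on its relevance score: >= preset_score is positive (step S22),
    below it is negative (step S23); the resulting list is the label
    information for the current sorting result (step S24)."""
    return ["positive" if s >= preset_score else "negative" for s in scores]
```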
In this embodiment, relevance scoring is performed on the recommended questions in the current sorting result to obtain relevance scores, wherein the current sorting result comprises a plurality of recommended questions; a positive label is attached to the corresponding recommended question when the relevance score is greater than or equal to a preset score; a negative label is attached when the relevance score is smaller than the preset score; and the label information corresponding to the current sorting result is determined according to each positive label and each negative label. In this way the label information is obtained, with positive or negative labels attached to the recommended questions by manual or automatic labeling. Because the recommended questions come from the search engine's ranking results and are therefore highly similar overall, the positive and negative labels obtained on that basis help generate high-quality sample data, further improving the discrimination of the recall model to be optimized.
In addition, the embodiment of the invention also provides a storage medium, wherein the storage medium is stored with a recall model optimizing program, and the recall model optimizing program realizes the steps of the recall model optimizing method when being executed by a processor.
Referring to fig. 4, fig. 4 is a block diagram showing the structure of a first embodiment of the recall model optimizing apparatus according to the present invention.
As shown in fig. 4, the recall model optimizing device provided by the embodiment of the invention includes:
an obtaining module 10, configured to obtain a current sorting result of the first preset model;
the processing module 20 is configured to score the relevance of the current sorting result, so as to obtain tag information corresponding to the current sorting result;
the processing module 20 is further configured to generate optimized sample data according to the current sorting result and the corresponding tag information;
the processing module 20 is further configured to complete optimization of the recall model to be optimized according to the optimized sample data.
It should be understood that the foregoing is illustrative only and is not limiting, and that in specific applications, those skilled in the art may set the invention as desired, and the invention is not limited thereto.
The obtaining module 10 of this embodiment obtains a current sorting result of the first preset model; the processing module 20 performs relevance scoring on the current sorting result to obtain label information corresponding to the current sorting result; the processing module 20 generates optimized sample data according to the current sorting result and the corresponding label information; and the processing module 20 completes the optimization of the recall model to be optimized according to the optimized sample data. In this way the existing recall model is optimized: since the sorting results obtained through other models already have a certain similarity, re-evaluating relevance on the basis of those sorting results further optimizes the existing recall model and improves its discrimination and recall rate.
In an embodiment, the obtaining module 10 is further configured to determine a search engine address according to a preset address table, where the preset address table is a search engine address searched by using a first preset model;
capturing fine-ranking data in the search engine address according to the web crawler;
and screening the fine ranking data to obtain a current ranking result.
In an embodiment, the processing module 20 is further configured to score a relevance of the recommended questions in the current ranking result, to obtain a relevance score, where the current ranking result includes a plurality of recommended questions;
labeling the corresponding recommendation problem with a positive label when the relevance score is larger than or equal to a preset score;
labeling the corresponding recommended problems with negative labels when the relevance score is smaller than a preset score;
and determining label information corresponding to the current sequencing result according to each positive label and each negative label.
In an embodiment, the processing module 20 is further configured to obtain a query question corresponding to the recommended question;
determining a corresponding first target semantic vector according to the query problem;
determining a corresponding second target semantic vector according to the recommendation problem;
Calculating the similarity of the first target semantic vector and the second target semantic vector;
and determining a relevance score according to the similarity.
In an embodiment, the processing module 20 is further configured to obtain user log information corresponding to the current ranking result;
determining the showing times of each recommended problem according to the user log information;
and determining a relevance score according to the display times corresponding to each recommendation problem.
In an embodiment, the processing module 20 is further configured to determine a total number of times of displaying the current sorting result according to the number of times of displaying;
determining a target weighting coefficient according to the total display times;
and determining the relevance scores corresponding to the recommendation questions according to the target weighting coefficients and the display times.
In an embodiment, the processing module 20 is further configured to determine a query sample according to the current ranking result, where the current ranking result includes a plurality of recommendation questions;
determining a first positive sample and a first negative sample according to each recommended problem and label information;
and generating optimized sample data according to the query sample, the first positive sample and the first negative sample.
In an embodiment, the processing module 20 is further configured to obtain a second positive sample and a second negative sample, where the second positive sample and the second negative sample are positive samples and negative samples that do not correspond to the same batch of sorting results as the current sorting result;
Obtaining a third negative sample according to the second positive sample and the second negative sample;
and generating optimized sample data according to the query sample, the first positive sample, the first negative sample and the third negative sample.
In an embodiment, the processing module 20 is further configured to determine a training sample and a test sample according to the optimized sample data;
training the recall model to be optimized according to the training sample to obtain a recall model to be tested;
testing the recall model to be tested according to the sample to be tested to obtain a test result;
and when the test result is that the test is successful, completing the optimization of the recall model to be optimized.
In an embodiment, the processing module 20 is further configured to input the training samples into a recall model to be optimized to determine model characterization parameters;
calculating a loss value according to the model characterization parameters;
and adjusting the recall model to be optimized according to the loss value until the model converges to obtain the recall model to be tested.
It should be noted that the above-described working procedure is merely illustrative, and does not limit the scope of the present invention, and in practical application, a person skilled in the art may select part or all of them according to actual needs to achieve the purpose of the embodiment, which is not limited herein.
In addition, technical details not described in detail in this embodiment may refer to the recall model optimization method provided in any embodiment of the present invention, which is not described herein.
Furthermore, it should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, or of course by means of hardware, but in many cases the former is the preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied, in essence or in the part contributing to the prior art, in the form of a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, or optical disk) and including several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the method according to the embodiments of the present invention.
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.
The invention also discloses A1, a recall model optimization method, which comprises the following steps:
acquiring a current sequencing result of a first preset model;
carrying out relevance scoring on the current sequencing result to obtain label information corresponding to the current sequencing result;
generating optimized sample data according to the current sequencing result and the corresponding label information;
and completing optimization of the recall model to be optimized according to the optimized sample data.
A2, the method of A1, wherein the step of obtaining the current sequencing result of the first preset model includes:
determining a search engine address according to a preset address table, wherein the preset address table is the search engine address searched by using a first preset model;
capturing fine-ranking data in the search engine address according to the web crawler;
and screening the fine ranking data to obtain a current ranking result.
A3, the method of A1, the scoring the relevance of the current sorting result to obtain the label information corresponding to the current sorting result, includes:
carrying out relevance scoring on the recommended questions in the current sorting result to obtain relevance scores, wherein the current sorting result comprises a plurality of recommended questions;
labeling the corresponding recommendation problem with a positive label when the relevance score is larger than or equal to a preset score;
labeling the corresponding recommended problems with negative labels when the relevance score is smaller than a preset score;
and determining label information corresponding to the current sequencing result according to each positive label and each negative label.
A4, the method of A3, wherein the step of scoring the relevance of the current sorting result to obtain a relevance score comprises the following steps:
acquiring a query problem corresponding to the recommended problem;
determining a corresponding first target semantic vector according to the query problem;
determining a corresponding second target semantic vector according to the recommendation problem;
calculating the similarity of the first target semantic vector and the second target semantic vector;
and determining a relevance score according to the similarity.
A5, the method of A3, the scoring the relevance of the current sorting result to obtain a relevance score, includes:
Acquiring user log information corresponding to the current sequencing result;
determining the showing times of each recommended problem according to the user log information;
and determining a relevance score according to the display times corresponding to each recommendation problem.
A6, the method of A5, wherein determining the relevance score according to the number of presentations includes:
determining the total display times of the current sequencing result according to the display times;
determining a target weighting coefficient according to the total display times;
and determining the relevance scores corresponding to the recommendation questions according to the target weighting coefficients and the display times.
A7, generating optimized sample data according to the current sorting result and the corresponding label information according to the method of A1, wherein the method comprises the following steps:
determining a query sample according to the current sequencing result, wherein the current sequencing result comprises a plurality of recommendation problems;
determining a first positive sample and a first negative sample according to each recommended problem and label information;
and generating optimized sample data according to the query sample, the first positive sample and the first negative sample.
A8, the method of A7, the generating optimized sample data from the query sample, the first positive sample, and the first negative sample, further comprising:
Acquiring a second positive sample and a second negative sample, wherein the second positive sample and the second negative sample are positive samples and negative samples which do not correspond to the sequencing results of the same batch as the current sequencing results;
obtaining a third negative sample according to the second positive sample and the second negative sample;
and generating optimized sample data according to the query sample, the first positive sample, the first negative sample and the third negative sample.
A9, the method of A1, the optimizing the recall model to be optimized according to the optimized sample data, comprises the following steps:
determining a training sample and a test sample according to the optimized sample data;
training the recall model to be optimized according to the training sample to obtain a recall model to be tested;
testing the recall model to be tested according to the sample to be tested to obtain a test result;
and when the test result is that the test is successful, completing the optimization of the recall model to be optimized.
A10, training the recall model to be optimized according to the training sample to obtain a recall model to be tested, wherein the method comprises the following steps:
inputting the training sample into a recall model to be optimized to determine model characterization parameters;
calculating a loss value according to the model characterization parameters;
And adjusting the recall model to be optimized according to the loss value until the model converges to obtain the recall model to be tested.
The invention also discloses a B11, a recall model optimizing device, which comprises:
the acquisition module is used for acquiring the current sequencing result of the first preset model;
the processing module is used for scoring the relevance of the current sequencing result to obtain label information corresponding to the current sequencing result;
the processing module is further used for generating optimized sample data according to the current sequencing result and the corresponding label information;
and the processing module is also used for completing the optimization of the recall model to be optimized according to the optimized sample data.
B12, the apparatus of B11, the said acquisition module, is further used for confirming the search engine address according to the address table of preseting, the said address table of preseting is the search engine address searched using the first preseting model;
capturing fine-ranking data in the search engine address according to the web crawler;
and screening the fine ranking data to obtain a current ranking result.
B13, the device of B11, the processing module is further configured to score the relevance of the recommended questions in the current ranking result, so as to obtain a relevance score, where the current ranking result includes a plurality of recommended questions;
Labeling the corresponding recommendation problem with a positive label when the relevance score is larger than or equal to a preset score;
labeling the corresponding recommended problems with negative labels when the relevance score is smaller than a preset score;
and determining label information corresponding to the current sequencing result according to each positive label and each negative label.
B14, the device of B13, the processing module is further configured to obtain a query question corresponding to the recommended question;
determining a corresponding first target semantic vector according to the query problem;
determining a corresponding second target semantic vector according to the recommendation problem;
calculating the similarity of the first target semantic vector and the second target semantic vector;
and determining a relevance score according to the similarity.
B15, the device of B13, the said processing module, is used for obtaining the user log information corresponding to said current sequencing result;
determining the showing times of each recommended problem according to the user log information;
and determining a relevance score according to the display times corresponding to each recommendation problem.
B16, the device of B15, the processing module is further configured to determine a total number of times of display of the current sorting result according to the number of times of display;
Determining a target weighting coefficient according to the total display times;
and determining the relevance scores corresponding to the recommendation questions according to the target weighting coefficients and the display times.
B17, the apparatus of B11, the processing module is further configured to determine a query sample according to the current ranking result, where the current ranking result includes a plurality of recommendation questions;
determining a first positive sample and a first negative sample according to each recommended problem and label information;
and generating optimized sample data according to the query sample, the first positive sample and the first negative sample.
B18, the apparatus of B17, the processing module further configured to obtain a second positive sample and a second negative sample, where the second positive sample and the second negative sample are positive samples and negative samples that do not correspond to a sorting result of the same batch as the current sorting result;
obtaining a third negative sample according to the second positive sample and the second negative sample;
and generating optimized sample data according to the query sample, the first positive sample, the first negative sample and the third negative sample.

Claims (10)

1. A recall model optimization method, characterized in that the recall model optimization method comprises:
acquiring a current sequencing result of a first preset model;
Carrying out relevance scoring on the current sequencing result to obtain label information corresponding to the current sequencing result;
generating optimized sample data according to the current sequencing result and the corresponding label information;
and completing optimization of the recall model to be optimized according to the optimized sample data.
2. The method of claim 1, wherein the obtaining the current ranking result of the first preset model comprises:
determining a search engine address according to a preset address table, wherein the preset address table is the search engine address searched by using a first preset model;
capturing fine-ranking data in the search engine address according to the web crawler;
and screening the fine ranking data to obtain a current ranking result.
3. The method of claim 1, wherein scoring the relevance of the current ranking result to obtain tag information corresponding to the current ranking result comprises:
carrying out relevance scoring on the recommended questions in the current sorting result to obtain relevance scores, wherein the current sorting result comprises a plurality of recommended questions;
labeling the corresponding recommendation problem with a positive label when the relevance score is larger than or equal to a preset score;
Labeling the corresponding recommended problems with negative labels when the relevance score is smaller than a preset score;
and determining label information corresponding to the current sequencing result according to each positive label and each negative label.
4. The method of claim 3, wherein said scoring the relevance of the current ranking result to obtain a relevance score comprises:
acquiring a query problem corresponding to the recommended problem;
determining a corresponding first target semantic vector according to the query problem;
determining a corresponding second target semantic vector according to the recommendation problem;
calculating the similarity of the first target semantic vector and the second target semantic vector;
and determining a relevance score according to the similarity.
5. The method of claim 3, wherein said scoring the relevance of the current ranking result to obtain a relevance score comprises:
acquiring user log information corresponding to the current sequencing result;
determining the showing times of each recommended problem according to the user log information;
and determining a relevance score according to the display times corresponding to each recommendation problem.
6. The method of claim 5, wherein said determining a relevance score based on said number of impressions comprises:
Determining the total display times of the current sequencing result according to the display times;
determining a target weighting coefficient according to the total display times;
and determining the relevance scores corresponding to the recommendation questions according to the target weighting coefficients and the display times.
7. The method of claim 1, wherein generating optimized sample data from the current ranking result and corresponding tag information comprises:
determining a query sample according to the current sequencing result, wherein the current sequencing result comprises a plurality of recommendation problems;
determining a first positive sample and a first negative sample according to each recommended problem and label information;
and generating optimized sample data according to the query sample, the first positive sample and the first negative sample.
8. A recall model optimizing apparatus, characterized in that the recall model optimizing apparatus comprises:
an acquisition module, used for acquiring the current ranking result of a first preset model;
a processing module, used for scoring the relevance of the current ranking result to obtain label information corresponding to the current ranking result;
the processing module is further used for generating optimized sample data according to the current ranking result and the corresponding label information;
and the processing module is further used for optimizing the recall model to be optimized according to the optimized sample data.
9. A recall model optimizing apparatus, the apparatus comprising: a memory, a processor, and a recall model optimization program stored on the memory and executable on the processor, the recall model optimization program configured to implement the steps of the recall model optimization method of any one of claims 1 to 7.
10. A storage medium having stored thereon a recall model optimization program which, when executed by a processor, implements the steps of the recall model optimization method of any one of claims 1 to 7.
CN202210057510.7A 2022-01-18 2022-01-18 Recall model optimization method, recall model optimization device, recall model optimization equipment and recall model storage medium Pending CN116501950A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210057510.7A CN116501950A (en) 2022-01-18 2022-01-18 Recall model optimization method, recall model optimization device, recall model optimization equipment and recall model storage medium

Publications (1)

Publication Number Publication Date
CN116501950A true CN116501950A (en) 2023-07-28

Family

ID=87317142

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210057510.7A Pending CN116501950A (en) 2022-01-18 2022-01-18 Recall model optimization method, recall model optimization device, recall model optimization equipment and recall model storage medium

Country Status (1)

Country Link
CN (1) CN116501950A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117743950A (en) * 2024-02-20 2024-03-22 浙江口碑网络技术有限公司 Correlation judgment method and LLM-based correlation judgment model construction method
CN117743950B (en) * 2024-02-20 2024-05-28 浙江口碑网络技术有限公司 Correlation judgment method and LLM-based correlation judgment model construction method

Similar Documents

Publication Publication Date Title
US20230297581A1 (en) Method and system for ranking search content
CN104077306B (en) The result ordering method and system of a kind of search engine
CN111368042A (en) Intelligent question and answer method and device, computer equipment and computer storage medium
CN109299344A (en) The generation method of order models, the sort method of search result, device and equipment
CN109447958B (en) Image processing method, image processing device, storage medium and computer equipment
CN109857938B (en) Searching method and searching device based on enterprise information and computer storage medium
CN110610193A (en) Method and device for processing labeled data
CN110737756B (en) Method, apparatus, device and medium for determining answer to user input data
CN111639247A (en) Method, apparatus, device and computer-readable storage medium for evaluating quality of review
CN114707074A (en) Content recommendation method, device and system
CN112417848A (en) Corpus generation method and device and computer equipment
CN111563207B (en) Search result sorting method and device, storage medium and computer equipment
CN111104422B (en) Training method, device, equipment and storage medium of data recommendation model
CN110209944B (en) Stock analyst recommendation method and device, computer equipment and storage medium
CN113836388A (en) Information recommendation method and device, server and storage medium
CN116501950A (en) Recall model optimization method, recall model optimization device, recall model optimization equipment and recall model storage medium
CN110955774A (en) Word frequency distribution-based character classification method, device, equipment and medium
CN115860283A (en) Contribution degree prediction method and device based on portrait of knowledge worker
CN115168700A (en) Information flow recommendation method, system and medium based on pre-training algorithm
CN116303910A (en) Question and answer page recommendation method, device, equipment and storage medium
CN113590673A (en) Data heat degree statistical method based on block chain deep learning
CN110569436A (en) network media news recommendation method based on high-dimensional auxiliary information
CN112749313A (en) Label labeling method and device, computer equipment and storage medium
CN117217852B (en) Behavior recognition-based purchase willingness prediction method and device
CN110688567A (en) Mobile application program searching method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination