CN114281935A - Training method, device, medium and equipment for search result classification model - Google Patents

Training method, device, medium and equipment for search result classification model

Info

Publication number
CN114281935A
Authority
CN
China
Prior art keywords
search
search result
target
similarity
morpheme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111086790.6A
Other languages
Chinese (zh)
Inventor
康战辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202111086790.6A priority Critical patent/CN114281935A/en
Publication of CN114281935A publication Critical patent/CN114281935A/en
Pending legal-status Critical Current

Abstract

The application discloses a training method, apparatus, medium and device for a search result classification model, relating to the field of artificial intelligence. The method comprises the following steps: acquiring a training data set, wherein the training data set comprises search query words and search results corresponding to the search query words; determining the similarity between a search query word and a search result based on a target text matching algorithm; determining a classification label of a sample according to a preset classification interval and the similarity, wherein the sample comprises the search query word and the search result; and fine-tuning a pre-trained language model according to the sample and its classification label to obtain a search result classification model. With a search result classification model trained by the method provided in the application, the returned search results match the search query word in the similarity dimension, the search speed is improved, and the user's search needs are met.

Description

Training method, device, medium and equipment for search result classification model
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a training method, a device, a medium and equipment for a search result classification model.
Background
Artificial Intelligence (AI) is a comprehensive technique in computer science: by studying the design principles and implementation methods of various intelligent machines, it gives machines the functions of perception, reasoning and decision making. Artificial intelligence technology is a comprehensive subject covering a wide range of fields, such as natural language processing, machine learning and deep learning. As the technology develops, artificial intelligence will be applied in more fields and play an increasingly important role.
With the development of word-vector embedding technology, pre-trained text models have increasingly become a basic module of many natural language processing tasks and their upstream application scenarios. In a search engine, for example, the search query word input by the user and the content of each candidate search result are embedded through a pre-trained language model, and the relevance between the search query word and a candidate search result is determined from the similarity between the vectors. Current language models capture syntactic and semantic relations between words well, but they do not meet the requirement, in a search scenario, that search results match the search query word in similarity.
Disclosure of Invention
In order to return search results that better match the search query word in similarity, the application provides a training method, apparatus, medium and device for a search result classification model. The technical scheme is as follows:
in a first aspect, the present application provides a method for training a search result classification model, where the method includes:
acquiring a training data set, wherein the training data set comprises search query words and search results corresponding to the search query words;
determining similarity of the search query term and the search result based on a target text matching algorithm;
determining a classification label of a sample according to a preset classification interval and the similarity, wherein the sample comprises the search query word and the search result;
and fine-tuning the pre-trained language model according to the sample and the corresponding classification label to obtain a search result classification model.
In a second aspect, the present application provides an apparatus for training a search result classification model, the apparatus comprising:
the data acquisition module is used for acquiring a training data set, wherein the training data set comprises search query words and search results corresponding to the search query words;
the similarity calculation module is used for determining the similarity between the search query words and the search results based on a target text matching algorithm;
a classification label determination module, configured to determine a classification label of a sample according to a preset classification interval and the similarity, where the sample includes the search query term and the search result;
and the model fine-tuning module is used for fine-tuning the pre-trained language model according to the sample and the corresponding classification label to obtain a search result classification model.
In a third aspect, the present application provides a computer-readable storage medium, in which at least one instruction or at least one program is stored, and the at least one instruction or the at least one program is loaded and executed by a processor to implement a method for training a search result classification model according to the first aspect.
In a fourth aspect, the present application provides a computer device, which includes a processor and a memory, where at least one instruction or at least one program is stored in the memory, and the at least one instruction or the at least one program is loaded by the processor and executed to implement the method for training a search result classification model according to the first aspect.
In a fifth aspect, the present application provides a computer program product comprising computer instructions which, when executed by a processor, implement a method of training a search result classification model according to the first aspect.
The training method, the training device, the training medium and the training equipment for the search result classification model have the following technical effects:
according to the scheme provided by the application, the similarity between the search query word and the search result is calculated through a text matching algorithm, and the classification label of the sample is determined based on the similarity and the preset grading interval, wherein the sample comprises the search query word and the corresponding search result; and then, the existing pre-training language model is finely adjusted through the sample and the classification label of the sample to obtain a search result classification model, and the search result classification model can be applied to search application, so that the obtained search result is more matched with the search query word in the similarity dimension, the search requirement of a user is further met, and the user experience is improved. Specifically, the BM25 formula is rewritten in the text matching algorithm, the morpheme coverage rate and other factors are increased, and the calculation of the similarity of the morpheme weight and the similarity of the morpheme and the search result is improved, so that the text matching algorithm focuses more on the similarity of the search query word and the search result in content to meet the requirement of model application. In addition, compared with the method for training the model from the beginning, the training method for fine tuning the pre-trained language model can reduce the required training data amount, save the model training cost and improve the model training efficiency.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
In order to more clearly illustrate the technical solutions and advantages of the embodiments of the present application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a schematic diagram of an implementation environment of a training method for a search result classification model provided in an embodiment of the present application;
FIG. 2 is a schematic flow chart of a training method for a search result classification model according to an embodiment of the present disclosure;
FIG. 3 is a schematic flowchart illustrating a process for determining similarity between search query terms and search results according to an embodiment of the present application;
FIG. 4 is a flowchart illustrating a process of determining similarity between morphemes and search results according to an embodiment of the present disclosure;
FIG. 5 is a flowchart illustrating a method for determining weights of morphemes according to an embodiment of the present disclosure;
FIG. 6 is a schematic flow chart of another method for determining similarity between search query terms and search results according to an embodiment of the present application;
FIG. 7 is a schematic flow chart illustrating fine tuning of a language model according to an embodiment of the present application;
FIG. 8 is a flow chart illustrating a process for determining text features according to an embodiment of the present application;
FIG. 9 is a schematic flow chart illustrating another fine-tuning of a language model according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a training apparatus for a search result classification model according to an embodiment of the present application;
fig. 11 is a hardware structural diagram of an apparatus for implementing a training method for a search result classification model according to an embodiment of the present application.
Detailed Description
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making. Artificial intelligence technology is a comprehensive subject covering a wide range of fields, at both the hardware level and the software level. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like.
The scheme provided by the embodiment of the application relates to the technologies of Deep Learning (DL) of artificial intelligence, Natural Language Processing (NLP) and the like.
Deep Learning (DL) is a major research direction in the field of Machine Learning (ML); it was introduced into machine learning to bring the field closer to its original goal, artificial intelligence. Deep learning learns the intrinsic laws and representation levels of sample data, and the information obtained in the learning process greatly helps the interpretation of data such as text, images and sounds. Its final aim is to give machines the same analytic and learning ability as humans, able to recognize data such as text, images and sounds. Deep learning is a complex machine learning algorithm whose results in speech and image recognition far exceed earlier related techniques. Deep learning has achieved many results in search technology, data mining, machine learning, machine translation, natural language processing, multimedia learning, speech, recommendation and personalization technologies, and other related fields. It enables machines to imitate human activities such as seeing, hearing and thinking, solves many complex pattern recognition problems, and has brought great progress to artificial-intelligence-related technologies.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics; research in this field involves natural language, i.e., the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
The scheme provided by the embodiment of the application can be deployed at the cloud end, and further relates to cloud technology and the like.
Cloud technology: a hosting technology that unifies series of resources such as hardware, software and networks in a wide area network or a local area network to realize the computation, storage, processing and sharing of data. It can also be understood as a general term for the network technology, information technology, integration technology, management platform technology, application technology and the like applied on the basis of a cloud computing business model; these resources can form a resource pool and be used on demand, flexibly and conveniently. Background services of technical network systems, such as video websites, picture websites and portal websites, require a large amount of computing and storage resources. With the development of the internet industry, each article may in the future carry its own identification mark that needs to be transmitted to a background system for logical processing; data at different levels will be processed separately, and data in all industries need strong system support, which requires cloud computing as its foundation. Cloud computing is a computing model that distributes computing tasks over a resource pool formed by large numbers of computers, enabling various application systems to obtain computing power, storage space and information services as needed. The network that provides the resources is referred to as the "cloud". To users, the resources in the "cloud" appear infinitely expandable, available at any time, used on demand and paid for by use. As a basic capability provider of cloud computing, a cloud computing resource pool platform, called Infrastructure as a Service (IaaS), is established; multiple types of virtual resources are deployed in the resource pool for external clients to use. The cloud computing resource pool mainly comprises computing devices (virtualized machines, including operating systems), storage devices and network devices.
In order to meet the requirement of similarity matching between a search result and a search query word, the embodiment of the application provides a training method, a device, a medium and equipment of a search result classification model. The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. Examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In order to facilitate understanding of the technical solutions and the technical effects thereof described in the embodiments of the present application, the embodiments of the present application explain related terms:
BERT: bidirectional Encoder Representation from transforms, converter-based bi-directional encoded representations; is a pre-trained language characterization model.
Word2vec (word to vector): a family of related models used to generate word vectors, such as the continuous bag-of-words model (CBOW) and the skip-gram model. These models are shallow two-layer neural networks trained to reconstruct the linguistic context of words. After training is complete, the word2vec model can map each word to a vector, which can be used to represent word-to-word relationships.
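As an aside on usage, the following is a minimal sketch with the open-source gensim implementation of word2vec; the library choice, toy corpus and hyperparameters are illustrative assumptions, not taken from the application.

```python
from gensim.models import Word2Vec  # assumed tooling choice for illustration

# A toy corpus of pre-segmented sentences.
sentences = [["search", "query", "word"], ["search", "result", "title"]]

# sg=1 selects the skip-gram model; sg=0 would select CBOW.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

vec = model.wv["search"]  # map a word to its vector representation
```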
TF-IDF: term Frequency-Inverse Document Frequency, a commonly used weighting technique for information retrieval and data mining. TF (term frequency) is the word frequency, and IDF (inverse Document frequency) is the inverse Document frequency index.
BM25 (BM: Best Matching): a text matching algorithm derived from probabilistic relevance models and improved upon TF-IDF; sometimes called the next-generation TF-IDF.
Fine-tuning: adding a small number of task-specific parameters on top of an already trained language model (for example, adding a softmax network on top of the language model for a classification problem) and then retraining on the new corpus to complete the fine-tuning.
Referring to fig. 1, which is a schematic diagram of an implementation environment of a training method for a search result classification model according to an embodiment of the present disclosure, as shown in fig. 1, the implementation environment may at least include a client 01 and a server 02.
Specifically, the client 01 may include a smart phone, a desktop computer, a tablet computer, a notebook computer, a vehicle-mounted terminal, a digital assistant, a smart wearable device, a monitoring device, a voice interaction device, and other types of devices, and may also include software running in the devices, such as a web page provided by some service providers to the user, and applications provided by the service providers to the user. Specifically, the client 01 may be configured to collect a required training data set for a search application, where the training data set may include search query terms and search results corresponding to the search query terms, and may be obtained from a historical search log.
Specifically, the server 02 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), a big data and artificial intelligence platform, and the like. The server 02 may comprise a network communication unit, a processor and a memory, etc. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein. Specifically, the server 02 may be configured to respond to a request of the client 01, obtain a training data set, calculate similarity between a search query word and a search result based on a preset target text matching algorithm, classify the search result according to the similarity, determine a classification label of a sample, where the sample is a combination of the search query word and the search result, and finally perform fine tuning on a pre-trained language model through the sample and the classification label of the sample, to obtain a search result classification model meeting a similarity requirement in a search scene.
The embodiment of the present application can also be implemented by combining a Cloud technology, which refers to a hosting technology for unifying series resources such as hardware, software, and a network in a wide area network or a local area network to implement data calculation, storage, processing, and sharing, and can also be understood as a generic term of a network technology, an information technology, an integration technology, a management platform technology, an application technology, and the like applied based on a Cloud computing business model. Cloud technology requires cloud computing as a support. Cloud computing is a computing model that distributes computing tasks over a resource pool of large numbers of computers, enabling various application systems to obtain computing power, storage space, and information services as needed. The network that provides the resources is referred to as the "cloud". Specifically, the server 02 and the database are located in the cloud, and the server 02 may be an entity machine or a virtualization machine.
The following describes a training method of a search result classification model provided by the present application. Fig. 2 is a flowchart of a method for training a search result classification model according to an embodiment of the present application, which provides the method operation steps according to the embodiment or the flowchart, but may include more or less operation steps based on conventional or non-inventive labor. The order of steps recited in the embodiments is merely one manner of performing the steps in a multitude of orders and does not represent the only order of execution. In practice, the system or server product may be implemented in a sequential or parallel manner (e.g., parallel processor or multi-threaded environment) according to the embodiments or methods shown in the figures. Referring to fig. 2, a training method for a search result classification model provided in an embodiment of the present application may include the following steps:
s210: a training data set is obtained, wherein the training data set comprises search query words and search results corresponding to the search query words.
It will be appreciated that there may be multiple search results corresponding to a search query term, with the multiple search results comprising a set of search results corresponding to the search query term. In the embodiment of the application, the search query word and a corresponding search result are used as a training sample, and the data used for training are historical data.
S230: determining similarity of the search query term and the search result based on a target text matching algorithm.
It can be understood that current unsupervised pre-trained language models are essentially neural-network-based word embedding techniques, which capture syntactic and semantic relations between words well. A pre-trained model learns the vector representation of a word through its context, i.e., through the words that often appear together with the current word. For example, the words "Beijing" and "Shanghai" frequently appear together in the context of massive corpora, so the vectors the pre-trained model assigns them are scored as highly similar. From a semantic perspective this is unproblematic, but in a search scenario, if a user searches for "Beijing", search results about "Shanghai" are merely related, not similar; relatedness is not similarity. From the above analysis, existing pre-trained language models emphasize the contextual relatedness of words, which is unsuitable for search scenarios that emphasize similarity matching between the query (search query word) and the doc (document, or search result).
In the embodiment of the application, the classification labels of the training samples are determined by calculating the similarity between the search query words and the search results, so that the pre-trained language model is adjusted to meet the requirements of the search scenario. Similarity is not the same as relatedness; the similarity in this application emphasizes similarity of content. If the search query word is "Beijing", the search results to be displayed should all be information pointing to Beijing.
In the embodiment of the present application, the target text matching algorithm is used to calculate the similarity between the search terms and the search results on the text content, and specifically, the target text matching algorithm may be any one of a conventional character matching rule, a BM25 algorithm, a Jaccard similarity algorithm, a cosine similarity algorithm, and the like.
For example, the similarity between the search query term and the search result may be calculated by using a conventional character matching rule, and specifically, if the search result is a web page, an article, or the like, the similarity between the search query term and the text content of the search result may be calculated, where the text content may be a title or a body; if the search results are media resources such as pictures, audio, video and the like, the similarity between the search query words and the titles of the search results can be calculated.
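As one concrete example of a conventional character-level matching rule, the following is a minimal Jaccard-style sketch over character sets; the application does not prescribe a specific rule, so the function and its details are illustrative assumptions.

```python
def jaccard_char_similarity(query: str, text: str) -> float:
    """Character-level Jaccard similarity between a search query word
    and the title or body text of a search result."""
    a, b = set(query), set(text)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)  # overlap ratio of the two character sets
```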
Illustratively, considering that in the scenario of calculating the similarity between a search query word and a title, the title is often short and the inverse document frequency values of different hit words carry different weights, the similarity may be calculated with the BM25 algorithm. The BM25 algorithm is commonly used for search relevance scoring. Its main idea is: perform morpheme analysis on the Query to generate morphemes q_i; then, for a search result D, score the relevance of each morpheme q_i to D; and finally, weight and sum the relevance scores of the q_i and D to obtain the relevance score of Query and D. Specifically, the commonly used BM25 formula is shown in formula (1), reconstructed here from the surrounding variable definitions:

$$\mathrm{Score}(Q,S)=\sum_{i=1}^{n}\mathrm{IDF}(q_i)\cdot\frac{\mathrm{TF}(q_i)\cdot(k+1)}{\mathrm{TF}(q_i)+k\cdot\left(1-b+b\cdot\frac{\mathrm{Length}(S)}{\mathrm{AvgLength}(S)}\right)}\qquad(1)$$

In formula (1), Q represents the query, i.e., the search query word input by the user; S represents a document, which in this application is the title of the current document (in other application scenarios such as video search and picture search, S may also represent the title of a search result); IDF(q_i) is the inverse document frequency value of the i-th word q_i in Q, serving in formula (1) as the weight of the relevance score of q_i and S; TF(q_i) is the word frequency of the i-th word q_i of the query Q; Length(S) is the length of the current document's title, and AvgLength(S) is the average title length over the corresponding documents; the parameter b mainly adjusts the influence of the title length on the similarity, and k is an adjustment factor for the word frequency. For example, k may be set to 2 and b to 0.75.
On the basis of formula (1), the BM25 algorithm is improved: factors such as morpheme coverage are added, and the calculation of the similarity between each morpheme and the search result is refined. Since the morpheme coverage reflects the degree to which content coincides, the calculation result reflects similarity rather than mere relatedness.
In an embodiment of the present application, specifically, as shown in fig. 3, the step S230 may include the following steps:
s310: and performing morpheme analysis on the search query words, and determining one or more target morphemes in the search query words.
Specifically, performing morpheme analysis on the search query word generally means performing word segmentation to obtain a sequence of target morphemes q_1, q_2, ..., q_n. In the embodiments of the present application, the target morphemes are also referred to as hit words.
In one exemplary embodiment, English text may be split on spaces. For Chinese text, there are two main types of approaches: dictionary-based word segmentation algorithms and statistics-based word segmentation algorithms. A dictionary-based word segmentation algorithm is essentially string matching: the Chinese text to be segmented is matched against a dictionary according to a certain matching strategy, and if the matching succeeds, the text is segmented according to the matching result. A statistics-based word segmentation algorithm is essentially a sequence labeling problem: based on statistical probabilities, each character is labeled according to its position within a word, a sequence labeling result is obtained for the Chinese text, and the text is then segmented according to that result.
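As a concrete illustration of the segmentation step, the following is a minimal Python sketch; the choice of the open-source jieba segmenter is an assumption for illustration and is not named in the application.

```python
import jieba  # open-source dictionary- and statistics-based Chinese segmenter (assumed tooling)

# Morpheme analysis of a hypothetical search query word: segmentation yields the
# sequence of target morphemes (hit-word candidates) q_1, q_2, ..., q_n.
query = "北京旅游攻略"
target_morphemes = jieba.lcut(query)  # e.g. ['北京', '旅游', '攻略']

# English text can simply be split on spaces:
english_morphemes = "beijing travel guide".split()
```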
S330: and determining the target similarity of each target morpheme and the search result based on a target text matching algorithm.
In one possible implementation, as shown in fig. 4, in the case that the search query word corresponds to a plurality of search results, the step S330 may include the following steps:
s331: and determining the word frequency of the target morpheme in a plurality of search results.
Specifically, the plurality of search results may be taken as a search result set, and the word frequency represents the number of times the target morpheme appears in the titles of the search results in the set. In addition, if the similarity is calculated between the target morpheme and the text content of a search result (e.g., a document), the word frequency may be the frequency of occurrence of the target morpheme in that single search result. The word frequency formula can be adapted to the specific application category or service requirements and is not detailed here.
S333: determining a first quantity of words of the search query term and a second quantity of words of a title of the search result; the search result is any one of the plurality of search results.
Specifically, the number of target morphemes in the search query word is used as a first word quantity, the title of the search result is also subjected to word segmentation, and the obtained number of words is used as a second word quantity.
S335: and obtaining the target similarity of the target morpheme and the search result according to the word frequency, the first word quantity and the second word quantity.
For example, the target similarity between the target morpheme q_i and the search result S may be calculated as shown in formula (2), reconstructed here from the surrounding variable definitions:

$$R(q_i,S)=\frac{\mathrm{TF}(q_i)\cdot(k+1)}{\mathrm{TF}(q_i)+k\cdot\left(1-b+b\cdot\frac{\mathrm{Length}(S)}{\mathrm{Length}(Q)}\right)}\qquad(2)$$

where TF(q_i) represents the word frequency of the target morpheme q_i in the plurality of search results; Length(Q) represents the number of words in the query after word segmentation, i.e., the first word quantity; Length(S) represents the number of words in the title of the search result S, i.e., the second word quantity; the parameter b is introduced to adjust the influence of the title length of S on the target similarity, and the parameter k is introduced to adjust the influence of the word frequency TF(q_i) on the target similarity. For example, k may be set to 2 and b to 0.75.
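A minimal Python sketch of formula (2) under the definitions above; the function and parameter names are illustrative assumptions, not taken from the application.

```python
def target_similarity(tf_qi: float, first_word_qty: int, second_word_qty: int,
                      k: float = 2.0, b: float = 0.75) -> float:
    """Formula (2): target similarity R(q_i, S) of a target morpheme to a search result.

    tf_qi:           TF(q_i), word frequency of the morpheme in the search result set
    first_word_qty:  Length(Q), number of morphemes in the search query word
    second_word_qty: Length(S), number of words in the title of the search result
    """
    norm = 1 - b + b * second_word_qty / first_word_qty  # title length relative to query length
    return tf_qi * (k + 1) / (tf_qi + k * norm)
```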
S350: and determining the normalized morpheme weight of each target morpheme.
In a possible implementation manner, the weight of each target morpheme is calculated from its inverse document frequency value, and the weights are normalized so that the values fall into a relative range. Specifically, in the case that the search query word corresponds to a plurality of search results, as shown in fig. 5, the step S350 may include the following steps:
s351: determining an inverse document frequency value of each of the target morphemes in the plurality of search results.
Specifically, taking the search results as documents as an example, the IDF may be calculated as shown in formula (3), reconstructed here from the surrounding variable definitions:

$$\mathrm{IDF}(q_i)=\log\frac{N}{n(q_i)}\qquad(3)$$

where N represents the number of search results in the search result set, and n(q_i) represents the number of search results whose title contains the target morpheme.

Optionally, the IDF may also be calculated as shown in formula (4):

$$\mathrm{IDF}(q_i)=\log\frac{N-n(q_i)+0.5}{n(q_i)+0.5}\qquad(4)$$

where the 0.5 in the numerator and the denominator is mainly used for smoothing.
S353: and obtaining the sum of the inverse document frequency values of the target morphemes according to the inverse document frequency value of each target morpheme.
S355: and obtaining the normalized morpheme weight of each target morpheme according to the sum of the inverse document frequency value of each target morpheme and the inverse document frequency value.
Specifically, the sum of the inverse document frequency value and the inverse document frequency value of each target morpheme is subjected to quotient operation, and the obtained numerical value is used as the normalized morpheme weight corresponding to each target morpheme.
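A Python sketch of steps S351 to S355, using the smoothed IDF of formula (4); the helper name and the title-membership test are illustrative assumptions.

```python
import math

def normalized_morpheme_weights(target_morphemes, titles):
    """S351-S355: smoothed IDF per target morpheme, normalized by the sum of IDF values."""
    n_results = len(titles)  # N: number of search results in the set
    idf = {}
    for q in target_morphemes:
        n_q = sum(1 for title in titles if q in title)  # n(q_i): titles containing q_i
        idf[q] = math.log((n_results - n_q + 0.5) / (n_q + 0.5))  # formula (4)
    idf_sum = sum(idf.values())
    return {q: value / idf_sum for q, value in idf.items()}  # S355: quotient of IDF and sum
```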
S370: and determining the similarity between the search query word and the search result according to the normalized morpheme weight of each target morpheme and the target similarity between each target morpheme and the search result.
In one possible implementation, when calculating the similarity between the morphemes and the search result, AvgLength(S) in formula (1) is replaced by the word quantity of the search query word, and the coverage of the target morphemes in the search query word and in the title of the search result is also introduced, so that the calculation result places more emphasis on similarity in content. Specifically, as shown in fig. 6, the step S370 may include the following steps:
s371: determining a first percentage of the one or more target morphemes in the search query term.
As described above, the target morpheme may also be referred to as a hit word.
S373: determining a second percentage of the one or more target morphemes in a title of the search result.
S375: and determining the similarity between the search query word and the search result according to the first ratio, the second ratio, and the target similarity and normalized morpheme weight corresponding to each target morpheme.
Illustratively, taking the search results as documents, the modified BM25 algorithm is shown in formula (5), reconstructed here from the surrounding variable definitions:

$$\mathrm{Score}(Q,S)=\mathrm{Query\_Cover}(Q,S)\cdot\mathrm{Sen\_Cover}(Q,S)\cdot\sum_{i=1}^{n}\frac{\mathrm{IDF}(q_i)}{\sum_{j=1}^{n}\mathrm{IDF}(q_j)}\cdot R(q_i,S)\qquad(5)$$

In formula (5), Q represents the query, i.e., the search query word input by the user; S represents a document, which in this application is the title of the current document (in other application scenarios such as video search and picture search, S may also represent the title of a search result); Query_Cover(Q, S) represents the proportion of hit words among all the words in the query, i.e., the first percentage; Sen_Cover(Q, S) represents the proportion of hit words among all the words in the title, i.e., the second percentage; TF(q_i) represents the word frequency of the hit word q_i in the search result set; IDF(q_i) represents the IDF value (i.e., inverse document frequency value) of the target morpheme (i.e., hit word) q_i in the search result set; Length(S) represents the number of words in the title, i.e., the second word quantity; Length(Q) represents the number of words in the query, i.e., the first word quantity; R(q_i, S), given by formula (2), represents the similarity score of q_i and S; and IDF(q_i) divided by the sum of the IDF values represents the normalized morpheme weight of q_i. In addition, the parameter b mainly adjusts the influence of the title length on the similarity, and k is an adjustment factor for the word frequency.

The similarity score between the query and the current document title can be calculated by formula (5). Generally, the score lies between 0 and 1: the smaller the score, the more dissimilar; the closer the score is to 1, the more similar.
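Putting the pieces together, a Python sketch of formula (5); target_similarity and normalized_morpheme_weights refer to the sketches above, and all names remain illustrative assumptions.

```python
def modified_bm25_score(target_morphemes, title_tokens, tf, norm_weights,
                        k: float = 2.0, b: float = 0.75) -> float:
    """Formula (5): coverage-weighted similarity between search query word Q and title S."""
    hits = [q for q in target_morphemes if q in title_tokens]  # hit words
    if not hits:
        return 0.0
    query_cover = len(hits) / len(target_morphemes)  # Query_Cover(Q, S): first percentage
    sen_cover = len(hits) / len(title_tokens)        # Sen_Cover(Q, S): second percentage
    weighted_sum = sum(
        norm_weights[q] * target_similarity(tf[q], len(target_morphemes),
                                            len(title_tokens), k, b)
        for q in hits)                               # normalized weight times R(q_i, S)
    return query_cover * sen_cover * weighted_sum    # generally falls between 0 and 1
```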
S250: and determining a classification label of a sample according to a preset classification interval and the similarity, wherein the sample comprises the search query word and the search result.
In an embodiment of the application, based on the similarity score in the range 0 to 1, a corresponding interval division rule may be set according to manual experience or business requirements. For example, the similarity score may be divided into 3 grades for constructing the subsequent classifier model (a sketch of the label mapping follows the list):
[0, 0.3): indicates less similarity;
[0.3, 0.6): indicates general similarity;
[0.6, 1]: indicates that they are very similar.
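A minimal Python sketch of the label mapping in step S250 under the 3-grade division above; the integer label ids are an illustrative assumption.

```python
def classification_label(similarity: float) -> int:
    """Map a similarity score in [0, 1] to a classification label per the preset intervals."""
    if similarity < 0.3:
        return 0  # [0, 0.3): less similar
    if similarity < 0.6:
        return 1  # [0.3, 0.6): generally similar
    return 2      # [0.6, 1]: very similar
```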
S270: and fine-tuning the pre-trained language model according to the sample and the corresponding classification label to obtain a search result classification model.
In the embodiment of the application, the pre-trained language model refers to a general language model trained by a large amount of sample data, and in order to enable the model to meet the requirements of a specific service, a small amount of sample data related to the specific service is frequently used for training again, so that the parameters of the model are properly adjusted. Specifically, a sample is input into the pre-trained language model, a loss function is calculated according to the model output result and the classification label corresponding to the sample, and then parameters of the pre-trained language model are properly adjusted according to the loss function to obtain a final search result classification model.
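As a hedged sketch of the fine-tuning step just described, assuming PyTorch and a model that outputs class logits for a (search query word, search result) sample; the model structure itself is sketched after the fig. 9 discussion below, and all identifiers are illustrative.

```python
import torch.nn.functional as F

def fine_tune_step(model, optimizer, query_inputs, title_inputs, labels):
    """One fine-tuning step: forward pass, loss against the classification labels
    determined in S250, and a parameter adjustment."""
    logits = model(query_inputs, title_inputs)  # model output for the samples
    loss = F.cross_entropy(logits, labels)      # loss function vs. classification labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                            # properly adjust the model parameters
    return loss.item()
```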
In the embodiment of the application, the search result classification model can be applied to a search engine to accurately calculate the similarity between the search query word currently input by the user and the candidate search result corresponding to the search query word in the database, so as to classify the candidate search results, take the candidate search results with high similarity as the final search result and provide the final search result for the user, further meet the actual search requirement of the user, and improve the search experience of the user.
In an embodiment of the application, the classification function more suitable for a specific service is realized by finely adjusting the last layer of classification network in the pre-trained language model.
In a possible embodiment, as shown in fig. 7 in particular, the method may comprise the following steps:
s710: and performing feature extraction on the search query words in the sample based on a first language sub-model in the language model to obtain first text features.
S730: and performing feature extraction on the search query words in the sample based on a second language sub-model in the language model to obtain second text features.
The first language sub-model and the second language sub-model can both adopt a BERT model or both adopt a Word2Vec model to extract text features.
S750: and obtaining a third text characteristic according to the first text characteristic and the second text characteristic.
In a possible implementation manner, the text features obtained after feature extraction are represented as vectors in data types, and specifically, as shown in fig. 8, the step S750 may include the following steps:
s751: and obtaining text difference characteristics according to the first text characteristics and the second text characteristics.
Specifically, the first language sub-model extracts features from the search query word and outputs a first feature vector, which represents the first text features of the search query word; the second language sub-model extracts features from the title of the search result and outputs a second feature vector, which represents the second text features of the search result. The first feature vector and the second feature vector have the same dimension, so they can be subtracted to obtain a feature difference vector, which is used to represent the text difference features; the feature difference vector has the same dimension as the first and second feature vectors.
S753: and splicing the first text feature, the second text feature and the text difference feature to obtain a third text feature.
Specifically, the first feature vector, the second feature vector and the feature difference vector are concatenated (i.e., the dimensionality is increased) into a third feature vector representing the third text feature.
S770: and inputting the third text feature into an output layer in the language model to obtain a prediction classification label corresponding to the third text feature.
S790: and finely adjusting the language model according to the prediction classification label and the corresponding classification label to obtain a search result classification model.
For example, as shown in fig. 9, the original pre-trained language model may adopt BERT, or may adopt a word2vec model. The two sentences, i.e., the search query word (query) and the document title, are first given their initial pre-trained vector representations; the corresponding first feature vector u and second feature vector v are then obtained from the classification pooling layer ("CLS pooling" for short) output by BERT; u, v and |u-v| are spliced into a third feature vector (|u-v| represents the element-wise difference between u and v); and the third feature vector is fed into the output layer (i.e., a softmax classifier) for classification. The classification labels here may be the labels under the aforementioned three grades. By fine-tuning the model (mainly the classifier) with the predicted classification labels and the real classification labels, sentences with higher text similarity are represented in the model by closer vectors, and sentences with low text similarity by vectors farther apart. Optionally, the BERT model includes multiple Transformer layers; besides extracting vectors from the CLS pooling layer, the vectors output by the first and last Transformer layers may be averaged to form u and v.
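The following is a minimal PyTorch sketch of the fig. 9 structure, assuming the Hugging Face transformers library; the checkpoint name, the weight sharing between the two language sub-models, and all identifiers are illustrative assumptions rather than the application's prescribed implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class SearchResultClassifier(nn.Module):
    """Encode query and title separately, splice u, v and |u - v|, then classify."""

    def __init__(self, pretrained: str = "bert-base-chinese", num_labels: int = 3):
        super().__init__()
        # One shared encoder stands in for the first and second language sub-models.
        self.encoder = AutoModel.from_pretrained(pretrained)
        hidden = self.encoder.config.hidden_size
        self.classifier = nn.Linear(3 * hidden, num_labels)  # softmax output layer

    def encode(self, inputs):
        return self.encoder(**inputs).last_hidden_state[:, 0]  # CLS pooling

    def forward(self, query_inputs, title_inputs):
        u = self.encode(query_inputs)   # first feature vector (search query word)
        v = self.encode(title_inputs)   # second feature vector (search result title)
        third = torch.cat([u, v, torch.abs(u - v)], dim=-1)  # third feature vector
        return self.classifier(third)   # logits over the three similarity grades
```

In use, a tokenizer for the same checkpoint would produce query_inputs and title_inputs, and the logits would be trained with the cross-entropy step sketched earlier.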
An embodiment of the present application further provides a training apparatus 1000 for a search result classification model, as shown in fig. 10, the apparatus 1000 may include:
a data obtaining module 1010, configured to obtain a training data set, where the training data set includes search query terms and search results corresponding to the search query terms;
a similarity calculation module 1020 for determining similarity of the search query term and the search result based on a target text matching algorithm;
a classification label determining module 1030, configured to determine a classification label of a sample according to a preset classification interval and the similarity, where the sample includes the search query term and the search result;
and the model fine-tuning module 1040 is configured to perform fine-tuning on the pre-trained language model according to the sample and the corresponding classification label, so as to obtain a search result classification model.
In one embodiment of the present application, the similarity calculation module 1020 may include:
the morpheme analyzing unit is used for performing morpheme analysis on the search query words and determining one or more target morphemes in the search query words;
the morpheme similarity calculation unit is used for determining the target similarity between each target morpheme and the search result based on a target text matching algorithm;
a morpheme weight determining unit, configured to determine a normalized morpheme weight of each target morpheme;
and the similarity calculation unit is used for determining the similarity between the search query word and the search result according to the normalized morpheme weight of each target morpheme and the target similarity between each target morpheme and the search result.
In an embodiment of the application, the morpheme similarity calculation unit may include:
a word frequency determining subunit, configured to determine, when a search result set corresponding to the search query word includes one or more search results, a word frequency of the target morpheme in the search result set;
a word quantity determining subunit, configured to determine a first word quantity of the search query word and a second word quantity of a title of the search result;
and the target similarity calculation operator unit is used for obtaining the target similarity of the target morpheme and the search result according to the word frequency, the first word quantity and the second word quantity.
In an embodiment of the application, the morpheme weight determining unit may include:
an inverse document frequency value determining subunit, configured to determine an inverse document frequency value of each of the target morphemes in the search result set if the search result set corresponding to the search query term includes one or more search results;
the summation subunit is used for obtaining the sum of the inverse document frequency values of the target morphemes according to the inverse document frequency value of each target morpheme;
and the normalized morpheme weight determining subunit is used for obtaining the normalized morpheme weight of each target morpheme according to the sum of the inverse document frequency value of each target morpheme and the inverse document frequency value.
In an embodiment of the present application, the similarity calculation unit may include:
a first coverage determination subunit, configured to determine a first percentage of the one or more target morphemes in the search query term;
a second coverage rate determining subunit, configured to determine a second percentage of the one or more target morphemes in a title of the search result;
and the similarity calculation operator unit is used for determining the similarity between the search query word and the search result according to the first ratio, the second ratio, the target similarity and the normalized morpheme weight corresponding to each target morpheme.
In one embodiment of the present application, the model fine tuning module 1040 may include:
the first feature extraction unit is used for extracting features of the search query words in the sample based on a first language sub-model in the language model to obtain first text features;
the second feature extraction unit is used for extracting features of the search query words in the sample based on a second language sub-model in the language model to obtain second text features;
the third feature determining unit is used for obtaining a third text feature according to the first text feature and the second text feature;
the model prediction unit is used for inputting the third text feature into an output layer in the language model to obtain a prediction classification label corresponding to the third text feature;
and the fine tuning unit is used for fine tuning the language model according to the prediction classification label and the corresponding classification label to obtain a search result classification model.
It should be noted that, when the apparatus provided in the foregoing embodiment implements the functions thereof, only the division of the functional modules is illustrated, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the apparatus may be divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus and method embodiments provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments for details, which are not described herein again.
The embodiment of the present application provides a computer device, which includes a processor and a memory, where at least one instruction or at least one program is stored in the memory, and the at least one instruction or the at least one program is loaded and executed by the processor to implement the method for training a search result classification model provided in the above method embodiment.
Fig. 11 is a schematic hardware configuration diagram of a device for implementing a training method for a search result classification model according to an embodiment of the present application; the device may participate in forming, or may include, an apparatus or system provided in an embodiment of the present application. As shown in fig. 11, the device 10 may include one or more processors 1002 (shown as 1002a, 1002b, ..., 1002n; a processor 1002 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), a memory 1004 for storing data, and a transmission device 1006 for communication functions. The device may further include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power source, and/or a camera. It will be understood by those skilled in the art that the structure shown in fig. 11 is only an illustration and does not limit the structure of the electronic device. For example, device 10 may also include more or fewer components than shown in fig. 11, or have a different configuration than shown in fig. 11.
It should be noted that the one or more processors 1002 and/or other data processing circuitry described above may be referred to generally herein as "data processing circuitry". The data processing circuitry may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Further, the data processing circuitry may be a single, stand-alone processing module, or incorporated in whole or in part into any of the other elements in the device 10 (or mobile device). As referred to in the embodiments of the application, the data processing circuit acts as a processor control (e.g. selection of a variable resistance termination path connected to the interface).
The memory 1004 can be used for storing software programs and modules of application software, such as program instructions/data storage devices corresponding to the methods described in the embodiments of the present application, and the processor 1002 executes various functional applications and data processing by running the software programs and modules stored in the memory 1004, so as to implement the above-mentioned training method for the search result classification model. The memory 1004 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 1004 may further include memory located remotely from the processor 1002, which may be connected to the device 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 1006 is used for receiving or sending data via a network. Specific examples of such networks may include wireless networks provided by the communication provider of the device 10. In one example, the transmission device 1006 includes a network adapter (NIC) that can be connected to other network devices through a base station so as to communicate with the internet. In one example, the transmission device 1006 can be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the device 10 (or mobile device).
The present application further provides a computer-readable storage medium, which may be disposed in a server to store at least one instruction or at least one program for implementing a method for training a search result classification model in the method embodiments, where the at least one instruction or the at least one program is loaded and executed by the processor to implement a method for training a search result classification model provided in the method embodiments.
Alternatively, in this embodiment, the storage medium may be located in at least one network server of a plurality of network servers of a computer network. Optionally, in this embodiment, the storage medium may include, but is not limited to: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
Embodiments of the present invention also provide a computer program product or a computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to execute a training method of the search result classification model provided in the above-mentioned various alternative embodiments.
As can be seen from the above embodiments of the method, apparatus, medium, and device for training a search result classification model, the scheme provided in the present application works as follows. The similarity between a search query word and a search result is calculated by a text matching algorithm, and the classification label of a sample is determined from that similarity and a preset grading interval, where the sample comprises the search query word and the corresponding search result. An existing pre-trained language model is then fine-tuned with the samples and their classification labels to obtain a search result classification model. When this model is applied in a search application, the returned search results match the search query word more closely in the similarity dimension, which better satisfies the user's search requirement and improves the user experience. Specifically, the text matching algorithm rewrites the BM25 formula: it adds factors such as the morpheme coverage rate and improves the calculation of the morpheme weights and of the similarity between each morpheme and the search result, so that the algorithm focuses more on the content-level similarity between the search query word and the search result, as the model application requires. In addition, compared with training from scratch, fine-tuning a pre-trained language model reduces the amount of training data required, saves model training cost, and improves model training efficiency.
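To make the label-assignment step concrete, the following minimal sketch buckets a computed similarity score into a classification label using preset grading intervals. The interval boundaries and the three-way label scheme are hypothetical placeholders for illustration; the application does not fix concrete values.

```python
# Minimal sketch of mapping similarity scores to classification labels.
# The thresholds and the three-way label scheme are hypothetical, not
# values taken from this application.
from typing import List, Tuple

# Hypothetical preset grading intervals: (lower bound, label),
# ordered from highest to lowest bound.
GRADING_INTERVALS: List[Tuple[float, int]] = [
    (0.75, 2),  # highly similar
    (0.40, 1),  # partially similar
    (0.00, 0),  # dissimilar
]

def label_from_similarity(similarity: float) -> int:
    """Map a query/result similarity score to a classification label."""
    for lower_bound, label in GRADING_INTERVALS:
        if similarity >= lower_bound:
            return label
    return 0

# A (query, result) pair scoring 0.62 becomes a sample with label 1.
print(label_from_similarity(0.62))  # -> 1
```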
It should be noted that fine-tuning is a training mode in which a model already trained on a large amount of training data (at this point a general-purpose model, usually open source) is trained again with a portion of training data related to a specific service, so that the model better fits the requirements of that service. Generally, only this second round of training is needed, and experiments show that a small amount of service-specific training data is enough for the model's performance to meet the requirements of the specific service. Reducing the amount of training data in turn shortens training time, lowers model training cost, and improves model training efficiency.
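As an illustration of such a second round of training, the sketch below fine-tunes an open-source pre-trained language model on query/title pairs with the Hugging Face transformers library. The choice of bert-base-chinese, the three-label setup, the learning rate, and the toy batch are assumptions made for the example; the application does not prescribe a specific model or hyperparameters.

```python
# Hedged sketch of fine-tuning a pre-trained language model on
# (query, result title) pairs; model name, label count, and
# hyperparameters are illustrative assumptions.
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=3
)
optimizer = AdamW(model.parameters(), lr=2e-5)

# Toy batch: each sample pairs a search query with a result title,
# labeled with the similarity-derived classification label.
queries = ["机器学习 教程", "天气 预报"]
titles = ["机器学习入门教程", "今日菜谱大全"]
labels = torch.tensor([2, 0])

batch = tokenizer(queries, titles, padding=True, truncation=True,
                  return_tensors="pt")

model.train()
outputs = model(**batch, labels=labels)  # single fine-tuning step
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```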
The sequence of the embodiments of the present application is for description only and does not imply that one embodiment is superior to another. Specific embodiments have been described above; other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or a sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or advantageous.
The embodiments in the present application are described in a progressive manner; identical or similar parts of the embodiments may be cross-referenced, and each embodiment focuses on its differences from the others. In particular, the apparatus, device, and storage medium embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the corresponding descriptions in the method embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware, where the program may be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk, or an optical disk.
The above description is merely exemplary of the present application and is not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within its protection scope.

Claims (11)

1. A method for training a search result classification model, the method comprising:
acquiring a training data set, wherein the training data set comprises search query words and search results corresponding to the search query words;
determining similarity of the search query term and the search result based on a target text matching algorithm;
determining a classification label of a sample according to a preset classification interval and the similarity, wherein the sample comprises the search query word and the search result;
and fine-tuning the pre-trained language model according to the sample and the corresponding classification label to obtain a search result classification model.
2. The method of claim 1, wherein determining similarity of the search query term and the search result based on a target text matching algorithm comprises:
performing morpheme analysis on the search query words, and determining one or more target morphemes in the search query words;
determining the target similarity of each target morpheme and the search result based on a target text matching algorithm;
determining a normalized morpheme weight for each of the target morphemes;
and determining the similarity between the search query word and the search result according to the normalized morpheme weight of each target morpheme and the target similarity between each target morpheme and the search result.
3. The method of claim 2, wherein in a case that the search query term corresponds to a plurality of search results, the determining a target similarity of each target morpheme to the search results based on a target text matching algorithm comprises:
determining the word frequency of the target morpheme in a plurality of search results;
determining a first quantity of words of the search query term and a second quantity of words of a title of the search result; the search result is any one of the plurality of search results;
and obtaining the target similarity of the target morpheme and the search result according to the word frequency, the first word quantity and the second word quantity.
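Claim 3 names the quantities that feed the per-morpheme target similarity (the morpheme's word frequency, the query's word count, and the result title's word count) but not the exact formula. The sketch below is one plausible BM25-style reading; the saturation constants k1 and b and the query-length normalization are illustrative assumptions, not terms of the claim.

```python
# Hedged sketch of a per-morpheme target similarity in the spirit of
# claim 3; the exact combination of the three inputs is an assumption.
def target_similarity(term_freq: int, query_len: int, title_len: int,
                      k1: float = 1.2, b: float = 0.75) -> float:
    """BM25-flavoured score of one target morpheme against one result title."""
    # Normalize the title length against the query length rather than an
    # average document length -- one plausible reading, not a verbatim one.
    length_norm = 1.0 - b + b * (title_len / max(query_len, 1))
    return term_freq * (k1 + 1.0) / (term_freq + k1 * length_norm)

# Example: a morpheme appearing twice in a 10-word title for a 3-word query.
print(round(target_similarity(term_freq=2, query_len=3, title_len=10), 3))
```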
4. The method according to claim 3, wherein said determining a normalized morpheme weight for each of said target morphemes comprises:
determining an inverse document frequency value of each of the target morphemes in the plurality of search results;
obtaining the sum of the inverse document frequency values of the target morphemes according to the inverse document frequency value of each target morpheme;
and obtaining the normalized morpheme weight of each target morpheme according to the sum of the inverse document frequency value of each target morpheme and the inverse document frequency value.
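The normalization in claim 4 can be sketched as each target morpheme's inverse document frequency over the candidate result titles, divided by the sum of all the morphemes' IDF values. The smoothed IDF variant used below is an assumption; the claim does not fix the IDF formula.

```python
# Sketch of normalized morpheme weights per claim 4; the smoothed IDF
# variant is an illustrative assumption.
import math
from typing import Dict, List

def normalized_morpheme_weights(morphemes: List[str],
                                result_titles: List[str]) -> Dict[str, float]:
    n = len(result_titles)
    idf: Dict[str, float] = {}
    for m in morphemes:
        df = sum(1 for title in result_titles if m in title)  # document frequency
        idf[m] = math.log((n - df + 0.5) / (df + 0.5) + 1.0)  # smoothed IDF
    total = sum(idf.values()) or 1.0  # guard against an all-zero sum
    return {m: value / total for m, value in idf.items()}
```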
5. The method according to claim 2, wherein said determining a similarity of said search query term to said search result based on a normalized morpheme weight of each of said target morphemes and a target similarity of each of said target morphemes to said search result comprises:
determining a first percentage of the one or more target morphemes in the search query term;
determining a second percentage of the one or more target morphemes in a title of the search result;
and determining the similarity between the search query word and the search result according to the first percentage, the second percentage, and the target similarity and normalized morpheme weight corresponding to each target morpheme.
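Putting claims 2 through 5 together, the sketch below combines the two percentages with the weighted per-morpheme similarities. The claim names only the inputs; multiplying the weighted sum by both percentages is one plausible combination, so the arithmetic here is an assumption.

```python
# Hedged sketch of the final query/result similarity per claim 5;
# the multiplicative combination of the two percentages is an assumption.
from typing import Dict, List

def query_result_similarity(target_morphemes: List[str],
                            query_tokens: List[str],
                            title_tokens: List[str],
                            weights: Dict[str, float],
                            target_sims: Dict[str, float]) -> float:
    matched = [m for m in target_morphemes if m in title_tokens]
    first_percentage = len(target_morphemes) / max(len(query_tokens), 1)
    second_percentage = len(matched) / max(len(title_tokens), 1)
    weighted_sum = sum(weights[m] * target_sims[m] for m in target_morphemes)
    return weighted_sum * first_percentage * second_percentage
```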
6. The method of claim 1, wherein the fine-tuning the pre-trained language model according to the samples and the corresponding classification labels to obtain a search result classification model comprises:
extracting features of the search query words in the sample based on a first language sub-model in the language model to obtain a first text feature;
extracting features of the search results in the sample based on a second language sub-model in the language model to obtain a second text feature;
obtaining a third text feature according to the first text feature and the second text feature;
inputting the third text feature to an output layer in the language model to obtain a prediction classification label corresponding to the third text feature;
and fine-tuning the language model according to the prediction classification label and the corresponding classification label to obtain a search result classification model.
7. The method of claim 6, wherein each text feature is represented as a vector, and wherein deriving the third text feature from the first text feature and the second text feature comprises:
obtaining text difference characteristics according to the first text characteristics and the second text characteristics;
and splicing the first text feature, the second text feature and the text difference feature to obtain a third text feature.
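Claim 7's difference-and-splice step mirrors the feature combination used in sentence-pair models such as InferSent and Sentence-BERT. The sketch below takes the element-wise absolute difference |u − v| as the text difference feature, which is an assumption; the claim states only that the difference feature is derived from the first and second text features.

```python
# Sketch of claim 7's feature combination; |u - v| as the difference
# feature is an illustrative assumption.
import torch

def combine_features(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    diff = torch.abs(u - v)                 # text difference feature
    return torch.cat([u, v, diff], dim=-1)  # spliced third text feature

# Example: two 768-dim text features yield one 2304-dim pair feature.
u = torch.randn(1, 768)
v = torch.randn(1, 768)
print(combine_features(u, v).shape)  # torch.Size([1, 2304])
```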
8. An apparatus for training a search result classification model, the apparatus comprising:
the data acquisition module is used for acquiring a training data set, wherein the training data set comprises search query words and search results corresponding to the search query words;
the similarity calculation module is used for determining the similarity between the search query words and the search results based on a target text matching algorithm;
a classification label determination module, configured to determine a classification label of a sample according to a preset classification interval and the similarity, where the sample includes the search query term and the search result;
and the model fine-tuning module is used for fine-tuning the pre-trained language model according to the sample and the corresponding classification label to obtain a search result classification model.
9. A computer-readable storage medium, having at least one instruction or at least one program stored thereon, which is loaded and executed by a processor to implement a method of training a search result classification model according to any one of claims 1 to 7.
10. A computer device comprising a processor and a memory, wherein at least one instruction or at least one program is stored in the memory, and the at least one instruction or the at least one program is loaded by the processor and executed to implement a method for training a search result classification model according to any one of claims 1 to 7.
11. A computer program product comprising computer instructions which, when executed by a processor, implement a method of training a search result classification model as claimed in any one of claims 1 to 7.
CN202111086790.6A 2021-09-16 2021-09-16 Training method, device, medium and equipment for search result classification model Pending CN114281935A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111086790.6A CN114281935A (en) 2021-09-16 2021-09-16 Training method, device, medium and equipment for search result classification model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111086790.6A CN114281935A (en) 2021-09-16 2021-09-16 Training method, device, medium and equipment for search result classification model

Publications (1)

Publication Number Publication Date
CN114281935A true CN114281935A (en) 2022-04-05

Family

ID=80868595

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111086790.6A Pending CN114281935A (en) 2021-09-16 2021-09-16 Training method, device, medium and equipment for search result classification model

Country Status (1)

Country Link
CN (1) CN114281935A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104239340A (en) * 2013-06-19 2014-12-24 北京搜狗信息服务有限公司 Search result screening method and search result screening device
CN104217031A (en) * 2014-09-28 2014-12-17 北京奇虎科技有限公司 Method and device for classifying users according to search log data of server
CN104615767A (en) * 2015-02-15 2015-05-13 百度在线网络技术(北京)有限公司 Searching-ranking model training method and device and search processing method
CN106777248A (en) * 2016-12-27 2017-05-31 努比亚技术有限公司 A kind of search engine test evaluation method and apparatus
CN110162593A (en) * 2018-11-29 2019-08-23 腾讯科技(深圳)有限公司 A kind of processing of search result, similarity model training method and device
CN111382265A (en) * 2018-12-28 2020-07-07 中国移动通信集团贵州有限公司 Search method, apparatus, device and medium
CN111460264A (en) * 2020-03-30 2020-07-28 口口相传(北京)网络技术有限公司 Training method and device of semantic similarity matching model
CN112182348A (en) * 2020-11-09 2021-01-05 百度国际科技(深圳)有限公司 Semantic matching judgment method and device, electronic equipment and computer readable medium
CN112100529A (en) * 2020-11-17 2020-12-18 北京三快在线科技有限公司 Search content ordering method and device, storage medium and electronic equipment
CN112380244A (en) * 2020-12-02 2021-02-19 杭州筑龙信息技术股份有限公司 Word segmentation searching method and device, electronic equipment and readable storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115186163A (en) * 2022-06-27 2022-10-14 北京百度网讯科技有限公司 Training method and device of search result ranking model and search result ranking method and device
CN117271851A (en) * 2023-11-22 2023-12-22 北京小米移动软件有限公司 Vertical type searching method and device, searching system and storage medium

Similar Documents

Publication Publication Date Title
CN111753060A (en) Information retrieval method, device, equipment and computer readable storage medium
CN112182166B (en) Text matching method and device, electronic equipment and storage medium
CN111241237A (en) Intelligent question and answer data processing method and device based on operation and maintenance service
CN113254711B (en) Interactive image display method and device, computer equipment and storage medium
CN112989212B (en) Media content recommendation method, device and equipment and computer storage medium
CN111625715B (en) Information extraction method and device, electronic equipment and storage medium
CN112131430A (en) Video clustering method and device, storage medium and electronic equipment
CN114281935A (en) Training method, device, medium and equipment for search result classification model
CN112085120B (en) Multimedia data processing method and device, electronic equipment and storage medium
CN113705299A (en) Video identification method and device and storage medium
CN112348111A (en) Multi-modal feature fusion method and device in video, electronic equipment and medium
CN114282013A (en) Data processing method, device and storage medium
CN113590854B (en) Data processing method, data processing equipment and computer readable storage medium
CN114942994A (en) Text classification method, text classification device, electronic equipment and storage medium
CN114330483A (en) Data processing method, model training method, device, equipment and storage medium
CN116821307A (en) Content interaction method, device, electronic equipment and storage medium
CN116977701A (en) Video classification model training method, video classification method and device
CN116956183A (en) Multimedia resource recommendation method, model training method, device and storage medium
CN113408282B (en) Method, device, equipment and storage medium for topic model training and topic prediction
CN115269961A (en) Content search method and related device
CN114818727A (en) Key sentence extraction method and device
CN112749556B (en) Multi-language model training method and device, storage medium and electronic equipment
CN114282528A (en) Keyword extraction method, device, equipment and storage medium
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
CN116561350B (en) Resource generation method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination