CN113626713A - Search method, device, equipment and storage medium

Search method, device, equipment and storage medium

Info

Publication number: CN113626713A
Application number: CN202110956827.XA
Authority: CN (China)
Prior art keywords: sample data, data set, item, sub, search
Legal status: Pending (assumed; not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 王朋恺, 李辉, 陈永生, 杨林凤
Current Assignee: Beijing Cheerbright Technologies Co Ltd
Original Assignee: Beijing Cheerbright Technologies Co Ltd
Application filed by Beijing Cheerbright Technologies Co Ltd
Priority to CN202110956827.XA
Publication of CN113626713A

Classifications

All classifications fall under G (Physics) › G06 (Computing; calculating or counting) › G06F (Electric digital data processing):

    • G06F16/9535 — Information retrieval; retrieval from the web; querying, e.g. by the use of web search engines; search customisation based on user profiles and personalisation
    • G06F16/319 — Information retrieval of unstructured textual data; indexing structures; inverted lists
    • G06F16/35 — Information retrieval of unstructured textual data; clustering; classification
    • G06F18/22 — Pattern recognition; analysing; matching criteria, e.g. proximity measures
    • G06F18/23213 — Pattern recognition; non-hierarchical clustering techniques with a fixed number of clusters, e.g. K-means clustering
    • G06F40/216 — Handling natural language data; natural language analysis; parsing using statistical methods
    • G06F40/289 — Handling natural language data; phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 — Handling natural language data; semantic analysis

Abstract

The invention discloses a search method, apparatus, computing device and storage medium, wherein the method comprises the following steps: acquiring a user search log list together with a first positive sample data set and a first negative sample data set in the list; performing word segmentation on the search content sub-item and the document title sub-item, adding the first positive sample data set and the first negative sample data set to user features, and acquiring a search click rate estimation model; acquiring a second positive sample data set of the user search log list, calculating the Jaccard similarity parameter and the Cosine similarity parameter, and acquiring a third positive sample data set and a second negative sample data set of the user search log list; loading a Google general corpus and acquiring a BERT semantic similarity model; and constructing a vector index library with the Faiss framework and acquiring recalled search results. The method obtains a deep model that better represents semantics, greatly improves the recall effect by tuning the semantic similarity model, builds a compatible semantic-vector recall service, and meets the search-effect and performance requirements.

Description

Search method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of search, and in particular to a search method, apparatus, computing device and storage medium suitable for generating recalled search results in the automobile field based on semantic vectors.
Background
With the increasing popularity of the internet, document search in the automobile field has become a common technical means. Search mainly comprises two important components: recalling related documents according to the user's search content, and ranking the recall results, with the most relevant documents finally returned to the user; the recall effect determines the subsequent ranking effect. The automobile vertical-search field serves a large number of client groups at different levels, and the search effect directly influences the user's search experience and related metrics such as visits, visitors and click-through rate. To improve the search experience and increase user stickiness, the recall and ranking strategies need to be continuously optimized.
In the prior art, recall strategies mainly fall into two categories: the first recalls documents containing co-occurring words through an inverted index built on the word-segmentation result of the search content; the second recalls semantically related documents using the overall semantic information of the search content. These two methods have the following disadvantages: first, semantic similarity between different sentences and context information within a sentence are not considered, and the problems of synonymy, polysemy and word-order structure cannot be solved; second, the semantic matching effect is not ideal.
Therefore, a search method is needed that can meet the application requirements of the automobile field and greatly improve satisfaction with the recall results.
Disclosure of Invention
To this end, the present invention provides a search method, apparatus, computing device and storage medium in an effort to solve or at least mitigate at least one of the problems identified above.
According to one aspect of the present invention, there is provided a search method adapted to generate recalled search results in the automobile field based on semantic vectors, the method comprising the steps of: acquiring a user search log list and a first positive sample data set and a first negative sample data set in the user search log list, wherein the user search log list comprises a user address sub-item, a search content sub-item and a document title sub-item; performing word segmentation on the search content sub-item and the document title sub-item according to the user search log list, the first positive sample data set and the first negative sample data set to obtain a word frequency (TF) parameter, an inverse text frequency (IDF) parameter, a word frequency-inverse text product (TF-IDF) parameter, a Jaccard similarity parameter and a Cosine similarity parameter, adding the first positive sample data set and the first negative sample data set to user features, and obtaining a search click rate estimation model; acquiring a second positive sample data set of the user search log list according to the search click rate estimation model, calculating the Jaccard similarity parameter and the Cosine similarity parameter according to the first positive sample data set and the second positive sample data set, and acquiring a third positive sample data set and a second negative sample data set of the user search log list; loading a Google general corpus according to the first negative sample data set and the third positive sample data set, setting the longest text sequence length, the learning rate and the training round parameters, and acquiring a BERT semantic similarity model; and according to the BERT semantic similarity model, constructing a vector index library using the Faiss framework and acquiring recalled search results.
Optionally, the step of obtaining the user search log list and the first positive sample data set and the first negative sample data set in the user search log list includes: acquiring a user search log list, wherein the user search log list comprises a user address sub-item, a search content sub-item and a document title sub-item; according to the user search log list, dividing the user search log list into a first user search log list, a second user search log list and a third user search log list, wherein the first user search log list is used for recording user address sub-items, display results of search content sub-items and corresponding documents, the second user search log list is used for recording user address sub-items, click results of search content sub-items and corresponding documents, and the third user search log list is used for recording user information; according to the first user search log list, the second user search log list and the third user search log list, aggregating the search content sub-items to obtain a document click rate corresponding to a certain search content sub-item in a set time interval; setting a document click rate threshold value according to the document click rate; when the document click rate corresponding to a certain search content sub-item exceeds a document click rate threshold value, acquiring a first positive sample data set, wherein the first positive sample data set comprises a document title sub-item and a document corresponding to the search content sub-item; and when the document click rate corresponding to a certain search content sub-item does not exceed the document click rate threshold value, acquiring a first negative sample data set, wherein the first negative sample data set comprises the document title sub-item and the document corresponding to the search content sub-item.
Optionally, the step of performing word segmentation processing on the search content sub-item and the document title sub-item according to the user search log list, the first positive sample data set and the first negative sample data set to obtain a word frequency parameter, an inverse text frequency parameter, a product parameter of the word frequency and the inverse text, a Jaccard similarity parameter, and a Cosine similarity parameter includes: performing word segmentation processing on the search content sub-items and the document title sub-items to obtain search content sub-item word segmentation, document title sub-item word segmentation and word segmentation words of the sum of the search content sub-item word segmentation and the document title sub-item word segmentation; calculating the times of the word segmentation words in the first positive sample data set and the first negative sample data set, and dividing the times by the total number of the word segmentation words in the first positive sample data set and the first negative sample data set to obtain word frequency parameters; calculating the total number of documents in a user search log list, and dividing the total number of documents by the number of documents with word segmentation words to obtain an inverse text frequency parameter; calculating the product of the word frequency parameter and the inverse text frequency parameter to obtain a product parameter of the word frequency and the inverse text; calculating the intersection of the search content sub-item participle and the document title sub-item participle, and dividing the intersection by the union of the search content sub-item participle and the document title sub-item participle to obtain a Jaccard similarity parameter; vectorizing the search content sub-item participle and the document title sub-item participle through the product parameter of the word frequency and the inverse text, calculating the Cosine distance of the text vector of the search content sub-item participle and the document title sub-item participle, and obtaining the Cosine similarity parameter.
Optionally, the step of adding the first positive sample data set and the first negative sample data set to the user characteristics to obtain the search click rate estimation model includes: adding the first positive sample data set and the first negative sample data set to the user characteristics to obtain deep model training data; and carrying out deep model training according to the deep model training data to obtain the search click rate estimation model.
Optionally, the step of obtaining a second positive sample data set of the user search log list according to the search click rate prediction model includes: according to the search click rate estimation model, carrying out click rate prediction on the documents in the user search log list to obtain the click rate of each document; and setting a document click rate threshold, and acquiring a second positive sample data set of the user search log list when the click rate of the document in the user search log list is greater than the set document click rate threshold.
Optionally, the step of calculating the Jaccard similarity parameter and the Cosine similarity parameter according to the first positive sample data set and the second positive sample data set, and acquiring a third positive sample data set and a second negative sample data set of the user search log list includes: acquiring the deduplicated data of the first positive sample data set and the second positive sample data set; calculating the Jaccard similarity parameter and the Cosine similarity parameter of the search content sub-item and the document title sub-item in the data after the duplication removal; setting a similarity threshold, when the Jaccard similarity parameter and the Cosine similarity parameter of the search content sub-item and the document title sub-item are simultaneously greater than the similarity threshold, acquiring a third positive sample data set of the user search log list, and when the Jaccard similarity parameter and the Cosine similarity parameter of the search content sub-item and the document title sub-item are not greater than the similarity threshold, acquiring a second negative sample data set of the user search log list.
Optionally, the step of constructing the vector index library using the Faiss framework includes: constructing, over the full amount of resources, a keyword-based inverted index and a semantic-vector-based vector index using the Faiss framework.
According to another aspect of the present invention, a search apparatus is disclosed, adapted to generate recalled search results in the automobile field based on semantic vectors, the apparatus comprising:
the data acquisition module is used for acquiring a user search log list and a first positive sample data set and a first negative sample data set in the user search log list, wherein the user search log list comprises a user address sub-item, a search content sub-item and a document title sub-item;
the model generation module is used for carrying out word segmentation on the search content sub-items and the document title sub-items according to the user search log list, the first positive sample data set and the first negative sample data set, acquiring a word frequency parameter, an inverse text frequency parameter, a product parameter of the word frequency and the inverse text, a Jaccard similarity parameter and a Cosine similarity parameter, adding the first positive sample data set and the first negative sample data set into user characteristics, and acquiring a search click rate prediction model; acquiring a second positive sample data set of the user search log list according to the search click rate pre-estimation model, calculating a Jaccard similarity parameter and a Cosine similarity parameter according to the first positive sample data set and the second positive sample data set, and acquiring a third positive sample data set and a second negative sample data set of the user search log list; loading Google general corpus according to the first negative sample data set and the third positive sample data set, setting the longest sequence length, the learning rate and the training round parameters of the text, and acquiring a BERT semantic similarity model;
and the recall result module is used for constructing a vector index library by using a Faiss frame according to the BERT semantic similarity model and acquiring a recalled search result.
According to yet another aspect of the present invention, there is provided a computing device comprising: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any of the search methods described above.
According to a further aspect of the invention, there is provided a computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the search methods described above.
According to the search scheme of the invention, a user search log list and a first positive sample data set and a first negative sample data set in the user search log list are obtained; word segmentation is performed on the search content sub-items and the document title sub-items to obtain the word frequency parameter, the inverse text frequency parameter, the word frequency-inverse text product parameter, the Jaccard similarity parameter and the Cosine similarity parameter, and the first positive sample data set and the first negative sample data set are added to the user features to obtain a search click rate estimation model; a second positive sample data set of the user search log list is obtained according to the search click rate estimation model, the Jaccard similarity parameter and the Cosine similarity parameter are calculated according to the first positive sample data set and the second positive sample data set, and a third positive sample data set and a second negative sample data set of the user search log list are obtained; the Google general corpus is loaded according to the first negative sample data set and the third positive sample data set, the longest text sequence length, the learning rate and the training round parameters are set, and a BERT semantic similarity model is obtained; and according to the BERT semantic similarity model, a vector index library is constructed using the Faiss framework and recalled search results are obtained. The scheme obtains a deep model that better represents semantics, greatly improves the recall effect by tuning the semantic similarity model, builds a compatible semantic-vector recall service, and meets the search-effect and performance requirements.
Drawings
To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings, which are indicative of various ways in which the principles disclosed herein may be practiced, and all aspects and equivalents thereof are intended to be within the scope of the claimed subject matter. The above and other objects, features and advantages of the present disclosure will become more apparent from the following detailed description read in conjunction with the accompanying drawings. Throughout this disclosure, like reference numerals generally refer to like parts or elements.
FIG. 1 shows a schematic diagram of a configuration of a computing device 100 according to one embodiment of the invention;
FIG. 2 shows a flow diagram of a search method 200 according to one embodiment of the invention; and
FIG. 3 shows a schematic structural diagram of a search apparatus 300 according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Fig. 1 is a block diagram of an example computing device 100. In a basic configuration 102, computing device 100 typically includes system memory 106 and one or more processors 104. A memory bus 108 may be used for communication between the processor 104 and the system memory 106.
Depending on the desired configuration, the processor 104 may be any type of processor, including but not limited to: a microprocessor (μP), a microcontroller (μC), a Digital Signal Processor (DSP), or any combination thereof. The processor 104 may include one or more levels of cache, such as a level one cache 110 and a level two cache 112, a processor core 114, and registers 116. The example processor core 114 may include an Arithmetic Logic Unit (ALU), a Floating Point Unit (FPU), a digital signal processing core (DSP core), or any combination thereof. The example memory controller 118 may be used with the processor 104, or in some implementations the memory controller 118 may be an internal part of the processor 104.
Depending on the desired configuration, system memory 106 may be any type of memory, including but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. System memory 106 may include an operating system 120, one or more applications 122, and program data 124. In some embodiments, the application 122 may be arranged to operate with the program data 124 on the operating system. In some embodiments, the computing device 100 is configured to perform the search method 200, and the program data 124 contains instructions for executing the method 200.
Computing device 100 may also include an interface bus 140 that facilitates communication from various interface devices (e.g., output devices 142, peripheral interfaces 144, and communication devices 146) to the basic configuration 102 via the bus/interface controller 130. The example output devices 142 include a graphics processing unit 148 and an audio processing unit 150, which may be configured to facilitate communication with various external devices, such as a display or speakers, via one or more A/V ports 152. Example peripheral interfaces 144 may include a serial interface controller 154 and a parallel interface controller 156, which may be configured to facilitate communication with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner, etc.) via one or more I/O ports 158. An example communication device 146 may include a network controller 160, which may be arranged to facilitate communications with one or more other computing devices 162 over a network communication link via one or more communication ports 164.
A network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, or program modules in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" may be a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired or dedicated-wired network, and various wireless media such as acoustic, Radio Frequency (RF), microwave, Infrared (IR), or other wireless media. The term computer readable media as used herein may include both storage media and communication media. In some embodiments, one or more programs are stored in a computer-readable medium and include instructions for performing certain methods, such as the search method 200 performed by the computing device 100 according to embodiments of the present invention.
Computing device 100 may be implemented as part of a small-sized portable (or mobile) computing device, such as a cellular telephone, a Personal Digital Assistant (PDA), a personal media player device, a wireless web-watch device, a personal headset device, an application-specific device, or a hybrid device that includes any of the above functions. Computing device 100 may also be implemented as a personal computer including both desktop and notebook computer configurations.
In the prior art, the search methods mainly include the following. The first is matching based on co-occurring words: a stop-word list is compiled from a large-scale corpus, a word-segmentation tool is used to segment the titles and contents of the texts in the corpus, stop words are filtered out, and an inverted index is constructed with each segmented word as the key and the document id as the value. On the recall side, the user's search content is segmented with the same tool, related results are recalled from the inverted index for each segmented word, and the intersection of the per-word recall results is taken as the final result. Segmentation may use methods such as a jieba tool loaded with a vertical-domain dictionary, a general bigram word model, or a maximum word-matching algorithm.

The disadvantage of this method is that it considers neither the semantic similarity between different sentences nor the context information within a sentence, and cannot solve the problems of synonymy, polysemy and word-order structure. In practical applications, matching at the vocabulary level still has great limitations, mainly reflected in three aspects. First, the synonymy and polysemy of language: many different words express the same meaning, for example "how much money", "selling price" and "guide price" all express "price"; and the same word has different meanings in different contexts, for example "Discovery" is both a verb and a luxury SUV model under the Land Rover brand. Second, the word-order problem: sentences constructed from the same words in different orders can express different, even completely opposite, semantics, for example "the BMW X3 is more cost-effective than the Audi A4L" and "the Audi A4L is more cost-effective than the BMW X3" express completely opposite meanings. Third, the segmented words of two texts may fail to co-occur at all, so that nothing is recalled even though the semantic similarity is high, for example "what are the various car instrument symbols?" versus "a complete guide to car instrument symbols", or "solve the problem of the loud door-closing sound" versus "the door-closing sound is loud, is there a method?".
The second is matching at the semantic level, using methods such as word2vec, SimNet and DSSM, which mainly comprise the two steps of semantic representation and similarity calculation. word2vec is a word-embedding method that converts words into numerical vectors: the sentences in the corpus are segmented to obtain words, a word vector is obtained for each word, a maximum operation is taken over the word vectors of all the words in each vector dimension, and the result is the semantic vector of the whole sentence. SimNet and DSSM both belong to supervised end-to-end deep semantic matching models, in which the search content and the document each learn their own semantic representation and are finally fused for matching; semantics are expressed with implicit continuous vectors, and the models can be abstracted into the three-layer structure of input layer, representation layer and matching layer. The SimNet representation layer mainly takes network-structure forms such as BOW, CNN and RNN, and DSSM extends to CNN-DSSM and LSTM-DSSM according to the representation-layer structure. Exposure and click data are obtained by analyzing the search logs to produce a weakly labeled data set, the search content and article titles are converted into vector form by the representation layer, and the model is trained with the cosine distance of the semantic vectors.

The disadvantage of this method is that the semantic matching effect is not ideal, especially for the semantic recall of long-tail and short search content. Word2Vec, as an unsupervised word-vector generation method, mainly includes the CBOW and Skip-Gram modes; in the automobile vertical domain, sentences about car series of the same class appear in similar patterns and therefore obtain close semantics, such as "BMW 5 Series" and "Mercedes-Benz E-Class". The deep semantic matching models improve step by step from BOW to CNN to RNN in how much context information they consider: the BOW model considers neither word order nor context, and its effect is poor; CNN obtains local context information with convolution layers and global context information with pooling layers, but cannot relate words separated by longer distances, so its effect is limited; RNN, based on the LSTM network structure, mainly solves the long-term dependence problem, but usually requires a large number of manually labeled samples for training to achieve a good effect, which is currently difficult to satisfy.
FIG. 2 shows a flow diagram of a search method 200 according to one embodiment of the invention. As shown in FIG. 2, the method 200 is adapted to generate recalled search results in the automobile field based on semantic vectors. The method 200 begins with step S210: obtaining a user search log list, which includes a user address sub-item, a search content sub-item and a document title sub-item, together with a first positive sample data set and a first negative sample data set in the user search log list.
Specifically, the search log list covers all documents searched by users. The user address sub-item contains the address information of the user; the search content sub-item contains all words or sentences used for searching; and the document title sub-item contains the title contents of all articles displayed after a search. The first positive sample data set comprises the documents whose click rate in the user search log list exceeds a certain threshold, and the first negative sample data set comprises the documents whose click rate is below the threshold. The two sets thus screen out which documents were clicked and which were not: a high click rate indicates that a document carries more useful information and is browsed more, while a low click rate indicates that it carries less useful information or is browsed less.
Specifically, according to the embodiment of the present invention, the step of obtaining the user search log list and the first positive sample data set and the first negative sample data set in the user search log list includes:
acquiring a user search log list, wherein the user search log list comprises a user address sub-item, a search content sub-item and a document title sub-item;
dividing the user search log list into a first user search log list, a second user search log list and a third user search log list according to the user search log list, wherein the first user search log list is used for recording the user address sub-item, the display results of the search content sub-item and the corresponding documents, the second user search log list is used for recording the user address sub-item, the click results of the search content sub-item and the corresponding documents, and the third user search log list is used for recording user information including gender, age, preferred brand, manufacturer, vehicle family, vehicle type and the like;
according to the first user search log list, the second user search log list and the third user search log list, aggregating the search content sub-items to obtain the document click rate corresponding to a certain search content sub-item within a set time interval. Specifically, by associating the first user search log list with the second user search log list, it can be determined which documents were displayed and clicked, and which were displayed but not clicked, under a certain search content sub-item; by further associating the third user search log list, the relevant information of the searching user can be obtained. Using the three lists and aggregating by search content sub-item, the document click rate for searches with a certain sub-item within a set time period can be mined, where the document click rate is the number of times the document was clicked divided by the number of times it was displayed in the search results (a code sketch of this aggregation and labeling follows these steps).
Setting a document click rate threshold value according to the document click rate; specifically, the document click rate threshold is a reference value for judging the probability of the document being clicked, that is, if the document click rate is greater than the document click rate threshold, the probability of the document being clicked is relatively high, otherwise, the probability of the document being clicked is relatively low;
when the document click rate corresponding to a certain search content sub-item exceeds the document click rate threshold value, acquiring a first positive sample data set, wherein the first positive sample data set comprises a document title sub-item and a document corresponding to the search content sub-item;
and when the document click rate corresponding to a certain search content sub-item does not exceed the document click rate threshold value, acquiring a first negative sample data set, wherein the first negative sample data set comprises a document title sub-item and a document corresponding to the search content sub-item.
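For illustration, a minimal pandas-based sketch of this log aggregation and sample labeling step is given below. The log schemas, column names and the click-rate threshold are illustrative assumptions, not values disclosed by the patent:

```python
import pandas as pd

# Assumed log schemas (column names are illustrative):
#   impressions: query, doc_id, doc_title   (first list: display results)
#   clicks:      query, doc_id              (second list: click results)

def label_first_samples(impressions: pd.DataFrame,
                        clicks: pd.DataFrame,
                        ctr_threshold: float = 0.1):
    """Aggregate per (query, document) pair within the time window and
    split into the first positive / first negative sample data sets."""
    shown = (impressions.groupby(["query", "doc_id", "doc_title"])
                        .size().rename("n_shown").reset_index())
    clicked = (clicks.groupby(["query", "doc_id"])
                     .size().rename("n_clicked").reset_index())
    stats = shown.merge(clicked, on=["query", "doc_id"], how="left").fillna(0)
    # Document click rate = number of clicks / number of displays.
    stats["ctr"] = stats["n_clicked"] / stats["n_shown"]
    positive = stats[stats["ctr"] > ctr_threshold]   # first positive sample data set
    negative = stats[stats["ctr"] <= ctr_threshold]  # first negative sample data set
    return positive, negative
```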
Through step S220, according to the user search log list, the first positive sample data set and the first negative sample data set, performing word segmentation on the search content sub-item and the document title sub-item, obtaining a word frequency parameter, an inverse text frequency parameter, a product parameter of word frequency and inverse text, a Jaccard similarity parameter, and a Cosine similarity parameter, adding the first positive sample data set and the first negative sample data set to a user characteristic, and obtaining a search click rate prediction model.
Specifically, the above steps filter the samples. For example, because accidental clicks exist and the resource titles of some searched documents do not match, the semantic similarity between a clicked search content sub-item and its document title sub-item may be low; the literal matching similarity is therefore calculated with the Jaccard similarity parameter, the Cosine similarity parameter and the like, and a threshold is set to filter the clicked user search log list data under each search content sub-item. Then the samples are expanded: with the click rates arranged in the user search log list serving as samples, the deep search model performs document click rate estimation on the search content sub-items and document title sub-items, further expanding the sample volume of the deep model and reducing the influence of the click-sparsity problem as much as possible.
Specifically, according to the embodiment of the present invention, the step of performing word segmentation on the search content sub-item and the document title sub-item according to the user search log list, the first positive sample data set and the first negative sample data set to obtain a word frequency parameter, an inverse text frequency parameter, a word frequency-inverse text product parameter, a Jaccard similarity parameter, and a Cosine similarity parameter includes:
performing word segmentation processing on the search content sub-items and the document title sub-items to obtain the search content sub-item word segmentation, the document title sub-item word segmentation, and the word segmentation words of the sum of the two; specifically, word segmentation divides a sentence into a plurality of independent words, for example: the sentence "solve the problem of the loud door-closing sound", after the stop words are removed, is divided into the words "solve, door, close door, sound, loud, problem".
Calculating the times of the word segmentation words in the first positive sample data set and the first negative sample data set, and dividing the times by the total number of the word segmentation words in the first positive sample data set and the first negative sample data set to obtain word frequency parameters;
calculating the total number of documents in the user search log list, and dividing the total number of documents by the number of documents with the word segmentation words to obtain an inverse text frequency parameter;
calculating the product of the word frequency parameter and the inverse text frequency parameter to obtain a product parameter of the word frequency and the inverse text;
calculating the intersection of the search content sub-item participle and the document title sub-item participle, and dividing the intersection by the union of the search content sub-item participle and the document title sub-item participle to obtain a Jaccard similarity parameter; specifically, the calculation formula of the Jaccard similarity parameter is as follows:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

where A denotes the search content sub-item participle set, B denotes the document title sub-item participle set, J(A, B) denotes the Jaccard similarity parameter, A ∩ B denotes the intersection of the search content sub-item participle and the document title sub-item participle, and A ∪ B denotes their union.
Vectorizing the search content sub-item participle and the document title sub-item participle through the product parameter of the word frequency and the inverse text, the Cosine distance of the two text vectors is calculated to obtain the Cosine similarity parameter. Specifically, in one embodiment: the search content sub-item is "solve the problem of the loud door-closing sound", which after stop-word removal is segmented into "solve, door, close door, sound, loud, problem"; the document title sub-item is "the door-closing sound is loud, is there a method", which after stop-word removal is segmented into "door, close door, sound, loud, method". Calculating the word frequency-inverse text product parameter value corresponding to each participle gives the text vector [0.1, 0.3, 0.1, 0.4, 0.2, 0.3] for the search content sub-item and [0.3, 0.1, 0.1, 0.2, 0.3, 0.1] for the document title sub-item. The formula for calculating the Cosine distance of the two text vectors is:
$$\cos(x, y) = \frac{\sum_{i} x_i y_i}{\sqrt{\sum_{i} x_i^2}\,\sqrt{\sum_{i} y_i^2}}$$

where x denotes the text vector of the search content sub-item participle, y denotes the text vector of the document title sub-item participle, and i indexes the i-th value in the text vector.
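A minimal Python sketch of the segmentation and the two similarity parameters described above, assuming the jieba segmentation tool mentioned in the background and an externally precomputed IDF table; the stop-word set and helper names are illustrative:

```python
import math
import jieba  # word-segmentation tool; a vertical-domain dictionary can be loaded

STOP_WORDS = {"的", "了", "吗"}  # illustrative stop-word list

def segment(text: str) -> list[str]:
    return [w for w in jieba.lcut(text) if w not in STOP_WORDS]

def jaccard(a: list[str], b: list[str]) -> float:
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def tfidf_vector(tokens: list[str], vocab: list[str], idf: dict[str, float]) -> list[float]:
    # TF = occurrences of the word / total number of words, as described above.
    tf = {w: tokens.count(w) / len(tokens) for w in set(tokens)}
    return [tf.get(w, 0.0) * idf.get(w, 0.0) for w in vocab]

def cosine(x: list[float], y: list[float]) -> float:
    dot = sum(xi * yi for xi, yi in zip(x, y))
    nx = math.sqrt(sum(xi * xi for xi in x))
    ny = math.sqrt(sum(yi * yi for yi in y))
    return dot / (nx * ny) if nx and ny else 0.0

# For the example vectors above, cosine([0.1, 0.3, 0.1, 0.4, 0.2, 0.3],
# [0.3, 0.1, 0.1, 0.2, 0.3, 0.1]) evaluates to roughly 0.76.
```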
Specifically, according to the embodiment of the present invention, the step of adding the first positive sample data set and the first negative sample data set to the user characteristics to obtain the search click rate prediction model includes:
adding the first positive sample data set and the first negative sample data set into user characteristics to obtain deep model training data; specifically, the user characteristics include: word frequency parameter, inverse text frequency parameter, word frequency and inverse text product parameter, Jaccard similarity parameter, Cosine similarity parameter, user gender, user age, whether preferred vendors are in the search content sub-item, whether preferred brands are in the search content sub-item, whether preferred vehicle family is in the search content sub-item, whether preferred vehicle type is in the search content sub-item, whether preferred vendors are in the document title sub-item, whether preferred brands are in the document title sub-item, whether preferred vehicle family is in the document title sub-item, whether preferred vehicle type is in the document title sub-item.
And carrying out deep model training according to the deep model training data to obtain the search click rate estimation model. Specifically, in the embodiment of the present application, the deep model is DeepFM, and the search click rate estimation model is obtained by training the DeepFM model.
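Since the patent does not disclose the exact network configuration, the following is a compact, illustrative DeepFM-style sketch in PyTorch (the field layout, embedding size and DNN tower are assumptions), combining a first-order term, the FM pairwise-interaction term and a deep tower into a click-rate estimate:

```python
import torch
import torch.nn as nn

class DeepFM(nn.Module):
    """Minimal DeepFM: linear (first-order) term + FM second-order
    interactions + a small DNN tower, combined through a sigmoid."""
    def __init__(self, field_dims: list[int], embed_dim: int = 8):
        super().__init__()
        self.embeddings = nn.ModuleList(nn.Embedding(d, embed_dim) for d in field_dims)
        self.linear = nn.ModuleList(nn.Embedding(d, 1) for d in field_dims)
        self.bias = nn.Parameter(torch.zeros(1))
        self.dnn = nn.Sequential(
            nn.Linear(len(field_dims) * embed_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_fields) of categorical feature indices
        embs = torch.stack([emb(x[:, i]) for i, emb in enumerate(self.embeddings)], dim=1)
        first = self.bias + sum(lin(x[:, i]) for i, lin in enumerate(self.linear))
        # FM second-order term: 0.5 * ((sum of embeddings)^2 - sum of squares)
        square_of_sum = embs.sum(dim=1) ** 2
        sum_of_square = (embs ** 2).sum(dim=1)
        fm = 0.5 * (square_of_sum - sum_of_square).sum(dim=1, keepdim=True)
        deep = self.dnn(embs.flatten(start_dim=1))
        return torch.sigmoid(first + fm + deep).squeeze(1)  # estimated click rate
```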
Through the step S230, a second positive sample data set of the user search log list is obtained according to the search click rate pre-estimation model, the Jaccard similarity parameter and the Cosine similarity parameter are calculated according to the first positive sample data set and the second positive sample data set, and a third positive sample data set and a second negative sample data set of the user search log list are obtained.
Specifically, according to the embodiment of the present invention, the step of obtaining a second positive sample data set of the user search log list according to the search click rate estimation model includes:
according to the search click rate estimation model, carrying out click rate prediction on the documents in the user search log list to obtain the click rate of each document;
setting a document click rate threshold, and acquiring a second positive sample data set of the user search log list when the click rate of the document in the user search log list is greater than the set document click rate threshold. Through the second positive sample data set, the documents with fewer clicks can be removed, and the documents with more clicks can be obtained.
Specifically, according to the embodiment of the present invention, the step of calculating the Jaccard similarity parameter and the Cosine similarity parameter according to the first positive sample data set and the second positive sample data set, and acquiring the third positive sample data set and the second negative sample data set of the user search log list includes:
acquiring the deduplicated data of the first positive sample data set and the second positive sample data set; specifically, since the first positive sample data set is obtained from click rate statistics over the documents in the user search log list, while the second positive sample data set is obtained by predicting those documents with the search click rate estimation model, the two sets partially overlap; the repeated data therefore needs to be deduplicated by the {search content sub-item, document title sub-item} key, keeping the sample data with the highest predicted search click rate.
Calculating the Jaccard similarity parameter and the Cosine similarity parameter of the search content sub-item and the document title sub-item in the data after the duplication removal;
setting a similarity threshold, when the Jaccard similarity parameter and the Cosine similarity parameter of the search content sub-item and the document title sub-item are simultaneously greater than the similarity threshold, acquiring a third positive sample data set of a user search log list, and when the Jaccard similarity parameter and the Cosine similarity parameter of the search content sub-item and the document title sub-item are not greater than the similarity threshold, acquiring a second negative sample data set of the user search log list.
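A hedged pandas sketch of this deduplication and threshold split; the column names (predicted_ctr, jaccard, cosine) and the similarity threshold are illustrative assumptions, with the two similarity parameters taken as already computed per the formulas above:

```python
import pandas as pd

def build_third_sets(first_pos: pd.DataFrame, second_pos: pd.DataFrame,
                     sim_threshold: float = 0.3):
    """Merge the two positive sets, deduplicate on {query, doc_title},
    then split by literal similarity into the third positive and
    second negative sample data sets."""
    merged = pd.concat([first_pos, second_pos], ignore_index=True)
    # Keep the sample with the highest predicted click rate per pair.
    merged = (merged.sort_values("predicted_ctr", ascending=False)
                    .drop_duplicates(subset=["query", "doc_title"], keep="first"))
    mask = (merged["jaccard"] > sim_threshold) & (merged["cosine"] > sim_threshold)
    return merged[mask], merged[~mask]  # third positive set, second negative set
```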
Through step S240, the Google general corpus is loaded according to the first negative sample data set and the third positive sample data set, the longest text sequence length, the learning rate and the training round parameters are set, and the BERT semantic similarity model is obtained. By utilizing the context information of the documents and training a semantic similarity model with the BERT network structure, whether two sentences are semantically similar can be effectively distinguished.
Specifically, each input of the BERT semantic similarity model is obtained by adding the corresponding word feature, segment feature and position feature. The BERT semantic similarity model is a deep bidirectional semantic representation model based on the Transformer network structure; the Transformer is a bidirectional deep model based on the self-attention mechanism that can make full use of the context information in a text, with its left half being the encoder and its right half the decoder. The network structure of each basic unit comprises four parts: a multi-head attention mechanism, add-and-normalize, a feed-forward neural network, and add-and-normalize. The token encoding dimension of the BERT semantic similarity model is 384, the total number of Transformer layers is 12, the number of self-attention heads is 12, and the hidden-layer dimension of the feed-forward network is 1536.
Specifically, in the embodiment of the present application, the longest text sequence length is 100 and the learning rate is 0.000002 (2e-6); the learning-rate parameter can be adjusted according to the convergence speed and effect. The training round parameter is 50, where each round (epoch) means that all data in the training set passes through the network structure once and participates in the gradient updates of the model.
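A minimal fine-tuning sketch with the Hugging Face transformers library, using the hyperparameters above (maximum sequence length 100, learning rate 2e-6); the bert-base-chinese checkpoint merely stands in for the pretrained model loaded from the Google general corpus and is an assumption:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "bert-base-chinese"  # illustrative stand-in for the pretrained checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-6)

def train_step(queries: list[str], titles: list[str], labels: list[int]) -> float:
    # Sentence pairs are packed as [CLS] query [SEP] title [SEP],
    # truncated to the longest sequence length of 100 set above.
    batch = tokenizer(queries, titles, truncation=True, max_length=100,
                      padding=True, return_tensors="pt")
    out = model(**batch, labels=torch.tensor(labels))
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

# One training round (epoch) = one pass of the whole training set through
# the network; the embodiment above uses 50 rounds.
```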
Through step S250, a vector index library is constructed using a Faiss framework according to the BERT semantic similarity model, and a recalled search result is obtained.
Specifically, according to the embodiment of the present invention, the step of constructing the vector index library using the Faiss framework includes: constructing, over the full amount of resources, a keyword-based inverted index and a semantic-vector-based vector index using the Faiss framework.
Specifically, Faiss is a framework for efficient similarity search and clustering of vectors, open-sourced by Facebook AI, and provides several retrieval methods: IndexFlat, a brute-force search that is the most accurate but very slow to query, generally used as a baseline; IndexIVFFlat, which after K-Means clustering searches only within certain cluster centers through inverted lists, balancing performance and effect; and IndexIVFPQ, which compresses the vectors when storing them and performs approximate retrieval.
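A short Faiss sketch of building and querying an IVF vector index of the kind described; the 384-dimensional vectors match the token encoding dimension above, while nlist, nprobe and the random vectors are illustrative placeholders for the BERT sentence vectors:

```python
import faiss
import numpy as np

d = 384          # vector dimension, matching the encoding dimension above
nlist = 100      # number of K-Means cluster centers (illustrative)

# Placeholder document vectors; in practice these come from the BERT model.
doc_vectors = np.random.rand(10000, d).astype("float32")
faiss.normalize_L2(doc_vectors)  # so inner product equals cosine similarity

quantizer = faiss.IndexFlatIP(d)                 # coarse quantizer
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(doc_vectors)                         # learn the cluster centers
index.add(doc_vectors)                           # fill the inverted lists

query = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(query)
index.nprobe = 10                        # clusters probed at query time
scores, ids = index.search(query, 10)    # top-10 recalled documents
```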
It should be understood that, although the steps in the flowchart of FIG. 2 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited in order and may be performed in other orders. Moreover, at least a portion of the steps in FIG. 2 may include multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different moments, and these sub-steps or stages are not necessarily performed sequentially, but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 3, there is provided a search apparatus 300, the apparatus 300 comprising: the device comprises a data acquisition module, a model generation module and a recall result module.
The data acquisition module is used for acquiring a user search log list and a first positive sample data set and a first negative sample data set in the user search log list, wherein the user search log list comprises a user address sub-item, a search content sub-item and a document title sub-item;
the model generation module is used for carrying out word segmentation on the search content sub-item and the document title sub-item according to the user search log list, the first positive sample data set and the first negative sample data set to obtain a word frequency parameter, an inverse text frequency parameter, a word frequency and inverse text product parameter, a Jaccard similarity parameter and a Cosine similarity parameter, adding the first positive sample data set and the first negative sample data set into user characteristics and obtaining a search click rate estimation model; acquiring a second positive sample data set of a user search log list according to the search click rate pre-estimation model, calculating a Jaccard similarity parameter and a Cosine similarity parameter according to the first positive sample data set and the second positive sample data set, and acquiring a third positive sample data set and a second negative sample data set of the user search log list; loading Google general corpus according to the first negative sample data set and the third positive sample data set, setting the longest sequence length of the text, the learning rate and the training round parameters, and acquiring a BERT semantic similarity model;
and the recall result module is used for constructing a vector index library by using a Faiss frame according to the BERT semantic similarity model and acquiring a recalled search result.
Specifically, in another embodiment of the present application, the data obtaining module is configured to obtain a user search log list, where the user search log list includes a user address sub-item, a search content sub-item, and a document title sub-item; dividing the user search log list into a first user search log list, a second user search log list and a third user search log list according to the user search log list, wherein the first user search log list is used for recording the user address sub-items, the display results of the search content sub-items and corresponding documents, the second user search log list is used for recording the user address sub-items, the click results of the search content sub-items and corresponding documents, and the third user search log list is used for recording user information; according to the first user search log list, the second user search log list and the third user search log list, aggregating the search content sub-items to obtain a document click rate corresponding to a certain search content sub-item in a set time interval; setting a document click rate threshold value according to the document click rate; when the document click rate corresponding to a certain search content sub-item exceeds the document click rate threshold value, acquiring a first positive sample data set, wherein the first positive sample data set comprises a document title sub-item and a document corresponding to the search content sub-item; and when the document click rate corresponding to a certain search content sub-item does not exceed the document click rate threshold value, acquiring a first negative sample data set, wherein the first negative sample data set comprises a document title sub-item and a document corresponding to the search content sub-item.
Specifically, in another embodiment of the present application, the model generating module is configured to perform word segmentation processing on the search content sub-item and the document title sub-item, and obtain word segmentation words of the search content sub-item, the document title sub-item, and a sum of the search content sub-item and the document title sub-item; calculating the times of the word segmentation words in the first positive sample data set and the first negative sample data set, and dividing the times by the total number of the word segmentation words in the first positive sample data set and the first negative sample data set to obtain word frequency parameters; calculating the total number of documents in the user search log list, and dividing the total number of documents by the number of documents with the word segmentation words to obtain an inverse text frequency parameter; calculating the product of the word frequency parameter and the inverse text frequency parameter to obtain a product parameter of the word frequency and the inverse text; calculating the intersection of the search content sub-item participle and the document title sub-item participle, and dividing the intersection by the union of the search content sub-item participle and the document title sub-item participle to obtain a Jaccard similarity parameter; vectorizing the search content sub-item participle and the document title sub-item participle through the product parameter of the word frequency and the inverse text, calculating the Cosine distance of the text vector of the search content sub-item participle and the document title sub-item participle, and obtaining the Cosine similarity parameter.
Specifically, in another embodiment of the present application, the model generation module is configured to combine the first positive sample data set and the first negative sample data set with user features to obtain deep model training data, and to perform deep model training on the deep model training data to obtain a search click rate prediction model.
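A minimal sketch of such a deep click-rate model, assuming PyTorch, arbitrary layer sizes and an illustrative feature layout (the similarity parameters concatenated with user features):

import torch
import torch.nn as nn

class CtrModel(nn.Module):
    """Small feed-forward network mapping a feature row to a click rate."""
    def __init__(self, n_features: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(x)).squeeze(-1)

def train(model: CtrModel, features: torch.Tensor, labels: torch.Tensor,
          epochs: int = 10, lr: float = 1e-3) -> None:
    """Full-batch training loop; labels are 1.0 for positives, 0.0 for negatives."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(features), labels)
        loss.backward()
        opt.step()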
Specifically, in another embodiment of the present application, the model generation module is configured to predict, according to the search click rate prediction model, the click rate of each document in the user search log list; to set a document click rate threshold; and to acquire a second positive sample data set of the user search log list from the documents whose predicted click rate is greater than the set document click rate threshold.
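Continuing the previous sketch, the second positive sample set might then be selected as follows; the threshold value and argument names are assumptions:

import torch

def select_second_positives(model: torch.nn.Module,
                            feature_matrix: torch.Tensor,
                            threshold: float = 0.5) -> torch.Tensor:
    """Return the log rows whose predicted click rate clears the threshold."""
    model.eval()
    with torch.no_grad():
        predicted_ctr = model(feature_matrix)   # one score per log row
    return feature_matrix[predicted_ctr > threshold]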
Specifically, in another embodiment of the present application, the model generation module is configured to acquire the deduplicated union of the first positive sample data set and the second positive sample data set; to calculate the Jaccard similarity parameter and the Cosine similarity parameter of the search content sub-item and the document title sub-item in the deduplicated data; and to set a similarity threshold, acquiring a third positive sample data set of the user search log list when the Jaccard similarity parameter and the Cosine similarity parameter of the search content sub-item and the document title sub-item are both greater than the similarity threshold, and acquiring a second negative sample data set of the user search log list when the two similarity parameters are not both greater than the similarity threshold.
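A sketch of this de-duplication and dual-threshold filtering step, reusing the jaccard() and cosine() helpers from the earlier sketch; the threshold value and the whitespace tokenisation (a stand-in for real word segmentation) are assumptions:

from collections import Counter

SIM_THRESHOLD = 0.3  # assumed similarity threshold

def split_third_sets(first_pos: set[tuple[str, str]],
                     second_pos: set[tuple[str, str]]):
    """Union the two positive sets (the union de-duplicates), then keep a
    (query, title) pair as a third positive only when both similarity
    parameters clear the threshold; otherwise it becomes a second negative."""
    third_positive, second_negative = [], []
    for query, title in first_pos | second_pos:
        q_tokens, t_tokens = set(query.split()), set(title.split())
        j = jaccard(q_tokens, t_tokens)
        c = cosine(Counter(q_tokens), Counter(t_tokens))  # toy count vectors
        if j > SIM_THRESHOLD and c > SIM_THRESHOLD:
            third_positive.append((query, title))
        else:
            second_negative.append((query, title))
    return third_positive, second_negative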
Specifically, in another embodiment of the present application, the recall result module is configured to construct, over the full resource set, a keyword-based inverted index and a semantic-vector-based vector index built with the Faiss framework.
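A minimal sketch of the semantic-vector side of this index using Faiss; the keyword inverted index would typically be served by a conventional search engine and is not shown. embed() refers to the BERT pooling helper sketched earlier, and the exact-search IndexFlatIP is an assumption (Faiss offers many index types):

import faiss
import numpy as np

def build_vector_index(title_vectors: np.ndarray) -> faiss.IndexFlatIP:
    """Index L2-normalised float32 embeddings so inner product equals cosine."""
    vectors = np.ascontiguousarray(title_vectors, dtype="float32")
    faiss.normalize_L2(vectors)                 # in-place normalisation
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)
    return index

def recall_titles(index: faiss.IndexFlatIP, query_vec: np.ndarray, k: int = 10):
    """Return ids and scores of the k nearest titles; query_vec has shape (1, d)."""
    query = np.ascontiguousarray(query_vec, dtype="float32")
    faiss.normalize_L2(query)
    scores, ids = index.search(query, k)        # Faiss returns (distances, ids)
    return ids[0], scores[0]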
It should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules, units or components of the devices in the examples disclosed herein may be arranged in a device as described in the embodiments, or alternatively located in one or more devices different from those in the examples. The modules in the foregoing examples may be combined into one module or further divided into multiple sub-modules.
Those skilled in the art will appreciate that the modules in the device of an embodiment may be adaptively changed and arranged in one or more devices different from those of the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and they may furthermore be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
Furthermore, some of the embodiments are described herein as a method or combination of method elements that can be implemented by a processor of a computer system or by other means of carrying out the described functions. A processor having the necessary instructions for carrying out such a method or method element thus forms a means for carrying out the method or method element. Further, an element of an apparatus embodiment described herein is an example of a means for carrying out the function performed by that element for the purpose of carrying out the invention.
As used herein, unless otherwise specified, the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object merely indicates that different instances of like objects are being referred to, and is not intended to imply that the objects so described must be in a given sequence, whether temporally, spatially, in ranking, or in any other manner.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present invention has been disclosed in an illustrative rather than a restrictive sense, and the scope of the present invention is defined by the appended claims.

Claims (10)

1. A search method adapted to generate recalled search results in the automotive field based on semantic vectors, the method comprising the steps of:
acquiring a user search log list and a first positive sample data set and a first negative sample data set in the user search log list, wherein the user search log list comprises a user address sub-item, a search content sub-item and a document title sub-item;
performing word segmentation on the search content sub-item and the document title sub-item according to the user search log list, the first positive sample data set and the first negative sample data set to obtain a word frequency parameter, an inverse text frequency parameter, a product parameter of the word frequency and the inverse text frequency, a Jaccard similarity parameter and a Cosine similarity parameter, and combining the first positive sample data set and the first negative sample data set with user features to obtain a search click rate prediction model;
acquiring a second positive sample data set of the user search log list according to the search click rate prediction model, calculating the Jaccard similarity parameter and the Cosine similarity parameter according to the first positive sample data set and the second positive sample data set, and acquiring a third positive sample data set and a second negative sample data set of the user search log list;
loading a general Google corpus according to the first negative sample data set and the third positive sample data set, setting the maximum text sequence length, the learning rate and the number of training epochs, and acquiring a BERT semantic similarity model;
and according to the BERT semantic similarity model, constructing a vector index library using the Faiss framework and acquiring recalled search results.
2. The method of claim 1, wherein the acquiring of a user search log list and of a first positive sample data set and a first negative sample data set in the user search log list comprises:
acquiring a user search log list, wherein the user search log list comprises a user address sub-item, a search content sub-item and a document title sub-item;
dividing the user search log list into a first user search log list, a second user search log list and a third user search log list, wherein the first user search log list records the user address sub-items, the search content sub-items and the display results of the corresponding documents, the second user search log list records the user address sub-items, the search content sub-items and the click results of the corresponding documents, and the third user search log list records user information;
according to the first user search log list, the second user search log list and the third user search log list, aggregating the search content sub-items to obtain a document click rate corresponding to a certain search content sub-item in a set time interval;
setting a document click rate threshold value according to the document click rate;
when the document click rate corresponding to a certain search content sub-item exceeds the document click rate threshold value, acquiring a first positive sample data set, wherein the first positive sample data set comprises a document title sub-item and a document corresponding to the search content sub-item;
and when the document click rate corresponding to a certain search content sub-item does not exceed the document click rate threshold value, acquiring a first negative sample data set, wherein the first negative sample data set comprises a document title sub-item and a document corresponding to the search content sub-item.
3. The method of claim 1, wherein the step of performing word segmentation on the search content sub-item and the document title sub-item according to the user search log list, the first positive sample data set and the first negative sample data set to obtain a word frequency parameter, an inverse text frequency parameter, a product parameter of the word frequency and the inverse text frequency, a Jaccard similarity parameter, and a Cosine similarity parameter comprises:
performing word segmentation processing on the search content sub-item and the document title sub-item to obtain the search content sub-item words, the document title sub-item words, and the segmented words of their concatenation;
counting the number of occurrences of the segmented words in the first positive sample data set and the first negative sample data set, and dividing by the total number of segmented words in the first positive sample data set and the first negative sample data set to obtain the word frequency parameter;
calculating the total number of documents in the user search log list, and dividing the total number of documents by the number of documents containing the segmented words to obtain the inverse text frequency parameter;
calculating the product of the word frequency parameter and the inverse text frequency parameter to obtain the product parameter of the word frequency and the inverse text frequency;
calculating the intersection of the search content sub-item words and the document title sub-item words, and dividing it by the union of the search content sub-item words and the document title sub-item words to obtain the Jaccard similarity parameter;
and vectorizing the search content sub-item words and the document title sub-item words through the product parameter of the word frequency and the inverse text frequency, and calculating the Cosine distance between the two text vectors to obtain the Cosine similarity parameter.
4. The method of claim 1, wherein the step of combining the first positive sample data set and the first negative sample data set with user features to obtain the search click rate prediction model comprises:
combining the first positive sample data set and the first negative sample data set with user features to obtain deep model training data;
and performing deep model training according to the deep model training data to obtain the search click rate prediction model.
5. The method of claim 1, wherein said acquiring a second positive sample data set of the user search log list according to the search click rate prediction model comprises:
predicting, according to the search click rate prediction model, the click rate of each document in the user search log list;
setting a document click rate threshold, and acquiring a second positive sample data set of the user search log list when the click rate of the document in the user search log list is greater than the set document click rate threshold.
6. The method according to claim 1, wherein the step of calculating the Jaccard similarity parameter and the Cosine similarity parameter according to the first positive sample data set and the second positive sample data set, and acquiring a third positive sample data set and a second negative sample data set of the user search log list comprises:
acquiring the deduplicated data of the first positive sample data set and the second positive sample data set;
calculating the Jaccard similarity parameter and the Cosine similarity parameter of the search content sub-item and the document title sub-item in the deduplicated data;
setting a similarity threshold; when the Jaccard similarity parameter and the Cosine similarity parameter of the search content sub-item and the document title sub-item are both greater than the similarity threshold, acquiring a third positive sample data set of the user search log list; and when the two similarity parameters are not both greater than the similarity threshold, acquiring a second negative sample data set of the user search log list.
7. The method of claim 1, wherein the step of constructing a vector index library using a Faiss framework comprises:
constructing, over the full resource set, a keyword-based inverted index and a semantic-vector-based vector index using the Faiss framework.
8. A search apparatus adapted to generate recalled search results in the automotive field based on semantic vectors, the apparatus comprising:
the data acquisition module is used for acquiring a user search log list and a first positive sample data set and a first negative sample data set in the user search log list, wherein the user search log list comprises a user address sub-item, a search content sub-item and a document title sub-item;
the model generation module is used for performing word segmentation on the search content sub-item and the document title sub-item according to the user search log list, the first positive sample data set and the first negative sample data set to obtain a word frequency parameter, an inverse text frequency parameter, a product parameter of the word frequency and the inverse text frequency, a Jaccard similarity parameter and a Cosine similarity parameter, and combining the first positive sample data set and the first negative sample data set with user features to obtain a search click rate prediction model; acquiring a second positive sample data set of the user search log list according to the search click rate prediction model, calculating the Jaccard similarity parameter and the Cosine similarity parameter according to the first positive sample data set and the second positive sample data set, and acquiring a third positive sample data set and a second negative sample data set of the user search log list; and loading a general Google corpus according to the first negative sample data set and the third positive sample data set, setting the maximum text sequence length, the learning rate and the number of training epochs, and acquiring a BERT semantic similarity model;
and the recall result module is used for constructing a vector index library with the Faiss framework according to the BERT semantic similarity model and acquiring recalled search results.
9. A computing device, comprising:
one or more processors; and
a memory;
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing any of the methods of claims 1-7.
10. A computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the methods of claims 1-7.
CN202110956827.XA 2021-08-19 2021-08-19 Search method, device, equipment and storage medium Pending CN113626713A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110956827.XA CN113626713A (en) 2021-08-19 2021-08-19 Search method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110956827.XA CN113626713A (en) 2021-08-19 2021-08-19 Search method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113626713A true CN113626713A (en) 2021-11-09

Family

ID=78386793

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110956827.XA Pending CN113626713A (en) 2021-08-19 2021-08-19 Search method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113626713A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115017425A (en) * 2022-07-19 2022-09-06 深圳依时货拉拉科技有限公司 Location search method, location search device, electronic device, and storage medium
CN115545853A (en) * 2022-12-02 2022-12-30 云筑信息科技(成都)有限公司 Searching method for searching suppliers
CN115545853B (en) * 2022-12-02 2023-06-23 云筑信息科技(成都)有限公司 Searching method for searching suppliers
CN116523024A (en) * 2023-07-03 2023-08-01 腾讯科技(深圳)有限公司 Training method, device, equipment and storage medium of recall model
CN116523024B (en) * 2023-07-03 2023-10-13 腾讯科技(深圳)有限公司 Training method, device, equipment and storage medium of recall model

Similar Documents

Publication Publication Date Title
US11030415B2 (en) Learning document embeddings with convolutional neural network architectures
Alami et al. Enhancing unsupervised neural networks based text summarization with word embedding and ensemble learning
US11379668B2 (en) Topic models with sentiment priors based on distributed representations
CN113626713A (en) Search method, device, equipment and storage medium
CN111930929B (en) Article title generation method and device and computing equipment
WO2018049960A1 (en) Method and apparatus for matching resource for text information
CN110704391A (en) Word stock construction method and computing device
KR20200007713A (en) Method and Apparatus for determining a topic based on sentiment analysis
Xiang et al. Incorporating label dependency for answer quality tagging in community question answering via cnn-lstm-crf
Huang et al. A patent keywords extraction method using TextRank model with prior public knowledge
Mundra et al. Fine-grained emotion detection in contact center chat utterances
CN111400584A (en) Association word recommendation method and device, computer equipment and storage medium
Zhao et al. WTL-CNN: A news text classification method of convolutional neural network based on weighted word embedding
Aliakbarpour et al. Improving the readability and saliency of abstractive text summarization using combination of deep neural networks equipped with auxiliary attention mechanism
Wang et al. A method of music autotagging based on audio and lyrics
CN111523311B (en) Search intention recognition method and device
CN110888983B (en) Positive and negative emotion analysis method, terminal equipment and storage medium
CN116521825A (en) Method for generating text matching model, computing device and storage medium
CN107729509B (en) Discourse similarity determination method based on recessive high-dimensional distributed feature representation
CN113626614B (en) Method, device, equipment and storage medium for constructing information text generation model
CN113468890B (en) Sedimentology literature mining method based on NLP information extraction and part-of-speech rules
CN114490946A (en) Xlnet model-based class case retrieval method, system and equipment
CN113609841A (en) Training method and computing device for topic word generation model
CN115292459A (en) Information retrieval method based on question-answering library, question-answering system and computing equipment
Yadav et al. Effectiveness of domain-based lexicons vis-à-vis general lexicon for aspect-level sentiment analysis: A comparative analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination