WO2020108608A1 - Search result processing method and apparatus, terminal, electronic device, and storage medium - Google Patents

Search result processing method and apparatus, terminal, electronic device, and storage medium

Info

Publication number
WO2020108608A1
Authority
WO
WIPO (PCT)
Prior art keywords
search
search result
result
vector
similarity
Prior art date
Application number
PCT/CN2019/121928
Other languages
English (en)
French (fr)
Inventor
吴逸峰
颜强
郑文豪
陈晓寅
詹德川
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2020108608A1
Priority to US17/200,128 (published as US11586637B2)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2457Query processing with adaptation to user needs
    • G06F16/24578Query processing with adaptation to user needs using ranking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3346Query execution using probabilistic model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3334Selection or weighting of terms from queries, including natural language queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • This application relates to the field of computer technology, and in particular to search result processing methods, devices, terminals, electronic equipment, and storage media.
  • the user can obtain a corresponding search result list according to the search keyword, and then calculate the text similarity through an algorithm model, sort according to the similarity score, and return the sorted search result list.
  • However, the similarity-calculation algorithm models in traditional technologies are usually aimed at a single search requirement, such as general demand; they neither consider users' different search needs nor the influence of text word order and context, so a search result with a high similarity score does not necessarily meet the user's needs. As a result, in a search result list sorted by similarity, the top-ranked search results are not necessarily the results the user is most satisfied with or looking for.
  • the embodiments of the present application provide a search result processing method, device, terminal, electronic device, and storage medium to solve the problem that the similarity calculation of search results in traditional technologies is inaccurate and cannot meet different search needs of users.
  • An embodiment of the present application provides a search result processing method, which is executed by an electronic device and includes: obtaining each search result according to a search keyword; for each search result, obtaining an exact matching score of the search result; for each search result, determining a semantic matching weight vector of the search result and semantic representation vectors of the search keyword and the search result, and obtaining a semantic matching score of the search result according to the semantic representation vectors and the semantic matching weight vector; and, for each search result, obtaining the similarity between the search result and the search keyword according to the exact matching score and the semantic matching score.
  • Another embodiment of the present application provides a search result processing apparatus, including:
  • an acquisition module, used to obtain each search result according to the search keyword;
  • an exact matching module, used to obtain, for each search result, the exact matching score of the search result;
  • a semantic matching module, used to determine, for each search result, a semantic matching weight vector of the search result and semantic representation vectors of the search keyword and the search result, and to obtain the semantic matching score of the search result according to the semantic representation vectors and the semantic matching weight vector;
  • an obtaining module, configured to obtain, for each search result, the similarity between the search result and the search keyword according to the exact matching score and the semantic matching score.
  • Another embodiment of the present application provides a terminal, including:
  • a first receiving module, used to receive an input search keyword;
  • a sending module, used to send the received search keyword to a server, so that the server executes any one of the above search result processing methods to obtain the similarity between each search result and the search keyword, and obtains sorted search results according to the similarity of each search result to the search keyword;
  • a second receiving module, used to receive the sorted search results returned by the server;
  • a display module, used to display the sorted search results.
  • Another embodiment of the present application provides an electronic device, including:
  • At least one memory for storing program instructions
  • At least one processor configured to call program instructions stored in the memory, and execute any one of the above search result processing methods according to the obtained program instructions.
  • Another embodiment of the present application provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, any of the foregoing search result processing methods is implemented.
  • FIG. 1 is a schematic diagram of an application architecture of each of the embodiments of the present application;
  • FIG. 2 is a flowchart of a search result processing method in an embodiment of this application;
  • FIG. 3 is a schematic structural diagram of a similarity model in an embodiment of this application;
  • FIG. 4 is a schematic structural diagram of the first convolutional network for exact matching in an embodiment of the present application;
  • FIG. 5 is a schematic structural diagram of the first fully connected network for exact matching in an embodiment of the present application;
  • FIG. 6 is a schematic structural diagram of a third fully connected network for semantic matching in an embodiment of this application.
  • FIG. 7 is a flowchart of a similarity model training method in an embodiment of this application;
  • FIG. 8 is a flowchart of an initial sample set acquisition method in an embodiment of this application;
  • FIG. 9 is a flowchart of another similarity model training method in an embodiment of this application;
  • FIG. 10 is a flowchart of a method for constructing a binary data training set in an embodiment of the present application;
  • FIG. 11 is a flowchart of a method for constructing a triplet data training set in an embodiment of the present application;
  • FIG. 12 is a schematic diagram of a search-search interface in a specific application scenario in an embodiment of this application.
  • FIG. 13 is a schematic diagram of a search result display interface in a specific application scenario in an embodiment of the present application.
  • FIG. 14 is a schematic diagram of a search result display interface after sorting in a specific application scenario in an embodiment of this application;
  • FIG. 15 is a schematic structural diagram of a search result processing device in an embodiment of this application.
  • FIG. 16 is a schematic structural diagram of another search result processing device in an embodiment of the present application;
  • FIG. 17 is a schematic structural diagram of a terminal in an embodiment of the present application;
  • FIG. 18 is a schematic structural diagram of an electronic device in an embodiment of this application.
  • As shown in FIG. 1, it is a schematic diagram of the application scenario architecture of the embodiments of the present application, which at least includes a terminal and a server.
  • the terminal can be any smart device such as a smart phone, tablet computer, portable personal computer, smart TV, etc.
  • Various applications (Application, APP) can be installed on the terminal, and the user can search for the required information through an APP on the terminal.
  • the server can provide various network services for the terminal.
  • the server can be regarded as a background server that provides corresponding network services.
  • the server may be a server, a server cluster composed of several servers, or a cloud computing center, which is not limited.
  • the terminal and the server can be connected through the Internet to achieve mutual communication.
  • the search result processing method can be applied to the terminal or the server without limitation, and the similarity model training method can be applied to the server.
  • The embodiments of the present application are mainly directed at vertical search application scenarios, in which users usually have multiple search requirements, such as general demand and addressing demand; of course, the embodiments are not limited to this application scenario.
  • Vertical search is a professional search engine for a certain industry, and is a subdivision and extension of general search engines: it integrates a certain class of special information from the library, extracts the required data in a targeted sub-field, processes it, and returns it to the user in some form. For example, in WeChat's Search-Search, it refers to searching a specific type of results, such as public account search or mini program search.
  • General demand means that the purpose of the user's search is to find any target that meets a certain class of needs, rather than one fixed target; results are obtained by inputting demand keywords. Addressing demand means that the user's search purpose is to find one fixed target whose exact name the user cannot determine, so similar keywords are needed to search for it.
  • The implementation provided in the embodiments of the present application performs exact matching and semantic matching between the search keyword and each search result. For exact matching, not only literal matching is considered but semantic information is also incorporated. For semantic matching, a semantic matching weight vector is determined for each search result to fully mine the user's search needs; the semantic matching weight vector, determined according to the user's different search needs, adjusts the influence of semantic matching and exact matching on the final similarity. The semantic matching score is determined according to the semantic matching weight vector, and the similarity of each search result is finally determined based on the exact matching score and the semantic matching score. This is better suited to scenarios with multiple search needs, so that search results with higher similarity better match user needs and accuracy is improved.
  • In addition, the sorting operation in the embodiments of the present application may be performed on either the terminal or the server, which is not limited. When displaying search results, the terminal may directly obtain the search results with the highest similarity based on the similarities and display them in descending order of similarity, or it may receive already-sorted search results and display them in that order.
  • The following description takes the case where the search result processing method and the similarity model training method are applied to the application architecture shown in FIG. 1 as an example.
  • As shown in FIG. 2, it is a flowchart of a search result processing method in an embodiment of the present application.
  • the method includes:
  • Step 200 Obtain each search result according to the search keyword.
  • When performing step 200, each search result may be obtained by matching. Specifically, the search keyword is segmented, matching is performed based on each participle of the search keyword to obtain the search result set corresponding to each participle, and the intersection of these search result sets is taken as the search results corresponding to the search keyword.
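  • For illustration only, a minimal Python sketch of this recall step, assuming a toy inverted index keyed by participle (the index structure and the segment helper are illustrative, not details fixed by the disclosure):

```python
def recall(keyword, inverted_index, segment):
    """Return the search results matched by every participle of the search keyword."""
    participles = [p for p in segment(keyword) if p]   # drop empty padding characters
    # One candidate set per participle; the intersection is the recalled result set.
    result_sets = [set(inverted_index.get(p, ())) for p in participles]
    return set.intersection(*result_sets) if result_sets else set()
```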
  • Step 210 Obtain the exact matching score of the search result.
  • step 210 specifically includes:
  • First, a multi-granular word segmentation method is used to obtain the word segmentation results corresponding to the search keyword and each search result. The search keyword and each search result are segmented separately: if they contain high-frequency words from a preset high-frequency word set, those high-frequency words are split off as whole units, and the rest of the text is segmented character by character, yielding the word segmentation results corresponding to the search keyword and to each search result. The preset high-frequency word set can be obtained by first segmenting corpus data with a word segmentation tool, counting word frequencies, and selecting high-frequency words according to the word frequencies.
  • Then, the word segmentation results of the search keyword and the search result are exactly matched to obtain an interaction matrix between the search keyword and the search result. To describe more accurately whether the two texts are similar, the influence of the relative positional relationship between the participles of the search keyword and those of the search result is also considered: from the interaction matrix and a relative-position matrix D (whose entries D_ij encode the relative positions of participle pairs), an interaction feature MF is determined. The interaction feature MF may be considered a second-order interaction matrix, and the information it contains carries local language structural features.
  • the convolutional network is mainly used for extracting features of different dimensions, for example, using a convolutional neural network (Convolutional Neural Networks, CNN) network.
  • the first convolutional network may be a two-dimensional convolutional network, including a 5-layer convolutional network, which may be specifically set according to actual conditions, and is not limited in the embodiment of the present application.
  • The convolution kernels and other parameters of the first convolutional network are pre-trained to meet the needs of feature extraction. The resulting convolution feature is flattened into its vector representation, which is input into the first fully connected network to facilitate the subsequent fully connected computation.
  • the fully-connected network is mainly used for spatial position transformation.
  • The first fully connected network may include three fully connected layers, the parameters of each of which are pre-trained, so that the convolution features can finally be mapped to a one-dimensional vector, i.e., the exact matching score.
  • Step 220 Determine a semantic matching weight vector of the search result and semantic representation vectors of the search keyword and the search result, and obtain a semantic matching score of the search result according to the semantic representation vectors and the semantic matching weight vector.
  • When performing step 220 for short text matching, the relationships between words need to be examined in order to determine the semantic matching score; to capture the importance of individual words, the interaction matrix is used here to determine the semantic matching weight.
  • the second convolutional network is a one-dimensional convolutional network, including a layer of convolution.
  • the second fully connected network includes a fully connected layer, and through the second fully connected network, a semantic matching weight vector is obtained by mapping.
  • the vector representation of the convolution feature may be input to the second fully connected network.
  • That is, the interaction matrix obtained from exact matching is introduced into the semantic matching part to obtain the semantic matching weight vector, so that the semantic matching part can more reasonably capture the parts of the text that need attention. By training the second convolutional network and the second fully connected network, the semantic matching weight vector can adjust the influence of the semantic matching score and the exact matching score on the final similarity according to different search requirements, taking into account both the user's general search demand and addressing demand.
  • Next, the semantic representation vectors of the search keyword and of the search result are determined respectively.
  • Multi-granular word segmentation is adopted to obtain the word segmentation results corresponding to the search keywords and each search result respectively.
  • the search keywords and each search result are segmented into word segments.
  • Then, the word vectors of each participle in the word segmentation results of the search keyword and of each search result are obtained, and the word representation matrices of the search keyword and of each search result are built from the word vectors of their participles. Specifically, a preset word vector model maps each participle in a word segmentation result into a vector, i.e., the word vector of that participle, so that the word vectors of all participles in the word segmentation result constitute the word representation matrix of the corresponding search keyword or search result. For example, if the search keyword is divided into 15 participles and each participle is mapped to a 300-dimensional word vector, the word representation matrix of the search keyword is a 15*300 matrix.
  • The word2vec model can be used as the above word vector model. The word2vec model is a shallow, two-layer neural network. Under word2vec's bag-of-words assumption, the order of words is not important. After training is completed, the word2vec model can map each word into a vector that represents word-to-word relationships; this mapping is the hidden layer of the neural network.
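  • For illustration only, a minimal sketch of this mapping using the gensim library (the toy corpus, zero-vector padding, and 15-participle length are illustrative assumptions):

```python
import numpy as np
from gensim.models import Word2Vec

corpus = [["公众", "号"], ["小", "程序", "搜索"]]      # toy pre-segmented corpus
w2v = Word2Vec(corpus, vector_size=300, window=5, min_count=1)

def word_matrix(participles, model, length=15):
    """Stack participle word vectors into a fixed-size 15x300 word representation matrix."""
    rows = [model.wv[p] if p in model.wv else np.zeros(model.vector_size)
            for p in participles[:length]]
    rows += [np.zeros(model.vector_size)] * (length - len(rows))   # pad short texts
    return np.stack(rows)                                          # shape (15, 300)
```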
  • Each word representation matrix (namely, the search keyword's word representation matrix and the search result's word representation matrix) is input into the pre-trained third convolutional network, which performs feature extraction and outputs the convolution feature corresponding to each word representation matrix.
  • the third convolutional network is a one-dimensional convolutional network, including a layer of convolution.
  • The convolution features (for example, their vector representations) are then input into the pre-trained third fully connected network.
  • the third fully connected network includes two fully connected layers.
  • the third convolutional network and the third fully connected network are trained in advance, so that the search keywords and the semantic representation vector of each search result can be obtained according to the trained third convolutional network and the third fully connected network.
  • Alternatively, an RNN-based method can be used; different methods can be selected according to actual needs, which is not limited in the embodiments of the present application.
  • For example, the semantic representation vector after concatenation is a 64-dimensional vector.
  • The semantic matching score is s2 = w · f, where · represents the vector dot product operation, w is the semantic matching weight vector, and f is the semantic representation vector after concatenation.
  • the semantic representation vector and the corresponding semantic matching weight vector are multiplied to obtain the final semantic matching score.
  • Step 230 Obtain the similarity between the search result and the search keyword according to the exact match score and the semantic match score of the search result.
  • Step 230, when executed, specifically includes: adding the exact matching score and the semantic matching score, and taking the sum as the similarity between the search result and the search keyword.
  • In the embodiments of the present application, the above search result processing method may be completed based on a similarity model. That is, steps 200-230 (the search result processing method) are implemented by a pre-trained similarity model: the search keyword and each search result are input into the similarity model, which, after executing the search result processing method of the above embodiment, outputs the similarity between each search result and the search keyword.
  • After the similarity between each search result and the search keyword is obtained, the similarity can be applied. Specifically, one possible implementation is provided: based on the similarity between each search result and the search keyword, the search results are sorted to obtain the sorted search results.
  • In the embodiments of the present application, the semantic matching weight vector can adjust the impact of semantic matching and exact matching on the final similarity: when the search request leans toward addressing demand, the semantic matching weight is lower, reducing the impact of semantic matching on the final result; when the search request leans toward general demand, the semantic matching weight is higher, reducing the impact of exact matching on the final result. In this way, the final similarity can meet different search needs, and the top-ranked search results better meet user needs.
  • In the embodiments of the present application, a similarity model is created and trained to obtain the similarity between each search result and the search keyword; that is, the above steps 200-230 are mainly implemented based on the similarity model. Therefore, the algorithm of the similarity model is described in detail below for a specific application scenario, taking the search for public accounts in WeChat as an example.
  • As shown in FIG. 3, it is a schematic structural diagram of the similarity model in an embodiment of the present application, in which calculating the similarity between the search keyword and one of the search results is taken as an example for description.
  • the input of the similarity model is a search keyword and a search result.
  • The search keyword is a character string of length L1, and the search result is a character string of length L2; in this application scenario, the search result is the public account title. Based on FIG. 3, the process is mainly divided into the following steps:
  • Step 1 Perform text preprocessing on the search keywords and search results separately.
  • Step 2 Use multi-granular word segmentation to segment the pre-processed search keywords and search results.
  • Specifically, the search keyword and the search result are segmented separately: if they contain high-frequency words from the preset high-frequency word set, those high-frequency words are split off as whole units, and the rest of the text is segmented character by character, yielding the word segmentation results corresponding to the search keyword and the search result respectively.
  • A fixed-length scheme is adopted: according to a preset fixed length, if the text of the search keyword or of a search result is longer than the preset fixed length, the part exceeding the fixed length is deleted; if it is shorter, empty characters are appended at the end of the text to bring its length to the preset fixed length. The preset fixed length is, for example, 15; the preset fixed lengths for the search keyword and the search result may also differ, which is not limited in the embodiments of the present application and can be set according to actual conditions.
  • Traditional word segmentation mainly takes two forms. One form trains a word segmentation tool on an existing Chinese corpus and then segments text with the trained tool; because this form pays attention to the semantics of the context, its segmentation results fit the user's natural understanding of the text and are suitable for extracting semantic features. However, it is not suitable for exact matching, which is concerned only with the degree to which two texts are literally identical: with such segmentation results as input, exact matches may fail to be generated, so a search result could effectively be retrieved only by entering one specific search keyword. The other form segments by character; if the granularity is too small, it affects the accuracy of the search. Therefore, the embodiments of the present application adopt a multi-granular word segmentation method, sketched after this paragraph: for high-frequency words, the semantics of the context is considered and the high-frequency words are divided as wholes, while, considering the exact matching requirement, the non-high-frequency parts are segmented by character, i.e., with a fixed length of 1. In this way, semantic and exact matching are considered comprehensively, and different segmentation strategies are applied to different parts of the text, i.e., multi-granular segmentation, which meets the requirements of each part of the text, mines the information in the text more effectively at multiple levels, and better suits short text matching under multiple search demand scenarios.
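  • For illustration only, a minimal Python sketch of this multi-granular segmentation (the greedy longest-match strategy and the padding scheme are illustrative assumptions):

```python
def multi_granular_segment(text, high_freq_words, fixed_length=15):
    """Split off high-frequency words whole; segment the rest character by character."""
    participles, i = [], 0
    while i < len(text):
        # Greedy longest match against the preset high-frequency word set.
        match = next((w for w in sorted(high_freq_words, key=len, reverse=True)
                      if text.startswith(w, i)), None)
        if match:
            participles.append(match)
            i += len(match)
        else:
            participles.append(text[i])   # fixed length 1: one character
            i += 1
    # Fixed-length scheme: truncate, or pad with empty characters.
    participles = participles[:fixed_length]
    participles += [""] * (fixed_length - len(participles))
    return participles
```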
  • Step 3 Calculate the exact match score s 1 of the search keywords and search results.
  • Suppose the search keyword and the search result are each divided into 15 participles; for example, the search keyword is divided into q1,...,q15 and the search result into d1,...,d15. Specifically:
  • First, the participles of the search keyword are exactly matched against the participles of the search result to obtain the interaction matrix M. By further mining the relative positional relationship of the participles within the segmentation results, text word order is taken into account, which can improve the accuracy of exact matching. If, instead, exact matching information were mined directly from the interaction matrix alone, the influence of relative position would be ignored, and the exact matching might be inaccurate.
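  • For illustration only, a minimal numpy sketch of the interaction matrix, under the assumption that entry (i, j) is a binary indicator of an exact participle match (the construction of the second-order interaction feature MF from relative positions is abbreviated in the text and is therefore not sketched):

```python
import numpy as np

def interaction_matrix(q_parts, d_parts):
    """M[i, j] = 1 if the i-th keyword participle exactly equals the j-th result participle."""
    M = np.zeros((len(q_parts), len(d_parts)), dtype=np.float32)
    for i, q in enumerate(q_parts):
        for j, d in enumerate(d_parts):
            M[i, j] = 1.0 if q and q == d else 0.0   # empty padding never matches
    return M                                          # e.g. shape (15, 15)
```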
  • The interaction features are input into the pre-trained first convolutional network, which performs feature extraction and outputs the corresponding convolution features; these are flattened into a vector to obtain the vector representation of the convolution features corresponding to the interaction features.
  • For example, the first convolutional network includes 5 convolution layers. The interaction feature, of size 1*255*255, is input, and after the 5 convolution layers a matrix of size 512*7*7 is output, i.e., 512 channels of 7*7 feature maps. The number of channels, 512, can also be set according to actual conditions, and is not limited in the embodiments of the present application.
  • As shown in FIG. 4, the first convolution layer first passes the input through 32 two-dimensional convolutions (2D-conv) with 3*3 kernels, then applies batch normalization (batchnorm), and finally applies 2D max pooling with a 2*2 window. Each subsequent layer increases the number of convolution kernels, in order 64, 128, 256, 512, with the rest of the operations the same as in the first convolution layer. In this way, after 5 convolution layers, the 512-channel 7*7 feature maps are finally obtained.
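  • For illustration only, a minimal PyTorch sketch consistent with this description (the layer count and kernel counts follow the text; the padding choice, which lets a 255*255 input pool down to 7*7, and the ReLU activations are illustrative assumptions):

```python
import torch
import torch.nn as nn

class ExactMatchConv(nn.Module):
    """Five 2D conv layers with 32, 64, 128, 256, 512 kernels, each followed by
    batch normalization and 2x2 max pooling, as described for the first network."""
    def __init__(self):
        super().__init__()
        layers, in_ch = [], 1
        for out_ch in (32, 64, 128, 256, 512):
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                       nn.BatchNorm2d(out_ch),
                       nn.ReLU(),
                       nn.MaxPool2d(2)]
            in_ch = out_ch
        self.net = nn.Sequential(*layers)

    def forward(self, mf):            # mf: (batch, 1, 255, 255) interaction feature
        return self.net(mf)           # (batch, 512, 7, 7) convolution feature
```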
  • Then, an exact matching score s1 is obtained from the convolution features corresponding to the interaction feature.
  • Specifically, the vector representation of the convolution feature corresponding to the interaction feature is input into the pre-trained first fully connected network; based on the first fully connected network, the convolution feature is mapped into a first preset vector space to obtain the one-dimensional vector output by the first fully connected network, and this output one-dimensional vector is used as the exact matching score of the corresponding search result.
  • As shown in FIG. 5, it is a schematic structural diagram of the first fully connected network for exact matching in an embodiment of the present application.
  • the first fully-connected network includes three fully-connected layers.
  • the input is a vector with a size of 25,088.
  • a one-dimensional vector with a size of 1 is finally output.
  • For the first fully connected layer, the input first passes through a fully connected layer (dense) of dimension 4096, followed by a dropout operation; for the second fully connected layer, it likewise passes through a fully connected layer of dimension 4096 followed by dropout; for the third fully connected layer, it passes through a fully connected layer of dimension 1 with a dropout operation, finally mapping to a vector of size 1.
  • the convolutional features output by the first convolutional network are mapped to feature vectors of fixed length.
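  • For illustration only, a matching PyTorch sketch of this fully connected head (the text fixes the layer dimensions 4096, 4096, 1; the ReLU activations and dropout probability are illustrative assumptions, and the final layer's dropout is omitted here):

```python
import torch
import torch.nn as nn

class ExactMatchScore(nn.Module):
    """Flatten the 512x7x7 convolution feature (25,088 values) and map it to the scalar s1."""
    def __init__(self, p_drop=0.5):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(4096, 1))

    def forward(self, conv_feat):              # conv_feat: (batch, 512, 7, 7)
        return self.fc(conv_feat.flatten(1)).squeeze(-1)   # exact matching score s1
```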
  • Step 4 Calculate the semantic matching weight vector.
  • The semantic matching weight vector represents the influence of the relative position information of each participle in exact matching on semantic matching. Specifically:
  • Suppose the second convolutional network has 16 convolution kernels of size 15. The interaction matrix M obtained in step 3 is passed through this one-dimensional second convolutional network: a one-dimensional convolution operation is performed and the output is flattened into a vector to obtain its vector representation. The size of the interaction matrix M is 15*15; after the second convolutional network, a 15*16 matrix is output, which can be flattened into a vector of size 240. The second fully connected network includes one fully connected layer whose input is this 240-dimensional vector; after the spatial transformation mapping of the second fully connected network, a vector of dimension 64 is output, i.e., the semantic matching weight vector w ∈ R^64 is obtained.
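  • For illustration only, a minimal PyTorch sketch of this weight branch (the padding choice, which keeps the convolution output length at 15, is an illustrative assumption):

```python
import torch
import torch.nn as nn

class SemanticWeight(nn.Module):
    """Map the 15x15 interaction matrix M to the 64-dim semantic matching weight vector w."""
    def __init__(self):
        super().__init__()
        # 16 one-dimensional kernels of size 15 over the participle axis.
        self.conv = nn.Conv1d(in_channels=15, out_channels=16, kernel_size=15, padding=7)
        self.fc = nn.Linear(15 * 16, 64)   # single fully connected layer: 240 -> 64

    def forward(self, M):                  # M: (batch, 15, 15)
        x = self.conv(M).flatten(1)        # (batch, 240)
        return self.fc(x)                  # w: (batch, 64)
```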
  • In the embodiments of the present application, the semantic matching weight vector extracted from the interaction matrix satisfies the following: when the search request leans toward addressing demand, the semantic matching weight is lower, reducing the influence of semantic matching on the final similarity result; when the search request leans toward general demand, the semantic matching weight is higher, reducing the influence of exact matching on the final similarity result. Thus each part of the model can adaptively adjust its contribution to the final similarity according to different search needs, improving the accuracy of the similarity calculation and making the top-ranked search results better satisfy user needs.
  • Step 5 Calculate the semantic matching representation vector of search keywords and search results.
  • Suppose the segmentation result corresponding to the search keyword is q′1,...,q′15 and the segmentation result corresponding to the search result is d′1,...,d′15. First, each participle is mapped through a preset word vector model to obtain the word representation matrices of the search keyword and of the search result.
  • For example, the preset word vector model is word2vec; of course, other word vector models may be used, which is not limited in the embodiments of the present application. If the dimension of the mapped word vector is set to 300, each participle in the segmentation results of the search keyword and the search result is mapped to a word vector of size 300. With 15 participles per segmentation result, the word representation matrix corresponding to the search keyword is Q ∈ R^(15×300), and the word representation matrix corresponding to the search result is T ∈ R^(15×300).
  • Each word representation matrix is then input into the pre-trained third convolutional network, which performs feature extraction and outputs the convolution features corresponding to each matrix.
  • Suppose the third convolutional network has 32 convolution kernels of size 3. The word representation matrices of the search keyword and the search result, i.e., Q and T, each pass through the third convolutional network: a one-dimensional convolution operation is performed and the result is flattened into a vector, yielding a vector of size 32 for each.
  • As shown in FIG. 6, it is a schematic structural diagram of the third fully connected network for semantic matching in an embodiment of the present application.
  • the third fully-connected network includes a 2-layer fully-connected layer.
  • The input is a vector of size 32, and a vector of size 32 is finally output. The first fully connected layer passes the input through a dense layer of dimension 32 followed by a dropout operation; the second fully connected layer likewise passes through a dense layer of dimension 32 followed by dropout, finally outputting a vector of size 32 as the semantic representation.
  • the search keywords and the semantic representation vectors of the search results obtained through the third fully connected network are respectively f q , f t ⁇ R 32 .
  • The two semantic representation vectors f_q and f_t of the search keyword and the search result are concatenated, and the final semantic representation vector after concatenation is f ∈ R^64.
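  • For illustration only, a minimal PyTorch sketch of this representation branch (the global max pooling used to reach a size-32 vector, the ReLU activation, and the dropout probability are illustrative assumptions; the last two comment lines preview step 6):

```python
import torch
import torch.nn as nn

class SemanticRepr(nn.Module):
    """Map a 15x300 word representation matrix to a 32-dim semantic representation vector."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(in_channels=300, out_channels=32, kernel_size=3, padding=1)
        self.fc = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Dropout(0.5),
                                nn.Linear(32, 32), nn.Dropout(0.5))

    def forward(self, W):                  # W: (batch, 15, 300) word representation matrix
        x = self.conv(W.transpose(1, 2))   # (batch, 32, 15)
        x = x.max(dim=2).values            # (batch, 32) global max pooling (assumption)
        return self.fc(x)                  # f_q or f_t: (batch, 32)

# f = torch.cat([repr_net(Q), repr_net(T)], dim=1)   # concatenated vector f, 64-dim
# s2 = (w * f).sum(dim=1)                            # step 6: dot product w . f
```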
  • Step 6 Calculate the semantic matching score of search keywords and search results.
  • the combined semantic representation vector and the corresponding semantic matching weight vector are subjected to a dot product operation to obtain a semantic matching score of the search result.
  • That is, the semantic matching score of the search keyword and the search result is s2 = w · f, where · represents the vector dot product operation.
  • Step 7 Calculate the similarity between the search keyword and the search result as s = s1 + s2, where s1 is the exact matching score and s2 is the semantic matching score.
  • a new similarity model algorithm is provided.
  • First, an exact matching interaction matrix is calculated and, considering the influence of text word order and context, the relative positional relationships of the participles in the segmentation results are determined; the interaction features are determined based on the interaction matrix, and the exact matching score is determined from them. This deeply mines the structural information of the text and makes the exact matching result more in line with user needs. The exact matching interaction matrix is also introduced into the semantic matching part to determine the semantic matching weight vector, and the semantic matching score is determined according to the semantic matching weight vector and the semantic representation vector. The semantic matching weight vector can thus be adjusted according to the user's different search needs, improving the accuracy and reliability of the similarity calculation, so that the final similarity better matches the user's search needs and sorting accuracy is improved.
  • The search result processing method in the embodiments of the present application is performed by a similarity model: the search keyword and each search result are processed through the similarity model to obtain the similarity between each search result and the search keyword. Whether the similarity model is accurate is therefore very important; the model needs to be trained before application, and a good model usually requires good training samples.
  • To this end, an embodiment of the present application also provides a method for obtaining more reliable training samples, processing the original data to obtain them, and, based on the different training samples, provides a multi-objective method for training and optimizing the similarity model across multiple scenarios, which can improve the accuracy of the similarity model and satisfy users' similarity calculations under different search needs.
  • The training process of the similarity model in the embodiments of the present application is described below. The training process is usually performed on a background server: since training each module of the model may be complicated and computation-heavy, implementing training on the background server allows the trained model and results to be applied to each intelligent terminal, realizing similarity calculation and sorting of search results.
  • As shown in FIG. 7, it is a schematic flowchart of a similarity model training method in an embodiment of the present application.
  • the method includes:
  • Step 700 Obtain the initial training sample set.
  • Step 700, when executed, specifically includes: first, obtaining an original search click record set, where each original search click record in the set includes at least a search keyword, an exposure list, and a click result, the click result being the clicked search result.
  • the original record data can be obtained from the original user search click behavior log.
  • each record is composed of a search keyword, an exposure list, and a user click result.
  • the exposure list includes multiple search results.
  • the user enters the search keyword "A" to obtain an exposure list, that is, multiple search results.
  • the user can click one of the search results according to the demand, that is, the search result that is clicked is used as the click result.
  • In practice, the acquired original search click record set is usually noisy: the search result a user clicks is not necessarily the search result the user requires. If the original record data were used directly for similarity model training, the noise would affect the reliability of the final trained model. Therefore, the embodiments of the present application set certain rules to filter the original record data, and the resulting data has more practical value.
  • Then, the original search click record set is filtered: the original search click records matching preset rules are filtered out, and the initial training sample set is obtained according to the filtered original search click record set.
  • each initial training sample in the obtained initial training sample set is consistent with the data format of the original search click record, that is, each training sample in the initial training sample set also includes at least a search keyword, an exposure list, and a click result.
  • As shown in FIG. 8, it is a flowchart of an initial sample set acquisition method in an embodiment of the present application. The original search click record set is filtered according to three main kinds of preset rules, sketched in code after this list:
  • The first aspect: filtering based on search behavior. Here the preset rule is: the search keyword is identical to the click result, and/or the search keyword does not meet a preset length range. For example, if the search keyword is exactly the same as the click result, the record is of no value for training the similarity model, so the corresponding original search click records are filtered out. For another example, if the search keyword does not meet the preset length range, e.g., it is too long or too short, it would increase the difficulty of data processing and also affect accuracy for similarity model training, so it is filtered out.
  • The second aspect: filtering based on click behavior. Here the preset rule is: the number of click results corresponding to the search keyword is greater than a first preset number, or the position of the click result in the exposure list is below a preset rank, or the number of original search click records with the same search keyword but different click results is greater than a second preset number. For example, if a user searches a keyword and clicks multiple search results in the returned exposure list, i.e., a single search corresponds to multiple clicks, the comparability of the click results is low and their correlation with the search keyword may not be high, so the reliability of the record is low, which is not conducive to training the similarity model. For another example, if the clicked result sits far down the exposure list, the click result may be unrelated to the search keyword; the user may just be searching or clicking at random, so the record is filtered out. For another example, if the number of records with the same search keyword but different click results exceeds the second preset number, i.e., many records share a search keyword but differ in click results, then in those records the correlation between click result and search keyword may be low, or the search keyword may have no strongly associated search result; such records would affect the accuracy of similarity model training, so in this embodiment of the present application they are filtered out.
  • The third aspect: filtering based on search results. Here the preset rule is: there is no click result in the original search click record, or the number of search results in the exposure list does not meet a preset number range. For example, if a user searches a keyword but exhibits no click behavior, the record has no click result, cannot be used for similarity model training, and needs to be filtered out. For another example, if the exposure list in the record is too long or too short, i.e., too many or too few search results were obtained, it is also not conducive to model training and is filtered out. In this way, by filtering the original search click data to remove data that is not conducive to training the similarity model, a relatively reliable initial training sample set can be obtained from a large number of original search click records.
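  • For illustration only, a minimal Python sketch of these per-record rules (all thresholds and the record layout are illustrative assumptions; the cross-record rule about identical keywords with differing click results needs aggregation over the whole log and is omitted here):

```python
def keep_record(rec, max_clicks=1, max_rank=10, kw_len=(2, 30), list_len=(3, 50)):
    """Return True if an original search click record survives the three rule families."""
    kw, exposure, clicks = rec["keyword"], rec["exposure"], rec["clicks"]
    # Rule 1: search behavior -- keyword equals a click result, or bad keyword length.
    if any(kw == c for c in clicks) or not (kw_len[0] <= len(kw) <= kw_len[1]):
        return False
    # Rule 2: click behavior -- too many clicks, or a click ranked too low.
    if len(clicks) > max_clicks:
        return False
    if any(exposure.index(c) >= max_rank for c in clicks if c in exposure):
        return False
    # Rule 3: search results -- no click at all, or exposure list size out of range.
    if not clicks or not (list_len[0] <= len(exposure) <= list_len[1]):
        return False
    return True
```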
  • Step 710 Construct a binary data training set according to the initial training sample set, and train the initialized similarity model according to the binary data training set to obtain the trained first similarity model.
  • Each item of binary (pairwise) data in the binary data training set includes at least a search keyword, a search result, and a label indicating whether the search keyword and the search result are similar.
  • The binary data is pairwise data: a piece of pairwise data includes two texts and a label, for example 0 or 1, indicating whether the two texts are similar. For example, the pairwise data (A, B, 1) means that text A and text B are similar.
  • The principle of training the similarity model on binary data is to assume that search results fall into two classes from the perspective of text matching: one class satisfies the user's search keyword, and the pair of search keyword and similar click result is labeled 1; the other class does not, and the pair of search keyword and dissimilar result is labeled 0.
  • the data of the initial training sample set can be divided into two categories.
  • a binary classification model can be used for classification.
  • the binary classification model can be optimized according to the traditional training method of supervised classification, so that the initial training sample set can be divided into two categories based on the optimized binary classification model to construct a binary data training set.
  • The data format of each initial training sample in the initial training sample set can be {search keyword, click result, unclicked search results}, where there can be multiple unclicked search results, i.e., the search results in the exposure list other than the click result. Of course, the data form of the initial training sample is not limited; in this embodiment of the present application, it is only necessary to be able to obtain, from each initial training sample, the search keyword, the click result, and the unclicked search results in the exposure list.
  • Step 710, when executed, specifically includes the following. First, a binary data training set is constructed from the initial training sample set. Specifically, for each initial training sample: if the text matching similarity between the search keyword and the click result is greater than a first preset threshold, the unclicked search results whose text matching similarity to the search keyword is not less than a second preset threshold are determined, and the determined search results are filtered out of the unclicked search results. If the filtered unclicked search results are not empty, the search keyword and the click result form a positive sample pair with label 1; one search result is randomly selected from the filtered unclicked search results, and the search keyword and the randomly selected search result form a negative sample pair with label 0.
  • The first preset threshold and the second preset threshold can be set according to actual conditions, and are not limited in the embodiments of the present application. Label 1 indicates similarity, and label 0 indicates dissimilarity.
  • a positive sample pair and a negative sample pair can be generated according to each initial training sample in the initial training sample set, thereby obtaining a binary data training set.
  • the initialized similarity model is trained according to the binary data training set, and the first similarity model after training is obtained.
  • Specifically, the input of the initialized similarity model is the binary data training set, and the output is a similarity result; the objective of training the initialized similarity model is to minimize the loss function between the output similarity results and the labels of the binary data training set.
  • the Adam optimization algorithm and binary cross-entropy loss optimization objective can be used to continuously optimize and train the similarity model until convergence, that is, the loss function continues to decline and tends to be stable.
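  • For illustration only, a minimal PyTorch sketch of this classification stage (SimilarityModel, pair_loader, and num_epochs are hypothetical names; folding the sigmoid into BCEWithLogitsLoss is an implementation assumption):

```python
import torch
import torch.nn as nn

model = SimilarityModel()          # hypothetical module implementing steps 200-230
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()       # binary cross-entropy optimization objective

for epoch in range(num_epochs):
    for keyword, result, label in pair_loader:   # binary (pairwise) training data
        score = model(keyword, result)           # similarity s = s1 + s2
        loss = bce(score, label.float())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()           # iterate until the loss flattens out (convergence)
```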
  • In the embodiments of the present application, the binary data training set indicates whether a search keyword and a search result are similar, dividing the data into similar and dissimilar classes. That is, the similarity model is trained with a classification optimization objective, which continuously improves the model's accuracy in judging whether search results and search keywords are similar.
  • Step 720 Construct a training set of triple data according to the initial training sample set, and train a first similarity model according to the training set of triple data to obtain a second similarity model after training.
  • Each triplet in the triplet data training set includes at least a search keyword, a first search result, and a second search result, where the similarity between the search keyword and the first search result is greater than the similarity between the search keyword and the second search result.
  • The training process of step 720 builds on the training result of step 710. In step 710, the data is divided into two classes, similar and dissimilar, to train the similarity model; in practice, however, there are many cases that cannot simply be divided by such a classification criterion.
  • Therefore, in the embodiments of the present application, the first similarity model is further trained with a ranking optimization objective: a triplet data training set is constructed, and the first similarity model is trained again on it. The principle of this objective is to optimize ranking results using the fact that different texts have different degrees of similarity to the same text. For example, given triplet data (search keyword, doc1, doc2) where the matching degree of (search keyword, doc1) is known to be higher than that of (search keyword, doc2), training on such triplets lets the similarity model assign (search keyword, doc1) a higher similarity score than (search keyword, doc2), so that among the search results corresponding to the keyword, doc1 is ranked above doc2.
  • Step 720, when executed, specifically includes the following. First, a triplet data training set is constructed from the initial training sample set; specifically, the following processing is performed for each initial training sample in the initial training sample set:
  • An initial training sample includes a search keyword, an exposure list, and the click result clicked in the exposure list. The position of the click result in the exposure list is its rank in that list; for example, the position of the click result corresponding to the i-th initial training sample in the exposure list is p_i. From the records, the frequency with which each position in the exposure list is clicked can be counted, and from these positional click frequencies a confidence can be estimated for each search result i clicked at position j in its corresponding exposure list.
  • The click result with the highest confidence is taken as the first search result of the triplet data, and the unclicked search result with the lowest confidence is taken as the second search result of the triplet data.
  • the first similarity model is trained according to the triple data training set, and the second similarity model after training is obtained.
  • Specifically, the input of the first similarity model is the triplet data training set, and the output is two similarity results; the objective of training the first similarity model is to minimize the loss function between the ordering of the two output similarities and the ordering given by the triplet data training set.
  • the Adam optimization algorithm and triplet hinge loss optimization goal can be used to continuously optimize and train the similarity model until convergence, that is, the loss function keeps decreasing and tends to be stable.
  • Among them, triplet hinge loss is a loss function in deep learning, used to train samples with small differences; it involves an anchor example, a positive example, and a negative example, and realizes similarity calculation on such samples by optimizing the objective that the distance between the anchor and the positive example be smaller than the distance between the anchor and the negative example. If the distance between the positive example and the anchor is d+ and the distance between the negative example and the anchor is d-, the triplet hinge loss is max(0, d+ - d- + α), where α is a preset non-negative parameter.
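  • For illustration only, a minimal PyTorch sketch of this objective (taking a pair's "distance" as its negated similarity score, so that d+ - d- = s_neg - s_pos, is an implementation assumption, as is the value of α):

```python
import torch

def triplet_hinge_loss(s_pos, s_neg, alpha=0.1):
    """max(0, d+ - d- + alpha), written in terms of similarity scores."""
    return torch.clamp(s_neg - s_pos + alpha, min=0).mean()

# s_pos = model(keyword, doc1); s_neg = model(keyword, doc2)
# Minimizing the loss pushes model(keyword, doc1) above model(keyword, doc2),
# so doc1 is ranked higher among the keyword's search results.
```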
  • In the embodiments of the present application, the triplet data training set indicates which of two search results is more similar to the search keyword. Training the first similarity model again on it, i.e., training the similarity model with the ranking optimization objective, makes the similarity results produced by the model more accurate for ranking.
  • In this way, the similarity model can be trained on the constructed training sets with two objectives optimized in turn, each improved in combination with the current search scenario, so that the finally trained similarity model can take into account both the user's general demand and addressing demand, satisfy different search scenarios, and produce similarity results that better meet users' actual needs.
  • As shown in FIG. 9, it is a flowchart of another similarity model training method in an embodiment of the present application.
  • Step 900 Obtain the original search click record set.
  • Step 901 Filter the original search click record set according to preset rules.
  • Step 902 Obtain the filtered initial training sample set.
  • Step 903 Construct a binary data training set.
  • Each item of binary (pairwise) data in the binary data training set includes at least a search keyword, a search result, and a label indicating whether the search keyword and the search result are similar.
  • Step 904 Based on the classification optimization goal, obtain the trained first similarity model.
  • the initialized similarity model is trained to obtain the first similarity model.
  • Step 905 Construct a training set of triple data.
  • Each triplet in the triplet data training set includes at least a search keyword, a first search result, and a second search result, where the similarity between the search keyword and the first search result is greater than the similarity between the search keyword and the second search result.
  • Step 906 Train the first similarity model based on the ranking optimization goal.
  • the first similarity model is trained to obtain the trained second similarity model.
  • Step 907 Obtain the similarity model obtained by the final training.
  • the second similarity model obtained after training is the similarity model obtained in the final training.
  • each target can be optimized separately in a phased training mode, or each target can be optimized in a simultaneous training mode or an alternating training mode. This is not limiting.
  • That is, steps 900-902 constitute the first stage, obtaining the initial training sample set; steps 903-904 constitute the second stage, training with the first optimization objective, i.e., training the similarity model with the classification optimization objective; and steps 905-906 constitute the third stage, training with the second optimization objective, i.e., training the first-stage model again with the ranking optimization objective, finally obtaining the similarity model trained with both optimization objectives.
  • In the training process of the similarity model, a key part is constructing the training sets, which is described below.
  • As shown in FIG. 10, it is a flowchart of a method for constructing a binary data training set in an embodiment of the present application, including:
  • Step 1000 Input the initial training sample set.
  • Step 1001 determine whether the initial training sample set is empty, if yes, perform step 1009, otherwise, perform step 1002.
  • Step 1002 Take an initial training sample.
  • Step 1003 Determine whether the text matching similarity between the search keyword and the click result is greater than the first preset threshold; if so, perform step 1004; otherwise, return to step 1001.
  • the text matching similarity here is text literal similarity.
  • The first preset threshold is, for example, 0.01, and can be set according to the actual situation.
  • Step 1004 Determine the unclicked search results whose text matching similarity to the search keyword is not less than the second preset threshold, and filter the determined search results out of the unclicked search results.
  • the second preset threshold is 0.8, for example, and can be set according to actual conditions.
  • Step 1005: Determine whether the filtered unclicked search results are non-empty; if so, perform steps 1006 and 1007; otherwise, return to step 1001.
  • Step 1006: Form a positive sample pair from the search keyword and the clicked result.
  • Step 1007: Randomly select one search result from the filtered unclicked search results, and form a negative sample pair from the search keyword and the randomly selected search result.
  • Step 1008: Output the positive sample pair and the negative sample pair to the two-tuple data training set.
  • Step 1009: Output the final two-tuple data training set.
  • In other words, for each initial training sample from which a positive sample pair and a negative sample pair can be constructed, the pairs are merged into the two-tuple data training set; each initial training sample in the set is processed in turn, and once all of them have been processed, the final two-tuple data training set is obtained, as sketched below.
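  • The following is a minimal sketch of this construction, assuming each initial training sample is a (search keyword, clicked result, unclicked results) triple and sim is some literal text-similarity function with values in [0, 1]; apart from the thresholds 0.01 and 0.8 given above, all names are illustrative assumptions:

    import random

    def build_pair_training_set(samples, sim, t1=0.01, t2=0.8):
        """Sketch of FIG. 10: one positive and one negative pair per usable sample."""
        pairs = []  # (search keyword, search result, label) two-tuples
        for query, clicked, unclicked in samples:
            # step 1003: the clicked result must have some literal overlap
            if sim(query, clicked) <= t1:
                continue
            # step 1004: drop unclicked results that match the query almost as well
            candidates = [d for d in unclicked if sim(query, d) < t2]
            if not candidates:
                continue
            pairs.append((query, clicked, 1))                    # step 1006
            pairs.append((query, random.choice(candidates), 0))  # step 1007
        return pairs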
  • Referring to FIG. 11, a flowchart of the method for constructing the triplet data training set in an embodiment of the present application, the method includes:
  • Step 1100: Input the initial training sample set.
  • Step 1101: Determine whether the initial training sample set is empty; if so, perform step 1108; otherwise, perform step 1102.
  • Step 1102: Take one initial training sample.
  • Step 1103: Determine whether the confidence of the clicked result is the highest in the exposure list and is not less than the preset score value; if so, perform step 1104; otherwise, return to step 1101.
  • The preset score value is, for example, 1, and can be set according to the actual situation.
  • Step 1104: Determine the search result with the lowest confidence among the unclicked search results.
  • Step 1105: Determine whether the confidence of that lowest-confidence search result is less than the preset score value; if so, perform step 1106; otherwise, return to step 1101.
  • Step 1106: Combine the search keyword, the determined highest-confidence clicked result, and the determined lowest-confidence unclicked search result into one triplet.
  • Step 1107: Output the formed triplet to the triplet data training set.
  • Step 1108: Output the final triplet data training set; a sketch of this construction follows.
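  • A sketch of this construction under similar assumptions (confidence is an assumed lookup for the precomputed click confidence of a search result, and the preset score value of 1 follows the text):

    def build_triplet_training_set(samples, confidence, preset_score=1.0):
        """Sketch of FIG. 11: (keyword, best clicked, worst unclicked) triplets."""
        triplets = []
        for query, clicked, unclicked in samples:
            if not unclicked:
                continue
            c_click = confidence(clicked)
            # step 1103: the clicked result's confidence must be the strict
            # maximum in the exposure list and at least the preset score value
            if c_click < preset_score or any(confidence(d) >= c_click for d in unclicked):
                continue
            worst = min(unclicked, key=confidence)  # step 1104
            if confidence(worst) < preset_score:    # step 1105
                triplets.append((query, clicked, worst))  # step 1106
        return triplets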
  • In this way, the similarity model is trained on the different training sample sets and optimization objectives, and the finally trained model can be applied in practice. For example, the trained similarity model can be provided to an online service and applied to the search result ranking scenario as follows:
  • Step 1: Input the user's search keyword and all the recalled search results.
  • Step 2: Input the search keyword and each search result into the similarity model in turn, computing the similarity between the search keyword and each search result separately.
  • Step 3: Provide the similarity result as a feature to the ranking model, which finally obtains the ranked search results based at least on the similarity result.
  • The ranking model itself is not limited here: it can rank the search results by similarity from large to small, and it can also combine other features or factors to finally obtain the ranked search results, as sketched below.
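  • A minimal sketch of this online flow, in which the assumed model.similarity call stands in for the trained similarity model and the "ranking model" is reduced to a plain sort by similarity (a production ranker could combine further features, as noted above):

    def rank_results(query, results, model):
        """Steps 1-3: score every recalled result, then sort large to small."""
        scored = [(model.similarity(query, doc), doc) for doc in results]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in scored]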
  • In this way, in the embodiments of the present application, the training samples of the similarity model are more reliable, and training in the multi-objective mode makes the resulting similarity model more reliable as well. The similarity model algorithm combines exact matching and semantic matching and, to account for different search needs, sets a semantic matching weight vector, so that the similarity computed by the model is more accurate and the search results ranked by that similarity better match the user's search needs; the top-ranked results better satisfy the user, improving search efficiency and saving time.
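  • For reference, the combination at the heart of this design reduces to s = s1 + s2, with s2 the dot product of the semantic matching weight vector and the concatenated semantic representation vector. A toy sketch (the 64-dimensional vectors mirror the description; the values here are random placeholders):

    import numpy as np

    def final_similarity(s_exact, w, f):
        """s = s1 + s2, where s2 is the dot product of w and f."""
        return float(s_exact + np.dot(w, f))

    w = np.random.rand(64)  # semantic matching weight vector
    f = np.random.rand(64)  # concatenated semantic representation vector
    print(final_similarity(1.7, w, f))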
  • Based on the above embodiments, a specific application scenario is described below. The search result processing method in the embodiments of the present application can be applied to vertical search within the "Search" ("搜一搜") feature of WeChat, for example the official account search scenario there.
  • The Search feature lets users search Moments, articles, official accounts, novels, music, stickers, and other information by search keyword.
  • Referring to FIG. 12, a schematic diagram of the Search interface in this application scenario: the left diagram of FIG. 12 shows the main entry interface of Search in WeChat.
  • Tapping the "Search" function jumps to the corresponding interface, shown in the right diagram of FIG. 12 as the main interface after entering Search, where the user can enter search keywords to look up Moments, articles, official accounts, and other information.
  • Referring to FIG. 13, a schematic diagram of a search result display interface in this application scenario: the user enters the search keyword "跳一跳" and obtains the corresponding search results.
  • As shown in the left diagram of FIG. 13, the box indicates the current search result type: "All" includes every search result type, while "Mini Program", "Official Account", "Sticker", and so on each contain only a single result type, i.e., a vertical search scene.
  • As shown in the right diagram of FIG. 13, tapping "Official Account" yields the official account search results for "跳一跳", and the obtained search results form an exposure list.
  • In this scenario, the user's search needs involve many aspects and the final click behavior is influenced by many factors, so accurately ranking the search results recalled for a search keyword and presenting them to the user is what matters most. Therefore, in the embodiments of the present application, the similarity model is mainly used to compute the similarity between the search keyword and each search result and to rank the results accordingly; displaying the ranked search results better mines and satisfies the user's search needs, placing the results that satisfy those needs nearer the top of the final display for easy access and improved efficiency.
  • Of course, the embodiments of the present application are not limited to vertical search scenarios such as official account search and mini program search in WeChat's Search; they can also be applied to other scenarios, such as official account search in QQ, Toutiao account search in Jinri Toutiao, life account and mini program search in Alipay, app search in mobile app markets such as the Android market and the App Store, and video search on video websites and in short video apps, none of which is limited in the embodiments of this application.
  • Based on the above embodiments, another specific application scenario is described. Referring to FIG. 14, a schematic diagram of a ranked search result display interface: the user performs a vertical search through the "Search" feature of the WeChat app, here the official account vertical search scenario.
  • Suppose the user wants to search for the official account "伯凡时间" but is unsure of the exact characters, or mistypes, and, as shown in FIG. 14, enters "博凡时间" as the search keyword. Then, based on the search result processing method in the embodiments of the present application, the input search keyword and each search result are processed to obtain the similarity of each search result to the search keyword, the results are ranked, and the ranked search results are displayed in order from front to back.
  • As shown in FIG. 14, the top-ranked result is "伯凡时间". This shows that even if one character of the search keyword is entered incorrectly, a more similar search result can still be recognized and displayed to the user; the method has good fault tolerance, and the top-ranked search results better match the user's search needs and are more accurate.
  • Performance analysis was carried out comparing the method described in the embodiments of the present application with traditional techniques.
  • For the paraphrase identification task, the goal is to determine whether two texts have the same meaning. On two different data sets, the accuracy of matching between search results and search keywords and the F1-score, a metric of binary classification precision, were computed.
  • The specific values are shown in Table 1. It can be seen that, for these two data sets, the method described in the embodiments of the present application clearly improves both accuracy and F1-score.
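  • As a reference sketch of these two statistics (standard definitions, not code from the patent):

    def accuracy_and_f1(y_true, y_pred):
        """Accuracy and F1-score for binary labels."""
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        acc = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return acc, f1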
  • For the question answering task, the goal is to rank documents that better match the question higher. The metrics for performance evaluation include mean average precision (MAP), mean reciprocal rank (MRR), and normalized discounted cumulative gain (NDCG).
  • The performance results for the two data sets are shown in Table 2. It can be seen that, for these two data sets, the method described in the embodiments of the present application clearly improves all three metrics, which demonstrates that local interaction information is very important for short text matching.
  • In addition, for the search scenario in which short search keywords are used to retrieve short texts, Table 3 gives the corresponding performance data. It can be seen that on the MAP and MRR metrics the method described in the embodiments of the present application achieves an improvement of more than 45%, while on NDCG it achieves an improvement of more than 70%.
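  • For reference, hedged sketches of the three ranking metrics under their standard definitions (binary relevance labels are assumed; MAP is the mean of the per-query average precision):

    import math

    def mean_reciprocal_rank(queries):
        """queries: per-query lists of 0/1 labels in ranked order."""
        total = 0.0
        for labels in queries:
            for rank, rel in enumerate(labels, start=1):
                if rel:
                    total += 1.0 / rank
                    break
        return total / len(queries)

    def average_precision(labels):
        hits, acc = 0, 0.0
        for rank, rel in enumerate(labels, start=1):
            if rel:
                hits += 1
                acc += hits / rank
        return acc / hits if hits else 0.0

    def ndcg(labels, k=10):
        dcg = sum(rel / math.log2(rank + 1)
                  for rank, rel in enumerate(labels[:k], start=1))
        ideal = sum(rel / math.log2(rank + 1)
                    for rank, rel in enumerate(sorted(labels, reverse=True)[:k], start=1))
        return dcg / ideal if ideal else 0.0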
  • Based on the above embodiments, and referring to FIG. 15, the search result processing apparatus in an embodiment of the present application specifically includes:
  • an acquisition module 1500, configured to obtain each search result according to the search keyword;
  • an exact matching module 1510, configured to obtain, for each search result, an exact matching score of the search result;
  • a semantic matching module 1520, configured to determine, for each search result, a semantic matching weight vector of the search result and semantic representation vectors of the search keyword and the search result, and to obtain a semantic matching score of the search result according to the semantic representation vectors and the semantic matching weight vector;
  • an obtaining module 1530, configured to obtain, for each search result, the similarity between the search result and the search keyword according to the exact matching score and the semantic matching score.
  • In another embodiment of the present application, the search result processing apparatus further includes:
  • a model training module 1540, configured to create and train a similarity model in advance, wherein the exact matching module, the semantic matching module, and the obtaining module perform processing according to the similarity model obtained by the model training module.
  • Referring to the further apparatus structure shown in FIG. 16, the model training module 1540 includes:
  • an acquisition module 1600, configured to acquire an initial training sample set;
  • a first training module 1610, configured to construct a two-tuple data training set according to the initial training sample set, and to train an initialized similarity model according to the two-tuple data training set, obtaining a trained first similarity model;
  • a second training module 1620, configured to construct a triplet data training set according to the initial training sample set, and to train the first similarity model according to the triplet data training set, obtaining a trained second similarity model.
  • Referring to FIG. 17, a schematic structural diagram of a terminal in an embodiment of the present application, the terminal specifically includes:
  • a first receiving module 1700, configured to receive an input search keyword;
  • a sending module 1710, configured to send the received search keyword to a server, so that the server executes any of the above search result processing methods, obtains the similarity between each search result and the search keyword, and obtains ranked search results according to those similarities;
  • a second receiving module 1720, configured to receive the ranked search results returned by the server;
  • a display module 1730, configured to display the ranked search results.
  • Referring to FIG. 18, a schematic structural diagram of an electronic device in an embodiment of the present application. An embodiment of the present application provides an electronic device, which may include a processor 1810 (Central Processing Unit, CPU), a memory 1820, an input device 1830, an output device 1840, and the like.
  • The memory 1820 may include read-only memory (ROM) and random access memory (RAM), and provides the processor 1810 with the program instructions and data stored in the memory 1820. In this embodiment of the present application, the memory 1820 may be used to store the program of the training sample generation method in the embodiments of the present application.
  • The processor 1810 calls the program instructions stored in the memory 1820 and is configured to execute, according to the obtained program instructions, any of the search result processing methods in the embodiments of the present application.
  • Based on the above embodiments, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed by a processor, the search result processing method in any of the above method embodiments is implemented.
  • Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
  • These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in that memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on it to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • Obviously, those skilled in the art can make various changes and modifications to the embodiments of the present application without departing from their spirit and scope. If such modifications and variations fall within the scope of the claims of the present application and their equivalent technology, the present application is intended to cover them as well.

Abstract

The present application relates to a search result processing method, apparatus, terminal, electronic device, and storage medium. The method includes: obtaining each search result according to a search keyword; and, for each search result, performing the following processing: obtaining an exact matching score of the search result; determining a semantic matching weight vector of the search result and semantic representation vectors of the search keyword and the search result, and obtaining a semantic matching score of the search result according to the semantic representation vectors and the semantic matching weight vector; and obtaining the similarity between the search result and the search keyword according to the exact matching score and the semantic matching score.


Claims (20)

  1. A search result processing method, performed by an electronic device, comprising:
    obtaining each search result according to a search keyword;
    for each search result, performing the following processing:
    obtaining an exact matching score of the search result;
    determining a semantic matching weight vector of the search result and semantic representation vectors of the search keyword and the search result, and obtaining a semantic matching score of the search result according to the semantic representation vectors and the semantic matching weight vector;
    obtaining the similarity between the search result and the search keyword according to the exact matching score and the semantic matching score.
  2. The method according to claim 1, wherein obtaining the exact matching score of the search result specifically comprises:
    obtaining, by multi-granularity word segmentation, the word segmentation results corresponding to the search keyword and to the search result respectively;
    exactly matching the word segmentation result of the search keyword against that of the search result to obtain an interaction matrix of the search keyword and the search result;
    obtaining interaction features of the search keyword and the search result according to the relative positional relationships of the segments in the two word segmentation results and to the interaction matrix;
    obtaining the exact matching score of the search result according to the interaction features.
  3. The method according to claim 2, wherein obtaining the exact matching score of the search result according to the interaction features specifically comprises:
    inputting the interaction features into a pre-trained first convolutional network to obtain the convolutional features output after the first convolutional network performs feature extraction on the interaction features;
    inputting the convolutional features into a pre-trained first fully connected network to obtain a one-dimensional vector output by the first fully connected network, and using the one-dimensional vector as the exact matching score of the search result.
  4. The method according to claim 2, wherein determining the semantic matching weight vector of the search result specifically comprises:
    inputting the interaction matrix into a pre-trained second convolutional network to obtain the convolutional features output after the second convolutional network performs feature extraction on the interaction matrix;
    inputting the convolutional features into a pre-trained second fully connected network to obtain a vector of a preset dimension output by the second fully connected network, and using the vector of the preset dimension as the semantic matching weight vector.
  5. The method according to claim 1, wherein determining the semantic representation vectors of the search keyword and the search result specifically comprises:
    obtaining, by multi-granularity word segmentation, the word segmentation results corresponding to the search keyword and to the search result respectively;
    obtaining, according to a preset word vector model, the word vector of each segment in the word segmentation result of the search keyword, and obtaining a word representation matrix of the search keyword from those word vectors; inputting the word representation matrix into a pre-trained third convolutional network, which outputs the corresponding convolutional features; inputting the convolutional features into a pre-trained third fully connected network, which outputs a vector of a preset dimension as the semantic representation vector of the search keyword;
    obtaining, according to the word vector model, the word vector of each segment in the word segmentation result of the search result, and obtaining a word representation matrix of the search result from those word vectors; inputting the word representation matrix into the third convolutional network, which outputs the corresponding convolutional features; inputting the convolutional features into the third fully connected network, which outputs a vector of a preset dimension as the semantic representation vector of the search result.
  6. The method according to claim 2 or 5, wherein obtaining, by multi-granularity word segmentation, the word segmentation results corresponding to the search keyword and to the search result specifically comprises:
    according to a preset high-frequency word set, if the search keyword and the search result contain high-frequency words from the preset high-frequency word set, dividing out the high-frequency words and dividing the rest of the text character by character, to obtain the word segmentation results corresponding to the search keyword and to each search result respectively.
  7. The method according to claim 1, wherein obtaining the semantic matching score of the search result according to the semantic representation vectors and the semantic matching weight vector specifically comprises:
    concatenating the semantic representation vector of the search keyword with the semantic representation vector of the search result to obtain a concatenated semantic representation vector;
    performing a dot product operation between the concatenated semantic representation vector and the semantic matching weight vector of the search result to obtain the semantic matching score of the search result.
  8. The method according to claim 1, further comprising:
    creating and training a similarity model in advance, wherein the search keyword and each search result are input into the trained similarity model, which outputs the similarity between each search result and the search keyword.
  9. The method according to claim 8, wherein pre-training the similarity model comprises:
    acquiring an initial training sample set;
    constructing a two-tuple data training set according to the initial training sample set, and training an initialized similarity model according to the two-tuple data training set to obtain a trained first similarity model;
    constructing a triplet data training set according to the initial training sample set, and training the first similarity model according to the triplet data training set to obtain a trained second similarity model.
  10. The method according to claim 9, wherein acquiring the initial training sample set comprises:
    acquiring an original search click record set according to users' search click behavior;
    filtering the original search click record set according to preset rules to obtain the initial training sample set.
  11. The method according to claim 9, wherein each two-tuple in the two-tuple data training set includes at least a search keyword, a search result, and a label indicating whether the search keyword and the search result are similar.
  12. The method according to claim 9, wherein each triplet in the triplet data training set includes at least a search keyword, a first search result, and a second search result, the similarity between the search keyword and the first search result being greater than the similarity between the search keyword and the second search result.
  13. The method according to claim 9, wherein each training sample in the initial training sample set includes at least a search keyword, an exposure list, and a clicked result, the exposure list being a list of multiple search results and the clicked result being the search result that was clicked;
    and constructing the two-tuple data training set according to the initial training sample set specifically comprises:
    determining the unclicked search results according to the exposure list;
    for each initial training sample in the initial training sample set, performing the following processing:
    if it is determined that the text matching similarity between the clicked result and the search keyword is greater than a first preset threshold, determining, among the unclicked search results, the search results whose text matching similarity with the search keyword is not less than a second preset threshold, and filtering the determined search results out of the unclicked search results;
    if the filtered unclicked search results are not empty, forming a positive sample pair from the search keyword and the clicked result;
    randomly selecting one search result from the filtered unclicked search results, and forming a negative sample pair from the search keyword and the randomly selected search result;
    constructing the two-tuple data training set according to the positive sample pairs and negative sample pairs corresponding to the initial training sample set.
  14. The method according to claim 9, wherein each training sample in the initial training sample set includes at least a search keyword, an exposure list, and a clicked result, the exposure list being a list of multiple search results and the clicked result being the search result that was clicked;
    and constructing the triplet data training set according to the initial training sample set specifically comprises:
    for each initial training sample in the initial training sample set, performing the following processing:
    determining the confidence of each search result in the exposure list corresponding to the initial training sample;
    if the confidence of a clicked result is the highest compared with the confidences of the other search results in the exposure list, the confidence of that clicked result is not less than a preset score value, and the confidence of the lowest-confidence unclicked search result is determined to be less than the preset score value, forming one triplet from the search keyword, the clicked result, and the determined lowest-confidence unclicked search result;
    constructing the triplet data training set according to the triplets corresponding to the initial training samples.
  15. A search result processing apparatus, comprising:
    an acquisition module, configured to obtain each search result according to a search keyword;
    an exact matching module, configured to obtain, for each search result, an exact matching score of the search result;
    a semantic matching module, configured to determine, for each search result, a semantic matching weight vector of the search result and semantic representation vectors of the search keyword and the search result, and to obtain a semantic matching score of the search result according to the semantic representation vectors and the semantic matching weight vector;
    an obtaining module, configured to obtain, for each search result, the similarity between the search result and the search keyword according to the exact matching score and the semantic matching score.
  16. The apparatus according to claim 15, further comprising:
    a model training module, configured to create and train a similarity model in advance, wherein the exact matching module, the semantic matching module, and the obtaining module perform processing according to the similarity model obtained by the model training module.
  17. The apparatus according to claim 16, wherein the model training module comprises:
    an acquisition module, configured to acquire an initial training sample set;
    a first training module, configured to construct a two-tuple data training set according to the initial training sample set, and to train an initialized similarity model according to the two-tuple data training set to obtain a trained first similarity model;
    a second training module, configured to construct a triplet data training set according to the initial training sample set, and to train the first similarity model according to the triplet data training set to obtain a trained second similarity model.
  18. A terminal, comprising:
    a first receiving module, configured to receive an input search keyword;
    a sending module, configured to send the received search keyword to a server, so that the server executes the method according to any one of claims 1-14, obtains the similarity between each search result and the search keyword, and obtains ranked search results according to the similarity between each search result and the search keyword;
    a second receiving module, configured to receive the ranked search results returned by the server;
    a display module, configured to display the ranked search results.
  19. An electronic device, comprising:
    at least one memory, configured to store program instructions;
    at least one processor, configured to call the program instructions stored in the memory and to execute, according to the obtained program instructions, the method according to any one of claims 1-14.
  20. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1-14.
PCT/CN2019/121928 2018-11-29 2019-11-29 搜索结果处理方法、装置、终端、电子设备及存储介质 WO2020108608A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/200,128 US11586637B2 (en) 2018-11-29 2021-03-12 Search result processing method and apparatus, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811444224.6 2018-11-29
CN201811444224.6A CN110162593B (zh) 2018-11-29 2018-11-29 一种搜索结果处理、相似度模型训练方法及装置

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/200,128 Continuation US11586637B2 (en) 2018-11-29 2021-03-12 Search result processing method and apparatus, and storage medium

Publications (1)

Publication Number Publication Date
WO2020108608A1 true WO2020108608A1 (zh) 2020-06-04

Family

ID=67645228

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/121928 WO2020108608A1 (zh) 2018-11-29 2019-11-29 搜索结果处理方法、装置、终端、电子设备及存储介质

Country Status (3)

Country Link
US (1) US11586637B2 (zh)
CN (1) CN110162593B (zh)
WO (1) WO2020108608A1 (zh)

Also Published As

Publication number Publication date
CN110162593B (zh) 2023-03-21
CN110162593A (zh) 2019-08-23
US11586637B2 (en) 2023-02-21
US20210224286A1 (en) 2021-07-22

Kind code of ref document: A1