CN114021019B - Retrieval method integrating personalized search and diversification of search results - Google Patents

Retrieval method integrating personalized search and diversification of search results

Info

Publication number
CN114021019B
CN114021019B (Application CN202111327439.1A)
Authority
CN
China
Prior art keywords
query
document
history
personalized
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111327439.1A
Other languages
Chinese (zh)
Other versions
CN114021019A (en)
Inventor
窦志成
刘炯楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Renmin University of China
Original Assignee
Renmin University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Renmin University of China filed Critical Renmin University of China
Priority to CN202111327439.1A priority Critical patent/CN114021019B/en
Publication of CN114021019A publication Critical patent/CN114021019A/en
Application granted granted Critical
Publication of CN114021019B publication Critical patent/CN114021019B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention realizes a retrieval method and a retrieval system integrating personalized search and diversification of search results. A personalized score and a diversified score are calculated for each document from the two perspectives of meeting the user's personalized needs and the novelty of the document. The final score of the document is obtained by assigning weights to the personalized score and the diversified score according to the similarity between the current query and the user history. By comprehensively considering the documents' different scores in the personalized and diversified aspects, the method improves user satisfaction with search results under ambiguous queries. A hierarchical information-extraction mechanism is designed with the transformer architecture, obtaining the personalized score from the user's search history and the diversified score from the candidate document set, and the model is trained with two different LambdaRank-style loss functions.

Description

Retrieval method integrating personalized search and diversification of search results
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a retrieval method integrating personalized search and diversification of search results.
Background
In search engines, the query submitted by a user is typically short and ambiguous and may cover multiple topics. Personalized search and search result diversification are the two main classes of methods for handling such ambiguous queries. Personalized search models the user's personal interests, mainly by analyzing the user's query history, to determine the intent of the current query. For example, when a user searches "apple", if the retrieval history contains searches for "Microsoft", "Google", etc., we can basically determine that the user's intent is Apple Inc., so documents about the fruit need not be shown. Current methods for modeling user history behavior fall roughly into two classes. The first class explicitly models user interests from the search history and scores candidate documents by combining their similarity to the user interests and to the current query: Ge et al. model user interests through hierarchical RNNs and attention mechanisms, Lu et al. introduce generative adversarial networks to train the model, and Ma et al. replace the ordinary RNN in HRNN with a time-aware RNN. The second class considers the current query and the history jointly to generate a revised query representation for matching documents: Zhou et al. introduce a transformer structure to let the query interact with the history, and Yao et al. re-model the query representation by personalizing word embeddings through the user history. Diversification methods do not consider the user's history; by returning a document ranking that covers all sub-topics of the current query, different users can find the documents they need in the same ranking. Because labeling sub-topics requires a large amount of work, implicit methods are relatively more convenient to deploy in large-scale search engines. Zhu et al. propose measuring dissimilarity between documents based on features such as title, body text, and anchor text of web pages, and introduce learning-to-rank into search result diversification. Xia et al. change the optimization objective from maximum likelihood to widening the gap between positive and negative samples, and further introduce tensor neural networks into feature extraction, converting the hand-designed features of [ ] into features extracted by a neural network. Qin et al. introduce the transformer into search result diversification, letting candidate documents interact through a transformer structure to evaluate the novelty of documents.
Existing models typically use only one of the two techniques, yet each has its own advantages and disadvantages. Search result diversification can return results that cover all sub-topics, mainly considering the novelty of each document, but it cannot take the current user's historical interests into account. Personalized search determines the intent of the current query by modeling the user's query history, but it tends to make the returned documents too similar, so after clicking the first document the user obtains no new information from the remaining ones. In practice, when a user decides whether to click and browse a document, both personalized interest and document novelty are considered; attending to only one aspect cannot achieve the best effect. The invention therefore considers combining search result diversification with personalized search, so as to return search results that both meet the user's personalized interests and ensure document diversity.
Disclosure of Invention
Therefore, the invention first provides a retrieval method integrating personalized search and diversification of search results, in which a personalized score and a diversified score are calculated for each document from the two angles of meeting the user's personalized demands and the novelty of the document; the scores can be used to re-rank the documents returned to the user by a search engine, so that the returned documents better meet the user's demands.
Specifically, assume that for query q the corresponding candidate document set is D and the corresponding user is U. The history H of user U can be divided into two parts, a short-term history H_s = {(q_1^s, D_1^s), ..., (q_|s|^s, D_|s|^s)} and a long-term history H_l = {(q_1^l, D_1^l), ..., (q_|l|^l, D_|l|^l)}. The short-term history H_s is the search history in the same session as the current query, where q_i^s denotes the i-th short-term history query and D_i^s the candidate documents corresponding to it; the long-term history H_l is the search history in all sessions before the current session, where q_i^l denotes the i-th long-term history query and D_i^l the corresponding candidate documents. All candidate document sets are defined to have the same size m as D. The diversified score and the personalized score are considered jointly when scoring a document, and the final document score is integrated using the degree of matching between the current query and the user history as a weight. The overall scoring formula is:

f(d|q,U,D) = λ(q,U) · S_per(d|q,U) + (1 − λ(q,U)) · S_div(d|D)

where S_per is the personalized scoring model, S_div is the diversified scoring model, and λ(q,U) is the weight calculation that considers the degree of matching between the query and the history.
for the representation of the document and the query, firstly, a traditional word2vec method is used to obtain the expression of all words in the user history, and for the current query and the scoring document, a mode of summing all words and a mode of interacting through a transducer structure and then summing are adopted to obtain the original expression q init ,d init And interactive expression q int ,d int
In the formulaRepresenting the expression of each word in the query and document, trm represents a transducer,
the method comprises the steps of designing a structure based on a transducer to obtain comprehensive representation of a corresponding query and candidate documents, enabling the query to be q, enabling the documents to be d, enabling all words in the query and all words in the candidate documents to be a long sentence, and enabling the long sentence to interact through the transducer structure, wherein the process can be expressed as follows:
wherein T is q Andword sequences representing all word components of the query and the document, and then adding the query and the word expression corresponding to each document to obtain the document and the query expression q of the vector level w ,d w :
A transformer that takes click and position information into account is used to let the resulting document expressions d_w interact, further optimizing the expressions of the documents:

D_v = Trm_doc(D_w + D_pos + D_clk)

where D_w is the document expression matrix obtained from the word-level transformer, D_pos and D_clk are learnable parameters representing the position information and the click information respectively, and D_v is the final document expression;
training of the model takes place in two ways.
The personalized scoring model obtains the history vector representing a query by summing the expressions of the query and its documents. The i-th query of the short-term history gives:

h_i^s = q_i^{s,w} + Σ_k d_{i,k}^{s,v}

and the i-th query of the long-term history gives:

h_i^l = q_i^{l,w} + Σ_k d_{i,k}^{l,v}

where q_i^w and d_{i,k}^v denote the query and document expressions respectively. This yields the history vector corresponding to each historical query, and the history vector sequences H_s and H_l corresponding to the short-term and long-term history.
For the user historical intent vectors, a hierarchical transformer structure is adopted: a [CLS] label is appended to the short-term history vector sequence, the sequence is passed through a transformer that takes position information into account, and the last vector, i.e. the vector at the [CLS] position, is taken as the user's short-term historical intent vector u_s: u_s = Trm_short([H_s, [CLS]] + [H_s, [CLS]]_pos)[|s|+1]; the short-term historical intent vector u_s is appended to the long-term history vector sequence, and the user's long-term historical intent vector u_l is obtained in the same way by a transformer that takes position information into account: u_l = Trm_long([H_l, u_s] + [H_l, u_s]_pos)[|l|+1];
The obtained historical intent vectors and the query vector are then fused through several gate mechanisms:

gate(x,y) = z·x + (1−z)·y;  z = σ(MLP([x;y]))
u_f = gate(u_s, u_l);  q_s = gate(u_s, q_int)
q_l = gate(u_l, q_int);  q_f = gate(u_f, q_int)

where u_f is the final user historical intent vector and q_s, q_l, q_f are the query vectors fused with the user's short-term historical intent, long-term historical intent, and final historical intent respectively. Vector-level similarities are obtained by computing cosine similarity, and KNRM is further used to compute the interaction-level similarities of q_init, d_init and q_int, d_int:

s_I(x,y) = KNRM(x,y)

Finally all the similarity features are concatenated, personalized scores are computed on them through an MLP, and the scores are combined to obtain the personalized score of the document.
the specific implementation mode of the diversified scoring model is as follows: first, z trainable weight matrices are designedWhere l represents the vector length. Through one of the weight matrices W i A similarity matrix between the documents is calculated as follows:
in the formula, the softmax function is completed on the column, and has z trainable weight matrixes, so that z similarity matrixes are obtained to form a tensor, and the formula is as follows:
extracting z similarity vectors s corresponding to the scoring document d from the similarity tensor [1:z] =S [1:z] [index(d)]Further, through the MLP layer as an aggregation function, the novelty scores of z aspects are extracted, and the tanh function is used as an activation function to be aligned with cosine similarity in the personalized scoring module:
ξ=tanh(ψ(s [1:z] ))
in the formulaRepresenting the novelty feature vector of the document in z dimensions, ψ represents the MLP layer, and then passing ζ through the MLP layer to obtain the final diversification score of the document:
where φ represents the MLP layer.
The weight λ(q,U) is calculated as the cosine similarity between the current query vector and the user historical intent vector.
the two modes of training are specifically as follows:
in the first way, the user's satisfaction clicks are considered as a whole. The user's satisfaction clicks on the document as positive examples and the rest as negative examples, the LambdaRank penalty function is directly used to train the model.
Let (d_i, d_j) be a document pair and Δ the difference in the evaluation metric between the two. If d_i should rank above d_j in the ideal ordering, then p_ij = 1, otherwise p_ij = 0. For convenience, f(d_i|q,I,D,D_s) is written as f(d_i). The loss function is as follows:

p̂_ij = σ(f(d_i) − f(d_j)),  L_unified = − Σ_{(d_i,d_j)∈S} |Δ| · (p_ij · log p̂_ij + (1 − p_ij) · log(1 − p̂_ij))

where p̂_ij can be regarded as the model-predicted probability that d_i is better than d_j, S is the set of all pairs (d_i, d_j) used for training, and L_unified is the overall loss of the first method.
In the second way, since regarding the clicking behavior of the user as a whole may be detrimental to the training of the model, it is split into personalized and diversified causes:

L_separate = L_per + α · L_div

where p̂_ij^per can be regarded as the model-predicted probability that d_i is better than d_j in the personalized aspect, and p̂_ij^div as the model-predicted probability that d_i is better than d_j in the diversified aspect; S_per is the set of all pairs (d_i, d_j) used for personalized training, and S_div is the set of all pairs used for diversified training; L_per and L_div can be regarded as the losses attributed to personalization and diversification respectively, each taking the same form as L_unified over its own pair set; L_separate is the overall loss of the second method, and α is the hyperparameter combining the two losses, generally taken as 0.5 in the model.
A document that was not clicked by the user historically but is very similar to a clicked document is regarded as a pseudo-click document; in the personalized loss L_per these documents are also treated as positive examples.
The invention has the technical effects that:
(1) A transformer-based structure is designed to extract information from the user history and the candidate documents in order to calculate a personalized score and a diversified score; weights are assigned according to the degree of correlation between the current query and the history, and the two scores are finally combined to obtain the score of the current document.
(2) Two different ways of training the model are proposed: regarding the user click behavior as a whole, or separately considering the effects of personalized interest and document novelty within it.
Drawings
FIG. 1 is the main architecture of the model.
Detailed Description
The following is a preferred embodiment of the present invention, and the technical solution of the present invention is further described with reference to the accompanying drawings, but the present invention is not limited to this embodiment.
The invention provides a retrieval method integrating personalized search and diversification of search results, which comprehensively considers whether the current document meets the user's intent and whether it differs from the other candidate documents. Our model calculates a personalized score and a diversified score for each document from the two perspectives of meeting the user's personalized needs and document novelty. Further, if the current query and the user history are highly correlated, we should naturally emphasize the personalized score of the document and reduce the proportion of the diversified score. Naturally, we can assign weights to the personalized score and the diversified score by calculating the similarity between the current query and the user history, thereby obtaining the final score of the document.
Assume for query q that the corresponding candidate document set is D and the corresponding user is U. The history H of user U can be divided into two parts, a short-term history H_s = {(q_1^s, D_1^s), ..., (q_|s|^s, D_|s|^s)} and a long-term history H_l = {(q_1^l, D_1^l), ..., (q_|l|^l, D_|l|^l)}. The short-term history H_s refers to the search history in the same session as the current query, where q_i^s denotes the i-th short-term history query and D_i^s the candidate documents corresponding to it. The long-term history H_l refers to the search history in all sessions before the current session, where q_i^l denotes the i-th long-term history query and D_i^l the corresponding candidate documents. For convenience of the subsequent expressions, we assume that all candidate document sets have the same size m. As described above, a satisfied click of a user often depends both on whether the user is interested in the document itself and on whether the document can provide new information, so we score the document by jointly considering the diversified score and the personalized score, and integrate the final document score using the degree of matching between the current query and the user history as the weight.

The overall scoring formula can be written as follows:

f(d|q,U,D) = λ(q,U) · S_per(d|q,U) + (1 − λ(q,U)) · S_div(d|D)

where S_per is the personalized scoring model, S_div is the diversified scoring model, and λ(q,U) is the weight calculation part that considers the degree of matching between the query and the history.
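As a concrete illustration of this scoring rule, the following minimal PyTorch sketch combines precomputed personalized and diversified scores. It is not the patented implementation; in particular, rescaling the cosine similarity into [0, 1] is our assumption, since the text only states that λ(q,U) is derived from the cosine similarity between the query and the user history.

import torch
import torch.nn.functional as F

def final_score(s_per: torch.Tensor,   # [m] personalized scores S_per
                s_div: torch.Tensor,   # [m] diversified scores S_div
                q_vec: torch.Tensor,   # [l] current query vector
                u_vec: torch.Tensor) -> torch.Tensor:  # [l] history intent vector
    # lambda(q,U): cosine similarity between query and history, rescaled to
    # [0, 1] (an assumption) so the two scores form a convex combination
    lam = (F.cosine_similarity(q_vec, u_vec, dim=0) + 1.0) / 2.0
    return lam * s_per + (1.0 - lam) * s_div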
For the representation of documents and queries, we first use the traditional word2vec method to obtain the embeddings of all words in the user history. For the current query and the document being scored, we obtain the original expressions q_init, d_init by summing all word embeddings, and the interactive expressions q_int, d_int by first interacting through a transformer structure and then summing:

q_init = Σ_i w_i^q,  d_init = Σ_j w_j^d;  [ŵ^q; ŵ^d] = Trm([w^q; w^d]),  q_int = Σ_i ŵ_i^q,  d_int = Σ_j ŵ_j^d

where w_i denotes the embedding of each word in the query or document and Trm denotes a transformer.
Because both personalization and diversification require information to be extracted from a query and its corresponding candidate documents (personalization models user interest from historical queries, and diversification models document novelty under the current query), we designed a transformer-based structure to obtain a comprehensive representation of the corresponding query and candidate documents. For ease of expression, we let this query be q and the document be d.
First, inspired by HTPS [4], we concatenate all words in the query and all words in the candidate documents into one long sentence, and let this long sentence interact through a transformer structure; we consider this operation beneficial for optimizing the representations of the words in the documents and the query. For example, when a user queries "apple cell phone", words in the documents that are related to digital technology help optimize the expression of "apple", and similarly, words in the query help optimize the expressions of the words in the documents. The process can be formulated as follows:

[T_q^v; T_d^v] = Trm_word([T_q; T_d])

where T_q and T_d are the word sequences consisting of all words of the query and the documents. After this word-level transformer, we sum the word expressions corresponding to the query and to each document to obtain the vector-level document and query expressions q_w, d_w:

q_w = Σ_i T_q^v[i],  d_w = Σ_j T_d^v[j]
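The word-level interaction step can be sketched in PyTorch as follows; the layer sizes, number of heads, and the single-document simplification are illustrative assumptions, not the patented configuration.

import torch
import torch.nn as nn

class WordLevelInteraction(nn.Module):
    # Concatenates the query's word embeddings and one document's word
    # embeddings into a single long sequence, lets them interact through a
    # transformer encoder, then sums per segment to obtain q_w and d_w.
    def __init__(self, dim: int = 64, heads: int = 4, layers: int = 1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.trm = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, T_q: torch.Tensor, T_d: torch.Tensor):
        # T_q: [n_q, dim] query word embeddings, T_d: [n_d, dim] document's
        seq = torch.cat([T_q, T_d], dim=0).unsqueeze(0)  # one long sentence
        out = self.trm(seq).squeeze(0)
        q_w = out[:T_q.size(0)].sum(dim=0)   # vector-level query expression
        d_w = out[T_q.size(0):].sum(dim=0)   # vector-level document expression
        return q_w, d_w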
As we said above, conventional personalized models often consider only the user's satisfied-click documents, while we consider that the satisfied-click documents alone may not be enough and may leave some cases indistinguishable. The figure at this point (omitted here) shows three rankings over documents from five sub-topics A, B, C, D, E. If only the clicked documents A_1, B_1 are considered, the three rankings receive identical expressions, yet they clearly should be distinguished: in ranking 2, the user may not have clicked A_2, A_3, A_4 because they are too similar to A_1, which is obviously a different reason from the one for not clicking documents C_1, D_1. To distinguish the three rankings, inspired by BERT, we use a transformer that takes click and position information into account to let the document expressions d_w obtained above interact, further optimizing the expressions of the documents:
D_v = Trm_doc(D_w + D_pos + D_clk)

where D_w is the document expression matrix obtained from the word-level transformer, and D_pos and D_clk represent the position information and the click information respectively; in our model they are learnable parameters. D_v is the document expression we finally obtain.
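A sketch of this click- and position-aware document transformer, under the assumption that D_pos and D_clk are realized as embedding tables (the patent only states that they are learnable parameters); max_docs and all sizes are illustrative.

import torch
import torch.nn as nn

class ClickAwareDocTransformer(nn.Module):
    # Adds learnable position (D_pos) and click (D_clk) embeddings to the
    # word-level document expressions D_w, then applies a document-level
    # transformer to obtain D_v.
    def __init__(self, dim: int = 64, max_docs: int = 50, heads: int = 4):
        super().__init__()
        self.pos = nn.Embedding(max_docs, dim)
        self.clk = nn.Embedding(2, dim)          # clicked / not clicked
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.trm = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, D_w: torch.Tensor, clicks: torch.Tensor):
        # D_w: [m, dim] document expressions; clicks: [m] 0/1 long tensor
        pos_ids = torch.arange(D_w.size(0), device=D_w.device)
        x = D_w + self.pos(pos_ids) + self.clk(clicks)
        return self.trm(x.unsqueeze(0)).squeeze(0)   # D_v: [m, dim]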
So far we have obtained the expressions of documents and queries through a hierarchical transformer structure. In the subsequent personalized and diversified scoring, the document and query expressions pass through different modules to obtain the final personalized score and diversified score.
Next, the personalized scoring model, the diversified scoring model, and the weight calculation part are described in turn. The main part of the whole model is shown in FIG. 1:
Personalized scoring model
Through the hierarchical transformer structure above, we obtain the query and document expressions q_w and D_v respectively. For each historical query and its corresponding candidate document set, we simply add the expressions of the query and its documents to obtain the history vector representing this query. Taking the i-th query of the short-term history and the j-th query of the long-term history as examples, the formulas are as follows:

h_i^s = q_i^{s,w} + Σ_k d_{i,k}^{s,v},  h_j^l = q_j^{l,w} + Σ_k d_{j,k}^{l,v}

where q^w and d^v denote the query and document expressions obtained above, d_{i,k}^{s,v} is the expression of the k-th document under the i-th query of the short-term history, and d_{j,k}^{l,v} is the expression of the k-th document under the j-th query of the long-term history. By the above formula we obtain the history vector corresponding to each historical query, and the history vector sequences H_s and H_l corresponding to the short-term and the long-term history.
Similar to previous personalized search methods, we consider that there is a hierarchical relationship between the long-term and the short-term history of the user. The short-term history reflects the user's intent in the current session and is relatively more accurate; the long-term history is more stable and can correct the short-term history to a certain extent. For the user's historical intent vectors, we use a hierarchical transformer structure:

We append a [CLS] label to the short-term history vector sequence, pass it through a transformer that takes position information into account, and take the last vector, i.e. the vector at the [CLS] position, as the user's short-term historical intent vector u_s:

u_s = Trm_short([H_s, [CLS]] + [H_s, [CLS]]_pos)[|s|+1]

We append the short-term historical intent vector u_s to the long-term history vector sequence and obtain the user's long-term historical intent vector u_l in the same way with a transformer that takes position information into account:

u_l = Trm_long([H_l, u_s] + [H_l, u_s]_pos)[|l|+1]
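The hierarchical intent encoding can be sketched as follows; treating [CLS] as a learnable vector and realizing the position encoding as an embedding table are our assumptions, and all sizes are illustrative.

import torch
import torch.nn as nn

class HierarchicalIntent(nn.Module):
    # Appends a [CLS] vector to the short-term history sequence H_s to read
    # out u_s, then appends u_s to the long-term sequence H_l to read out u_l.
    def __init__(self, dim: int = 64, max_len: int = 100, heads: int = 4):
        super().__init__()
        self.cls = nn.Parameter(torch.randn(1, dim))
        self.pos = nn.Embedding(max_len, dim)
        self.trm_short = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                       batch_first=True), num_layers=1)
        self.trm_long = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                       batch_first=True), num_layers=1)

    def _last(self, trm: nn.TransformerEncoder, seq: torch.Tensor):
        # add position embeddings and return the vector at the appended slot
        ids = torch.arange(seq.size(0), device=seq.device)
        return trm((seq + self.pos(ids)).unsqueeze(0)).squeeze(0)[-1]

    def forward(self, H_s: torch.Tensor, H_l: torch.Tensor):
        # H_s: [|s|, dim], H_l: [|l|, dim] history vector sequences
        u_s = self._last(self.trm_short, torch.cat([H_s, self.cls], dim=0))
        u_l = self._last(self.trm_long,
                         torch.cat([H_l, u_s.unsqueeze(0)], dim=0))
        return u_s, u_l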
Further, we fuse the obtained historical intent vectors and the query vector through several gate mechanisms:

gate(x,y) = z·x + (1−z)·y;  z = σ(MLP([x;y]))
u_f = gate(u_s, u_l);  q_s = gate(u_s, q_int)
q_l = gate(u_l, q_int);  q_f = gate(u_f, q_int)

where u_f is the final user historical intent vector, and q_s, q_l, q_f are the query vectors fused with the user's short-term historical intent, long-term historical intent, and final historical intent respectively. From the query vectors and document vectors obtained above, we get their vector-level similarities by computing cosine similarity (representation-based similarity). Further, we use KNRM to compute the interaction-level similarities (interaction-based similarity) of q_init, d_init and q_int, d_int. Finally all the features are concatenated, personalized scores are computed on them through an MLP, and the resulting scores are combined to obtain the personalized score S_per(d|q,U) of the document.
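The gate defined above admits a direct sketch; using a single linear layer for the MLP and an element-wise gate vector are assumptions, since the patent does not fix the MLP depth or whether z is a scalar.

import torch
import torch.nn as nn

class Gate(nn.Module):
    # gate(x, y) = z*x + (1-z)*y with z = sigmoid(MLP([x; y]))
    def __init__(self, dim: int = 64):
        super().__init__()
        self.mlp = nn.Linear(2 * dim, dim)  # one layer stands in for the MLP

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        z = torch.sigmoid(self.mlp(torch.cat([x, y], dim=-1)))
        return z * x + (1.0 - z) * y

# Usage mirrors the formulas above, with a separate Gate instance per fusion:
#   u_f = gate_u(u_s, u_l); q_f = gate_q(u_f, q_int), etc.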
Diversified scoring model
Existing diversified models do not incorporate the information of the current query when measuring the novelty of a document. Therefore, when the structure proposed in the personalized scoring model is used in the diversified scoring model, the query-related part is removed, and we finally obtain only the expressions D_v of the candidate documents under the current query.
We borrow the concept of the neural tensor network (NTN) to process the dissimilarity information of the current document.
First, we design z trainable weight matrices W_i ∈ R^{l×l}, where l is the vector length. Through one of the weight matrices W_i, we can calculate a similarity matrix between the documents, as follows:

S_i = softmax(D_v W_i D_v^T)

where the softmax function is performed over the columns. Since there are z trainable weight matrices, we obtain z similarity matrices, which form a tensor:

S_{[1:z]} = [S_1, ..., S_z]
at this time, we can extract z similarity vectors s corresponding to the scoring document d from the similarity tensor [1:z] =S [1:z] [index(d)]. Further, we extract the novelty scores of the z aspects through the MLP layer as an aggregation function, and further, in order to align with cosine similarity in the personalized scoring module, we use the tanh function as an activation function:
ξ=tanh(ψ(s [1:z] ))
in the formulaRepresenting the novelty feature vector of the document in z dimensions, ψ represents the MLP layer. Finally, we get the final diversification score of the document from ζ through the MLP layer:
where φ represents the MLP layer.
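The diversified scorer can be sketched end to end as follows; the bilinear form D_v W_i D_v^T, the column-softmax axis, and single-layer MLPs for ψ and φ are assumptions consistent with the text, not the patented configuration.

import torch
import torch.nn as nn

class DiversifiedScorer(nn.Module):
    # z trainable l*l matrices give z document-document similarity matrices;
    # the scoring document's z similarity vectors are aggregated by psi with
    # tanh into xi (z novelty features), and phi maps xi to S_div.
    def __init__(self, dim: int = 64, z: int = 8, m: int = 50):
        super().__init__()
        self.W = nn.Parameter(torch.randn(z, dim, dim) * 0.01)
        self.psi = nn.Linear(m, 1)  # aggregation MLP psi, one score per aspect
        self.phi = nn.Linear(z, 1)  # final MLP phi over the z features

    def forward(self, D_v: torch.Tensor, idx: int) -> torch.Tensor:
        # D_v: [m, dim] candidate document expressions; idx = index(d)
        sim = torch.matmul(torch.matmul(D_v, self.W), D_v.transpose(0, 1))
        S = torch.softmax(sim, dim=-2)            # [z, m, m], column softmax
        s = S[:, idx, :]                          # the z similarity vectors of d
        xi = torch.tanh(self.psi(s)).squeeze(-1)  # xi in R^z
        return self.phi(xi).squeeze(-1)           # scalar S_div(d|D)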
Weight calculation
As introduced above, an obvious idea is that the weights of the diversified score and the personalized score should be related to the degree of matching between the current query and the user history. In our model we adopt a relatively simple approach: the weight λ(q,U) is assigned by calculating the cosine similarity between the current query vector and the user historical intent vector. In fact, if necessary, some learnable parameters could be added here to further improve model performance.
model training
For the ideas above, we propose two different ways to train our model.

In the first way, we regard the user's satisfied clicks as a whole: the documents the user clicked with satisfaction are taken as positive examples and the rest as negative examples, and the LambdaRank loss function is used directly to train the model.
Let (d_i, d_j) be a document pair and Δ the difference in the evaluation metric between the two. If d_i should rank above d_j in the ideal ordering, then p_ij = 1, otherwise p_ij = 0. For convenience we abbreviate f(d_i|q,I,D,D_s) as f(d_i). The loss function is as follows:

p̂_ij = σ(f(d_i) − f(d_j)),  L_unified = − Σ_{(d_i,d_j)∈S} |Δ| · (p_ij · log p̂_ij + (1 − p_ij) · log(1 − p̂_ij))

where p̂_ij is the model-predicted probability that d_i is better than d_j, S is the set of all pairs (d_i, d_j) used for training, and L_unified is the overall loss of the first method.
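A sketch of this pairwise loss; the |Δ|-weighted logistic form is our reading of the LambdaRank-style objective, with the pairs and metric deltas supplied by the caller.

import torch

def lambdarank_loss(scores: torch.Tensor, pairs, deltas) -> torch.Tensor:
    # scores: [m] model scores f(d); pairs: iterable of (i, j) where d_i is
    # ideally ranked above d_j (p_ij = 1); deltas: metric difference per pair
    loss = scores.new_zeros(())
    for (i, j), delta in zip(pairs, deltas):
        p_hat = torch.sigmoid(scores[i] - scores[j])  # predicted P(d_i > d_j)
        loss = loss - abs(delta) * torch.log(p_hat + 1e-10)
    return loss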
in the second way, since in the model we score the document from both diversification and personalization, the clicking behavior of the user may also be seen as a combination of whether or not the personalized interests are met and whether or not novel. Considering the clicking behavior of the user as a whole may be disadvantageous to training of the model, we need to split it into personalized and diversified causes, and if we can take them as labels we can use a multitasking training way:
L separate =L per +α*L div
in our model, we treat documents that have not been clicked on by the user historically but are very similar to click documents as pseudo-click documents. In personalized loss L per These documents are also considered positive examples. In diversification loss, we consider the reason why this part of the document is not clicked is because diversification is willing, and thus consider it as a negative example.
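The second training mode then combines two such pairwise losses; the sketch below assumes the pair sets are built beforehand, with pseudo-click documents entering the personalized pairs as positives and the diversified pairs as negatives, as described above.

import torch

def pairwise_loss(scores, pairs, deltas):
    # |delta|-weighted pairwise logistic loss, as in the first training mode
    loss = scores.new_zeros(())
    for (i, j), delta in zip(pairs, deltas):
        loss = loss - abs(delta) * torch.log(
            torch.sigmoid(scores[i] - scores[j]) + 1e-10)
    return loss

def separate_loss(per_scores, div_scores, per_pairs, div_pairs,
                  per_deltas, div_deltas, alpha: float = 0.5):
    # L_separate = L_per + alpha * L_div; example construction per the table below
    return (pairwise_loss(per_scores, per_pairs, per_deltas)
            + alpha * pairwise_loss(div_scores, div_pairs, div_deltas))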
In these losses, the positive and negative examples are constructed as follows (✓ denotes a positive example, × a negative example, and ○ neither a positive nor a negative example):

Loss      | Clicked documents | Pseudo-click documents | Remaining documents
L_unified | ✓                 | ×                      | ×
L_per     | ✓                 | ✓                      | ×
L_div     | ✓                 | ×                      | ○

Claims (5)

1. A retrieval method integrating personalized search and diversification of search results, characterized in that: a personalized score and a diversified score are calculated for each document from the two angles of meeting the user's personalized demands and the novelty of the document, and the scores are applied to re-rank the documents returned by a search engine so that the returned documents better meet the user's demands;
specifically, assume that for query q the corresponding candidate document set is D and the corresponding user is U; the history H of user U can be divided into two parts, a short-term history H_s = {(q_1^s, D_1^s), ..., (q_|s|^s, D_|s|^s)} and a long-term history H_l = {(q_1^l, D_1^l), ..., (q_|l|^l, D_|l|^l)}; the short-term history H_s is the search history in the same session as the current query, where q_i^s denotes the i-th short-term history query and D_i^s the candidate documents corresponding to it; the long-term history H_l is the search history in all sessions before the current session, where q_i^l denotes the i-th long-term history query and D_i^l the corresponding candidate documents; all candidate document sets are defined to have the same size m as D; the diversified score and the personalized score are considered jointly when scoring a document, and the final document score is integrated using the degree of matching between the current query and the user history as a weight; the overall scoring formula is:

f(d|q,U,D) = λ(q,U) · S_per(d|q,U) + (1 − λ(q,U)) · S_div(d|D)

where S_per is the personalized scoring model, S_div is the diversified scoring model, and λ(q,U) is the weight calculation that considers the degree of matching between the query and the history;

for the representation of documents and queries, the traditional word2vec method is first used to obtain the embeddings of all words in the user history; for the current query and the document being scored, the original expressions q_init, d_init are obtained by summing all word embeddings, and the interactive expressions q_int, d_int are obtained by interacting through a transformer structure and then summing:

q_init = Σ_i w_i^q,  d_init = Σ_j w_j^d;  [ŵ^q; ŵ^d] = Trm([w^q; w^d]),  q_int = Σ_i ŵ_i^q,  d_int = Σ_j ŵ_j^d

where w_i denotes the embedding of each word in the query or document and Trm denotes a transformer;
a transformer-based structure is designed to obtain a comprehensive representation of the corresponding query and candidate documents; let the query be q and a document be d; all words in the query and all words in the candidate documents are concatenated into one long sentence, which interacts through a transformer structure; the process can be expressed as:

[T_q^v; T_d^v] = Trm_word([T_q; T_d])

where T_q and T_d are the word sequences consisting of all words of the query and the documents; the word expressions corresponding to the query and to each document are then summed to obtain the vector-level document and query expressions q_w, d_w:

q_w = Σ_i T_q^v[i],  d_w = Σ_j T_d^v[j]

a transformer that takes click and position information into account is then used to let the resulting document expressions d_w interact, further optimizing the expressions of the documents:

D_v = Trm_doc(D_w + D_pos + D_clk)

where D_w is the document expression matrix obtained from the word-level transformer, D_pos and D_clk are learnable parameters representing the position information and the click information respectively, and D_v is the final document expression;
training of the model takes place in two ways.
2. The retrieval method integrating personalized search and diversification of search results according to claim 1, characterized in that: the personalized scoring model obtains the history vector representing a query by summing the expressions of the query and its documents, where the i-th query of the short-term history gives:

h_i^s = q_i^{s,w} + Σ_k d_{i,k}^{s,v}

and the i-th query of the long-term history gives:

h_i^l = q_i^{l,w} + Σ_k d_{i,k}^{l,v}

where q_i^w and d_{i,k}^v denote the query and document expressions respectively; this yields the history vector corresponding to each historical query and the history vector sequences H_s and H_l corresponding to the short-term and long-term history;
for the user historical intent vectors, a hierarchical transformer structure is adopted: a [CLS] label is appended to the short-term history vector sequence, the sequence is passed through a transformer that takes position information into account, and the last vector, i.e. the vector at the [CLS] position, is taken as the user's short-term historical intent vector u_s: u_s = Trm_short([H_s, [CLS]] + [H_s, [CLS]]_pos)[|s|+1]; the short-term historical intent vector u_s is appended to the long-term history vector sequence, and the user's long-term historical intent vector u_l is obtained in the same way by a transformer that takes position information into account: u_l = Trm_long([H_l, u_s] + [H_l, u_s]_pos)[|l|+1];

the obtained historical intent vectors and the query vector are then fused through several gate mechanisms:

gate(x,y) = z·x + (1−z)·y;  z = σ(MLP([x;y]))
u_f = gate(u_s, u_l);  q_s = gate(u_s, q_int)
q_l = gate(u_l, q_int);  q_f = gate(u_f, q_int)

where u_f is the final user historical intent vector and q_s, q_l, q_f are the query vectors fused with the user's short-term historical intent, long-term historical intent, and final historical intent respectively; vector-level similarities are obtained by computing cosine similarity, and KNRM is further used to compute the interaction-level similarities of q_init, d_init and q_int, d_int, s_I(x,y) = KNRM(x,y); finally all the similarity features are concatenated and passed through an MLP, and the resulting scores are combined to obtain the personalized score of the document;
3. The retrieval method integrating personalized search and diversification of search results according to claim 2, characterized in that: the diversified scoring model is implemented as follows: first, z trainable weight matrices W_i ∈ R^{l×l} are designed, where l is the vector length; through one of the weight matrices W_i, a similarity matrix between the documents is calculated as follows:

S_i = softmax(D_v W_i D_v^T)

where the softmax function is performed over the columns; with z trainable weight matrices, z similarity matrices are obtained, forming a tensor:

S_{[1:z]} = [S_1, ..., S_z]

the z similarity vectors corresponding to the scoring document d are extracted from the similarity tensor, s_{[1:z]} = S_{[1:z]}[index(d)]; the novelty scores of the z aspects are then extracted through an MLP layer acting as an aggregation function, with the tanh function as activation so as to align with the cosine similarities in the personalized scoring module:

ξ = tanh(ψ(s_{[1:z]}))

where ξ ∈ R^z is the novelty feature vector of the document in z dimensions and ψ is the MLP layer; ξ is then passed through another MLP layer to obtain the final diversified score of the document:

S_div(d|D) = φ(ξ)

where φ is the MLP layer;

the weight λ(q,U) is calculated as the cosine similarity between the current query vector and the user historical intent vector;
4. The retrieval method integrating personalized search and diversification of search results according to claim 3, characterized in that: the two training modes are specifically as follows:

in the first mode, the user's satisfied clicks are regarded as a whole: the documents the user clicked with satisfaction are taken as positive examples and the remaining documents as negative examples, and the LambdaRank loss function is used directly to train the model;

let (d_i, d_j) be a document pair and Δ the difference in the evaluation metric between the two; if d_i should rank above d_j in the ideal ordering, then p_ij = 1, otherwise p_ij = 0; f(d_i|q,I,D,D_s) is abbreviated as f(d_i); the loss function is as follows:

p̂_ij = σ(f(d_i) − f(d_j)),  L_unified = − Σ_{(d_i,d_j)∈S} |Δ| · (p_ij · log p̂_ij + (1 − p_ij) · log(1 − p̂_ij))

where p̂_ij can be regarded as the model-predicted probability that d_i is better than d_j, S is the set of all pairs (d_i, d_j) used for training, and L_unified is the overall loss of the first mode;

in the second mode, since regarding the clicking behavior of the user as a whole may be detrimental to the training of the model, it is split into personalized and diversified causes:

L_separate = L_per + α · L_div

where p̂_ij^per can be regarded as the model-predicted probability that d_i is better than d_j in the personalized aspect, and p̂_ij^div as the model-predicted probability that d_i is better than d_j in the diversified aspect; S_per is the set of all pairs (d_i, d_j) used for personalized training, and S_div is the set of all pairs used for diversified training; L_per and L_div can be regarded as the losses attributed to personalization and diversification respectively, each taking the same form as L_unified over its own pair set; L_separate is the overall loss of the second mode, and α is the hyperparameter combining the two losses;

a document that was not clicked by the user historically but is very similar to a clicked document is regarded as a pseudo-click document; in the personalized loss L_per these documents are also treated as positive examples.
5. The retrieval method integrating personalized search and diversification of search results according to claim 4, characterized in that: α is taken as 0.5.
CN202111327439.1A 2021-11-10 2021-11-10 Retrieval method integrating personalized search and diversification of search results Active CN114021019B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111327439.1A CN114021019B (en) 2021-11-10 2021-11-10 Retrieval method integrating personalized search and diversification of search results


Publications (2)

Publication Number Publication Date
CN114021019A CN114021019A (en) 2022-02-08
CN114021019B true CN114021019B (en) 2024-03-29

Family

ID=80063423

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111327439.1A Active CN114021019B (en) 2021-11-10 2021-11-10 Retrieval method integrating personalized search and diversification of search results

Country Status (1)

Country Link
CN (1) CN114021019B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7761464B2 (en) * 2006-06-19 2010-07-20 Microsoft Corporation Diversifying search results for improved search and personalization

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102779193A (en) * 2012-07-16 2012-11-14 哈尔滨工业大学 Self-adaptive personalized information retrieval system and method
CN103020164A (en) * 2012-11-26 2013-04-03 华北电力大学 Semantic search method based on multi-semantic analysis and personalized sequencing
WO2018040503A1 (en) * 2016-08-30 2018-03-08 北京百度网讯科技有限公司 Method and system for obtaining search results

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A Personalized Retrieval Model Combining Long-Term and Short-Term Interests; 王晓春; 李生; 杨沐昀; 赵铁军; Journal of Chinese Information Processing; 2016-05-15 (No. 03); full text *
Query Recommendation Based on Collaborative Similarity Computation; 石雁; 李朝锋; Computer Engineering; 2016-08-15 (No. 08); full text *
Dynamic Personalized Search Algorithm Based on Recurrent Neural Network and Attention Mechanism; 周雨佳; 窦志成; 葛松玮; 文继荣; Chinese Journal of Computers; 2020-12-31 (No. 05); full text *

Also Published As

Publication number Publication date
CN114021019A (en) 2022-02-08

Similar Documents

Publication Publication Date Title
Shi et al. Functional and contextual attention-based LSTM for service recommendation in mashup creation
Yan et al. Cross-modality bridging and knowledge transferring for image understanding
CN108932342A (en) A kind of method of semantic matches, the learning method of model and server
Zhu et al. Deep learning on information retrieval and its applications
Cao et al. Web services classification with topical attention based Bi-LSTM
Zhang et al. A deep joint network for session-based news recommendations with contextual augmentation
Ma et al. HAN-ReGRU: hierarchical attention network with residual gated recurrent unit for emotion recognition in conversation
Pandey et al. Natural language generation using sequential models: a survey
Parvin et al. Transformer-based local-global guidance for image captioning
Zhang et al. P2V: large-scale academic paper embedding
Rathi et al. The importance of Term Weighting in semantic understanding of text: A review of techniques
Khatter et al. Content curation algorithm on blog posts using hybrid computing
Liu et al. Deep bi-directional interaction network for sentence matching
Meddeb et al. Arabic text documents recommendation using joint deep representations learning
He et al. Conversation and recommendation: knowledge-enhanced personalized dialog system
Chawla Application of convolution neural network in web query session mining for personalised web search
Bensalah et al. Combining word and character embeddings for Arabic chatbots
Alatrash et al. Fine-grained sentiment-enhanced collaborative filtering-based hybrid recommender system
Lakizadeh et al. Text sentiment classification based on separate embedding of aspect and context
CN114021019B (en) Retrieval method integrating personalized search and diversification of search results
Bender et al. Unsupervised Estimation of Subjective Content Descriptions in an Information System.
Ni et al. An Improved Sequential Recommendation Algorithm based on Short‐Sequence Enhancement and Temporal Self‐Attention Mechanism
Liu et al. Fusing various document representations for comparative text identification from product reviews
Wang et al. A Sequential Recommendation Model for Balancing Long-and Short-Term Benefits
Qiu et al. Text-aware recommendation model based on multi-attention neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant