CN109063203B - Query term expansion method based on personalized model - Google Patents

Publication number: CN109063203B (application CN201811073589.2A; other version: CN109063203A)
Authority: CN (China)
Legal status: Active (granted)
Inventors: 勾智楠, 韩立新, 袁宝华, 刘元珍, 张青
Assignee: Hohai University (HHU)
Filed 2018-09-14 by Hohai University (HHU); application granted as CN109063203B.
Classification: Information Retrieval, DB Structures and FS Structures Therefor
Abstract

The invention discloses a query term expansion method based on a personalized model. First, a personalized user model is generated from the tags a user attaches to resources and the scores the user gives those resources. Next, the words of a corpus are trained with a deep-learning method to obtain corpus word vectors, mapping each word to a high-dimensional vector. Using the trained word vectors, the P words most similar to each word in the query are computed as candidate expansion words. Finally, the candidates are ranked in descending order in combination with the user's personalized model, and the top Q words are taken as the expansion words. By incorporating the personalized model, the invention mines expansion words more effectively and improves retrieval precision.

Description

Query term expansion method based on personalized model
Technical Field
The invention belongs to the technical field of information retrieval and natural language understanding and processing, and particularly relates to a query term expansion method based on a personalized model.
Background
Even after decades of development in information retrieval, the query keywords supplied by the user remain the main basis for retrieval, but the small amount of information they carry makes retrieval results insufficiently accurate. This shows in two main ways: 1) when searching in a complex professional field, the query keywords are often ambiguous because the user's domain knowledge is limited; 2) because users express themselves in different ways, keyword-based retrieval often fails to find related content.
Currently, personalized search has become an important way for users to find the information they need. It requires building a personalized user model and continuously tracking the user's information needs. The personalized model can be built mainly for an individual user, or modeled for a collaborative group. Meanwhile, with the wide adoption of social annotation systems, users' tagging and scoring behavior can readily be used to construct the personalized model. Personalized search still faces the same problem, however: query terms carry little information, which makes retrieval results less accurate.
To address these problems, researchers at home and abroad have proposed a series of personalized result re-ranking methods and achieved some success. However effective re-ranking may be, though, it brings no qualitative improvement if the relevant results are not captured in the first round of search.
Query expansion can, to some extent, compensate for the limited information in the keywords the user provides: starting from the original keywords, various methods enrich the query, and retrieval is then performed with these more informative query terms. Researchers at home and abroad have carried out a series of effective studies on query expansion and proposed some inspiring methods and techniques, including expanding query terms with tag relationships, word co-occurrence relationships, and semantic matching based on topic models.
Each of these methods has drawbacks, however. Tags do not necessarily reflect resource content well, so query expansion based on tag relationships is not very effective. Word co-occurrence methods ignore the semantic relationships in the personalized model. And although topic-model methods based on semantic matching improve retrieval, the improvement is limited because no personalized model is considered.
Disclosure of Invention
To solve the above technical problems in the background art, the invention aims to provide a query term expansion method based on a personalized model that, by incorporating the personalized model, mines expansion words more effectively and improves retrieval precision.
In order to achieve the technical purpose, the technical scheme of the invention is as follows:
A query term expansion method based on a personalized model comprises the following steps:
(1) constructing a multi-tag user personalized model, in which the weight of each tag reflects how much the user likes or dislikes that tag;
(2) training word vectors on a large corpus with a deep-learning word-vector training method, generating the corpus word vectors, denoted Corpus;
(3) when a user initiates a query, preprocessing the query sentence to generate a query word vector;
(4) calculating, in turn, the similarity between each word in Corpus and the query word vector, sorting the words in descending order of similarity, and selecting the first P words as the candidate expansion word set;
(5) for each word in the candidate expansion word set, calculating its similarity to the user personalized model constructed in step (1), and from this similarity calculating each candidate expansion word's final ranking value;
(6) sorting the candidate expansion words in descending order of final ranking value and selecting the first Q words as the expansion words.
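The six steps above can be sketched end to end in code. This is a toy illustration only, not the patented implementation: the 2-d vectors, the cosine choice of sim(·), the vector stand-in for the user personalized model, and the linear γ blend in step (5) are all assumptions made for the sketch.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def expand_query(query_weights, corpus_vecs, user_vec, P=5, Q=2, gamma=0.5):
    """Toy end-to-end sketch of steps (3)-(6).

    query_weights: {query_word: weight}, corpus_vecs: {word: vector},
    user_vec: a vector standing in for the user personalized model.
    """
    # Step (4): score every corpus word against the weighted query words.
    scores = {}
    for word, vec in corpus_vecs.items():
        scores[word] = sum(w * cosine(corpus_vecs[q], vec)
                           for q, w in query_weights.items()
                           if q in corpus_vecs)
    candidates = sorted(scores, key=scores.get, reverse=True)[:P]
    # Steps (5)-(6): blend query similarity with user-model similarity
    # via a hyper-parameter gamma, then keep the top Q words.
    rank = {c: gamma * scores[c] + (1 - gamma) * cosine(corpus_vecs[c], user_vec)
            for c in candidates}
    return sorted(rank, key=rank.get, reverse=True)[:Q]
```

A real system would exclude the original query words from the candidates and use high-dimensional trained vectors; the sketch keeps them to stay short.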
Further, in step (1), the multi-tag user personalized model is as follows:
User_i = (t_{i,1}:v_{i,1}, t_{i,2}:v_{i,2}, ..., t_{i,n}:v_{i,n})
In the above formula, User_i is the personalized model of user i, t_{i,k} is the k-th tag in User_i, v_{i,k} is the weight of t_{i,k}, k = 1, 2, ..., n, and n is the total number of tags in User_i.
Further, the tag weight v_{i,k} is calculated as follows:
[Formula given as an image in the original; not reproduced here.]
In the above formula, α and β are hyper-parameters (α ∈ [0,1]); F_{i,l}, F_{i,h}, and F_i are, in order, user i's low-score tag set, high-score tag set, and unscored tag set; L_{i,k}, H_{i,k}, and T_{i,k} are, in order, the occurrences of the k-th tag of user i in the low-score, high-score, and unscored tag sets; |·| denotes the number of elements in a set; and θ_{i,k} ∈ [−1,1] is the degree to which user i likes or dislikes the tag, a larger value indicating a stronger preference.
Further, θ_{i,k} is calculated as follows:
[Formula given as an image in the original; not reproduced here.]
In the above formula, r_{i,k} is user i's score for a resource carrying tag t_{i,k}; avg_i, max_i, and min_i are, in order, the average, highest, and lowest scores given by user i. When r_{i,k} ≥ max_i or max_i > r_{i,k} ≥ avg_i, θ_{i,k} ∈ [0,1], indicating that user i likes this resource and the corresponding tag t_{i,k}; when avg_i > r_{i,k} > min_i or r_{i,k} ≤ min_i, θ_{i,k} ∈ [−1,0), indicating that user i dislikes this resource and the corresponding tag t_{i,k}.
Further, in step (3), the generated query word vector is as follows:
q⃗ = (q_1:w_1, q_2:w_2, ..., q_m:w_m)
In the above formula, q⃗ is the query word vector, q_j is one of the query words, j = 1, 2, ..., m, m is the number of query words in the query word vector, and w_j is the weight of q_j, calculated as follows:
w_j = |q_j| / Σ_{k=1}^{m} |q_k|
In the above formula, |q_j| is the number of times query word q_j occurs in the query word vector, and the denominator is the total number of occurrences of all query words in q⃗.
Further, in step (4), the similarity between query word q_j in the query word vector q⃗ and the t-th word Corpus_t in Corpus is calculated as follows:
[Formula given as an image in the original; not reproduced here.]
In step (5), the similarity between the p-th word Term_p in the candidate expansion word set Term and the personalized model User_i of user i is calculated as follows:
[Formula given as an image in the original; not reproduced here.]
In the above formulas, sim(·) denotes the similarity calculation function.
Further, in step (5), the final ranking value Rank_p of the p-th word Term_p in the candidate expansion word set Term is calculated as follows:
[Formula given as an image in the original; not reproduced here.]
In the above formula, γ is a hyper-parameter with γ ∈ [0,1].
Further, in step (3), the preprocessing includes punctuation removal, word segmentation, stop-word removal, and stemming.
The above technical scheme brings the following beneficial effects:
The method combines the personalized model with query expansion, so the expansion better matches the individual user's interests and improves query precision. The method adopts the deep-learning word-vector training approach proposed by Google, which aims to capture semantic relations among words and outperforms traditional statistical approaches such as word co-occurrence. The method is highly general: it suits query term expansion for all kinds of personalized search, extends well, and can be applied directly in a search-engine system.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The technical scheme of the invention is explained in detail below with reference to the accompanying drawings.
The invention provides a query term expansion method based on a personalized model; the specific steps are shown in FIG. 1.
Step 1, construct a multi-tag user personalized model, in which the weight of each tag reflects how much the user likes or dislikes that tag.
The preferred scheme of this step is that the multi-tag user personalized model is as follows:
User_i = (t_{i,1}:v_{i,1}, t_{i,2}:v_{i,2}, ..., t_{i,n}:v_{i,n})
In the above formula, User_i is the personalized model of user i, t_{i,k} is the k-th tag in User_i, v_{i,k} is the weight of t_{i,k}, k = 1, 2, ..., n, and n is the total number of tags in User_i.
The tag weight v_{i,k} is calculated as follows:
[Formula given as an image in the original; not reproduced here.]
In the above formula, α and β are hyper-parameters (α ∈ [0,1]); F_{i,l}, F_{i,h}, and F_i are, in order, user i's low-score tag set, high-score tag set, and unscored tag set; L_{i,k}, H_{i,k}, and T_{i,k} are, in order, the occurrences of the k-th tag of user i in the low-score, high-score, and unscored tag sets; |·| denotes the number of elements in a set; and θ_{i,k} ∈ [−1,1] is the degree to which user i likes or dislikes the tag, a larger value indicating a stronger preference.
θ_{i,k} is calculated as follows:
[Formula given as an image in the original; not reproduced here.]
In the above formula, r_{i,k} is user i's score for a resource carrying tag t_{i,k}; avg_i, max_i, and min_i are, in order, the average, highest, and lowest scores given by user i. When r_{i,k} ≥ max_i or max_i > r_{i,k} ≥ avg_i, θ_{i,k} ∈ [0,1], indicating that user i likes this resource and the corresponding tag t_{i,k}; when avg_i > r_{i,k} > min_i or r_{i,k} ≤ min_i, θ_{i,k} ∈ [−1,0), indicating that user i dislikes this resource and the corresponding tag t_{i,k}.
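The formula image for θ_{i,k} is not reproduced in this record. The sketch below is therefore a hypothetical reconstruction: a piecewise normalization that merely matches the stated behavior (scores at or above the user's average map into [0, 1], scores below it into [−1, 0)), not the patent's actual formula.

```python
def theta(r, avg, mx, mn):
    """Hypothetical reconstruction of the like/dislike degree theta_{i,k}.

    r: the user's score for the resource; avg/mx/mn: the user's average,
    highest, and lowest scores. Scores at or above the average normalize
    into [0, 1] (like); scores below it into [-1, 0) (dislike).
    """
    if r >= avg:
        return (r - avg) / (mx - avg) if mx > avg else 1.0
    return (r - avg) / (avg - mn) if avg > mn else -1.0
```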
Step 2, train word vectors on a large corpus with a deep-learning word-vector training method to generate the corpus word vectors, denoted Corpus.
For example, the Skip-Gram model is a commonly used training model that the invention can use (but is not limited to). Its training principle is to predict the surrounding words from the current word. After training, each word is assigned a fixed-length vector representation; the length can be chosen freely, for example 300 dimensions. The larger the corpus, the better the training effect, so a very large corpus should be provided for training the word vectors.
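To make the Skip-Gram principle concrete, the snippet below generates the (current word, surrounding word) training pairs that the model learns from; the window size is a free parameter. This only illustrates pair extraction, not the full training procedure.

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs as used by Skip-Gram:
    each current word is paired with the words around it within the
    window, and the model then learns to predict context from center."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs
```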
Step 3, when the user initiates a query, preprocess the query sentence to generate the query word vector.
The preferable scheme of this step is that the generated query word vector is as follows:
q⃗ = (q_1:w_1, q_2:w_2, ..., q_m:w_m)
In the above formula, q⃗ is the query word vector, q_j is one of the query words, j = 1, 2, ..., m, m is the number of query words in the query word vector, and w_j is the weight of q_j, calculated as follows:
w_j = |q_j| / Σ_{k=1}^{m} |q_k|
In the above formula, |q_j| is the number of times query word q_j occurs in the query word vector, and the denominator is the total number of occurrences of all query words in q⃗.
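The weight computation just described — each distinct query word weighted by its share of all query-word occurrences — is simple enough to state directly in code:

```python
from collections import Counter

def query_weights(tokens):
    """Weight each distinct query word q_j by its share of all
    query-word occurrences: w_j = |q_j| / sum_k |q_k|."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {q: c / total for q, c in counts.items()}
```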
The preprocessing includes punctuation removal, word segmentation, stop-word removal, stemming, and the like.
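A minimal sketch of this preprocessing chain for an English query follows. The stop-word list and the crude suffix-stripping stemmer are illustrative stand-ins: a real system would use a full stop-word list, a proper stemmer such as Porter's, and, for Chinese queries, a word-segmentation tool.

```python
import re

STOPWORDS = {"the", "a", "of", "in"}  # illustrative stop-word list

def preprocess(sentence):
    """Sketch of the preprocessing steps named in the text: punctuation
    removal, word segmentation (plain whitespace tokenization here),
    stop-word removal, and stemming (a crude '-ing' suffix strip
    standing in for a real stemmer)."""
    sentence = re.sub(r"[^\w\s]", " ", sentence.lower())
    tokens = [t for t in sentence.split() if t not in STOPWORDS]
    return [t[:-3] if t.endswith("ing") else t for t in tokens]
```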
Step 4, calculate in turn the similarity between each word in Corpus and the query word vector, sort the words in descending order of similarity, and select the first P words as the candidate expansion word set.
The preferred scheme of this step is that the similarity between query word q_j in the query word vector q⃗ and the t-th word Corpus_t in Corpus is calculated as follows:
[Formula given as an image in the original; not reproduced here.]
In the above formula, sim(·) denotes the similarity calculation function.
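In practice, sim(·) over word vectors is typically cosine similarity. The sketch below implements step 4's candidate selection under that assumption; since the patent's combination formula is only available as an image, the exact aggregation may differ.

```python
import numpy as np

def sim(a, b):
    """Cosine similarity, a common choice of sim(*) for word vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_p(query_vec, corpus_vecs, P):
    """Step 4 sketch: rank corpus words by similarity to the query
    vector and keep the first P as candidate expansion words."""
    ranked = sorted(corpus_vecs,
                    key=lambda w: sim(query_vec, corpus_vecs[w]),
                    reverse=True)
    return ranked[:P]
```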
Step 5, for each word in the candidate expansion word set, calculate its similarity to the user personalized model constructed in step 1, and from this similarity calculate each candidate expansion word's final ranking value.
The preferred scheme of this step is that the similarity between the p-th word Term_p in the candidate expansion word set Term and the personalized model User_i of user i is calculated as follows:
[Formula given as an image in the original; not reproduced here.]
In the above formula, sim(·) denotes the similarity calculation function.
The final ranking value Rank_p of the p-th word Term_p in the candidate expansion word set Term is calculated as follows:
[Formula given as an image in the original; not reproduced here.]
In the above formula, γ is a hyper-parameter with γ ∈ [0,1].
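The Rank_p formula itself is only available as an image. A plausible reading consistent with the text — a single hyper-parameter γ ∈ [0,1] blending query similarity with user-model similarity — is the linear interpolation below, offered as an assumption rather than the patent's exact formula.

```python
def final_rank(query_sim, user_sim, gamma=0.5):
    """Hypothetical reconstruction of Rank_p: linearly interpolate
    between the candidate's similarity to the query (query_sim) and
    its similarity to the user personalized model (user_sim), with
    the hyper-parameter gamma in [0, 1] controlling the trade-off."""
    return gamma * query_sim + (1 - gamma) * user_sim
```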
Step 6, sort the candidate expansion words in descending order of final ranking value and select the first Q words as the expansion words.
The embodiments are only intended to illustrate the technical idea of the invention and do not limit its scope of protection; any modification made on the basis of the technical scheme according to the technical idea of the invention falls within the scope of protection of the invention.

Claims (6)

1. A query term expansion method based on a personalized model, characterized by comprising the following steps:
(1) constructing a multi-tag user personalized model, in which the weight of each tag reflects how much the user likes or dislikes that tag;
the multi-tag user personalized model is as follows:
User_i = (t_{i,1}:v_{i,1}, t_{i,2}:v_{i,2}, ..., t_{i,n}:v_{i,n})
In the above formula, User_i is the personalized model of user i, t_{i,k} is the k-th tag in User_i, v_{i,k} is the weight of t_{i,k}, k = 1, 2, ..., n, and n is the total number of tags in User_i;
the tag weight v_{i,k} is calculated as follows:
[Formula given as an image in the original; not reproduced here.]
In the above formula, α and β are hyper-parameters (α ∈ [0,1]); F_{i,l}, F_{i,h}, and F_i are, in order, user i's low-score tag set, high-score tag set, and unscored tag set; L_{i,k}, H_{i,k}, and T_{i,k} are, in order, the occurrences of the k-th tag of user i in the low-score, high-score, and unscored tag sets; |·| denotes the number of elements in a set; and θ_{i,k} ∈ [−1,1] is the degree to which user i likes or dislikes the tag, a larger value indicating a stronger preference;
(2) training word vectors on a large corpus with a deep-learning word-vector training method, generating the corpus word vectors, denoted Corpus;
(3) when a user initiates a query, preprocessing the query sentence to generate a query word vector;
(4) calculating, in turn, the similarity between each word in Corpus and the query word vector, sorting the words in descending order of similarity, and selecting the first P words as the candidate expansion word set;
(5) for each word in the candidate expansion word set, calculating its similarity to the user personalized model constructed in step (1), and from this similarity calculating each candidate expansion word's final ranking value;
(6) sorting the candidate expansion words in descending order of final ranking value and selecting the first Q words as the expansion words.
2. The personalized-model-based query term expansion method of claim 1, wherein θ_{i,k} is calculated as follows:
[Formula given as an image in the original; not reproduced here.]
In the above formula, r_{i,k} is user i's score for a resource carrying tag t_{i,k}; avg_i, max_i, and min_i are, in order, the average, highest, and lowest scores given by user i; when r_{i,k} ≥ max_i or max_i > r_{i,k} ≥ avg_i, θ_{i,k} ∈ [0,1], indicating that user i likes this resource and the corresponding tag t_{i,k}; when avg_i > r_{i,k} > min_i or r_{i,k} ≤ min_i, θ_{i,k} ∈ [−1,0), indicating that user i dislikes this resource and the corresponding tag t_{i,k}.
3. The personalized-model-based query term expansion method of claim 1 or 2, wherein in step (3) the generated query word vector is as follows:
q⃗ = (q_1:w_1, q_2:w_2, ..., q_m:w_m)
In the above formula, q⃗ is the query word vector, q_j is one of the query words, j = 1, 2, ..., m, m is the number of query words in the query word vector, and w_j is the weight of q_j, calculated as follows:
w_j = |q_j| / Σ_{k=1}^{m} |q_k|
In the above formula, |q_j| is the number of times query word q_j occurs in the query word vector, and the denominator is the total number of occurrences of all query words in q⃗.
4. The personalized-model-based query term expansion method of claim 3, wherein in step (4) the similarity between query word q_j in the query word vector q⃗ and the t-th word Corpus_t in Corpus is calculated as follows:
[Formula given as an image in the original; not reproduced here.]
In step (5), the similarity between the p-th word Term_p in the candidate expansion word set Term and the personalized model User_i of user i is calculated as follows:
[Formula given as an image in the original; not reproduced here.]
In the above formulas, sim(·) denotes the similarity calculation function.
5. The personalized-model-based query term expansion method of claim 4, wherein in step (5) the final ranking value Rank_p of the p-th word Term_p in the candidate expansion word set Term is calculated as follows:
[Formula given as an image in the original; not reproduced here.]
In the above formula, γ is a hyper-parameter with γ ∈ [0,1].
6. The personalized-model-based query term expansion method of claim 1, wherein in step (3) the preprocessing comprises punctuation removal, word segmentation, stop-word removal, and stemming.
CN201811073589.2A 2018-09-14 2018-09-14 Query term expansion method based on personalized model Active CN109063203B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811073589.2A CN109063203B (en) 2018-09-14 2018-09-14 Query term expansion method based on personalized model


Publications (2)

Publication Number Publication Date
CN109063203A CN109063203A (en) 2018-12-21
CN109063203B true CN109063203B (en) 2020-07-24

Family

ID=64761795


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102713905A * 2010-01-08 2012-10-03 Ericsson (Sweden) A method and apparatus for social tagging of media files
CN104866554A (en) * 2015-05-15 2015-08-26 大连理工大学 Personalized searching method and system on basis of social annotation
CN105183803A (en) * 2015-08-25 2015-12-23 天津大学 Personalized search method and search apparatus thereof in social network platform
CN106547864A (en) * 2016-10-24 2017-03-29 湖南科技大学 A kind of Personalized search based on query expansion
CN108491462A (en) * 2018-03-05 2018-09-04 昆明理工大学 A kind of semantic query expansion method and device based on word2vec


Also Published As

Publication number Publication date
CN109063203A (en) 2018-12-21


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant