CN109063203B - Query term expansion method based on personalized model - Google Patents
- Publication number: CN109063203B
- Application number: CN201811073589.2A
- Authority: CN (China)
- Prior art keywords: user, word, query, words, expansion
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Landscapes: Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a query term expansion method based on a personalized model. First, a personalized user model is generated from the tags the user has assigned to resources and the user's scores for those resources. Next, the words of a corpus are trained with a deep learning method to obtain the corpus word vectors, mapping each word to a high-dimensional vector. Using the trained word vectors, the P words most similar to each word in the query are selected as candidate expansion words. Finally, the candidates are sorted in descending order in combination with the user's personalized model, and the top Q words are taken as the expansion words. By incorporating the personalized model, the invention mines expansion words more effectively and improves retrieval precision.
Description
Technical Field
The invention belongs to the technical field of information retrieval and natural language understanding and processing, and particularly relates to a query term expansion method based on a personalized model.
Background
Although information retrieval technology has matured, the query keywords supplied by the user remain the main basis for retrieval. Because these keywords carry little information, retrieval results are often not accurate enough, for two main reasons: 1) when searching in a complex professional field, the query keywords are frequently ambiguous because the user's knowledge is limited; 2) because users express themselves differently, keyword-based retrieval often fails to retrieve relevant content.
Personalized search has become an important way for users to find the information they need. It requires building a personalized user model and continuously tracking the user's information needs. The personalized model can be built for an individual user or for a collaborative group. With the spread of social annotation systems, users' tagging and scoring behavior can be used to construct the personalized model. Personalized search, however, faces the same problem: the query terms carry little information, which makes the retrieval results less accurate.
To address these problems, domestic and international researchers have proposed a series of personalized result re-ranking methods, with some success. However, although re-ranking is effective, it brings no qualitative improvement if the relevant results are not retrieved in the first round of search.
Query expansion can compensate, to some extent, for the limited information in the user's keywords: starting from the original keywords, the query is expanded by various methods, and retrieval is then performed with the richer set of query words. Domestic and international researchers have carried out a series of effective studies on query expansion and proposed some instructive methods and techniques, including expanding the query words by using relationships among tags, word co-occurrence relationships, and semantic matching based on topic models.
The above methods, however, have drawbacks. Tags do not necessarily reflect the resource content well, so query expansion based on tag relationships is not very effective. Word co-occurrence methods ignore the semantic relationships in the personalized model. Topic-model methods based on semantic matching improve retrieval, but the improvement is limited because they do not take a personalized model into account.
Disclosure of Invention
In order to solve the technical problems described in the background art, the invention aims to provide a query term expansion method based on a personalized model, which mines expansion words more effectively and improves retrieval precision by incorporating the personalized model.
In order to achieve the technical purpose, the technical scheme of the invention is as follows:
A query term expansion method based on a personalized model comprises the following steps:
(1) constructing a multi-tag user personalized model, wherein the weight of each tag reflects how much the user likes or dislikes the tag;
(2) training word vectors on a large corpus using a deep learning word vector training method to generate the corpus word vectors Corpus;
(3) when a user initiates a query, preprocessing the query sentence to generate a query word vector;
(4) calculating, in turn, the similarity between each word in Corpus and the query word vector, sorting the words in descending order of similarity, and selecting the first P words as the candidate expansion word set;
(5) for each word in the candidate expansion word set, calculating its similarity to the user personalized model constructed in step (1), and calculating the final ranking value of each candidate expansion word from that similarity;
(6) sorting the candidate expansion words in descending order of final ranking value, and selecting the first Q words as the expansion words.
Further, in step (1), the multi-tag based user personalization model is as follows:
$User_i = (t_{i,1}:v_{i,1},\ t_{i,2}:v_{i,2},\ \ldots,\ t_{i,n}:v_{i,n})$
In the above formula, $User_i$ is the personalized model of user $i$, $t_{i,k}$ is the $k$-th tag in $User_i$, $v_{i,k}$ is the weight of $t_{i,k}$, $k = 1, 2, \ldots, n$, and $n$ is the total number of tags in $User_i$.
Further, the tag weight $v_{i,k}$ is calculated as follows:
[formula not reproduced in the source text]
In the above formula, $\alpha$ and $\beta$ are hyper-parameters, $\alpha \in [0,1]$; $F_{i,l}$, $F_{i,h}$ and $F_i$ are, in order, the low-score tag set, the high-score tag set and the no-score tag set labeled by user $i$; $L_{i,k}$, $H_{i,k}$ and $T_{i,k}$ are, in order, the low-score tag set, the high-score tag set and the no-score tag set for the $k$-th tag of user $i$; $|\cdot|$ denotes the number of elements in a set; and $\theta_{i,k}$ is the degree to which user $i$ likes or dislikes the tag, with $\theta_{i,k} \in [-1,1]$, where a larger value indicates that user $i$ likes the tag more.
Further, $\theta_{i,k}$ is calculated as follows:
[formula not reproduced in the source text]
In the above formula, $r_{i,k}$ is the score given by user $i$ to a resource with tag $t_{i,k}$, and $avg_i$, $max_i$ and $min_i$ are, in order, the average, highest and lowest scores given by user $i$. When $r_{i,k} \ge max_i$ or $max_i > r_{i,k} \ge avg_i$, $\theta_{i,k} \in [0,1]$, indicating that user $i$ likes this resource and the corresponding tag $t_{i,k}$; when $avg_i > r_{i,k} > min_i$ or $r_{i,k} \le min_i$, $\theta_{i,k} \in [-1,0)$, indicating that user $i$ dislikes this resource and the corresponding tag $t_{i,k}$.
Further, in step (3), the generated query word vector is as follows:
[formula not reproduced in the source text]
In the above formula, which defines the query word vector, $q_j$ represents one of the query words, $j = 1, 2, \ldots, m$, $m$ is the number of query words in the query word vector, and $w_j$ is the weight of $q_j$, calculated by the following formula:
[formula not reproduced in the source text]
In the above formula, $|q_j|$ is the number of times the query word $q_j$ occurs in the query word vector, and the other term is the sum of the occurrence counts of all query words in the query word vector.
Further, in step (4), the similarity between the query word $q_j$ in the query word vector and the $t$-th word $Corpus_t$ in Corpus is calculated as follows:
[formula not reproduced in the source text]
In step (5), the similarity between the $p$-th word $Term_p$ in the candidate expansion word set $Term$ and the personalized model $User_i$ of user $i$ is calculated as follows:
[formula not reproduced in the source text]
In the above formulas, $sim(\cdot)$ represents the similarity calculation function.
Further, in step (5), the final ranking value $Rank_p$ of the $p$-th word $Term_p$ in the candidate expansion word set $Term$ is calculated as follows:
[formula not reproduced in the source text]
In the above formula, $\gamma$ is a hyper-parameter, $\gamma \in [0,1]$.
Further, in step (3), the preprocessing includes punctuation removal, word segmentation, stop word removal and stemming.
The beneficial effects of the above technical scheme are as follows:
The method combines the personalized model with query expansion, so that the expanded query better matches the user's individual interests and query precision is improved. The method uses the deep learning word vector training method proposed by Google, which is designed to capture semantic relationships among words and outperforms traditional statistical approaches such as word co-occurrence. The method is general: it is applicable to query term expansion in various personalized searches, has good extensibility, and can be applied directly in a search engine system.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The technical scheme of the invention is explained in detail below with reference to the accompanying drawings.
The invention provides a query term expansion method based on a personalized model, which comprises the following specific steps as shown in figure 1.
Step 1, constructing a user personalized model based on multiple labels, wherein the weight of each label can reflect the like or dislike of the user to the label.
In the preferred scheme of this step, the multi-tag based user personalization model is as follows:
$User_i = (t_{i,1}:v_{i,1},\ t_{i,2}:v_{i,2},\ \ldots,\ t_{i,n}:v_{i,n})$
In the above formula, $User_i$ is the personalized model of user $i$, $t_{i,k}$ is the $k$-th tag in $User_i$, $v_{i,k}$ is the weight of $t_{i,k}$, $k = 1, 2, \ldots, n$, and $n$ is the total number of tags in $User_i$.
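As a minimal sketch, the model above can be held as a mapping from tag to weight. The tag names, the weight values and the helper `make_user_model` are invented for illustration; the patent does not prescribe a storage format.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical in-memory form of User_i: tag t_{i,k} -> weight v_{i,k}.
UserModel = Dict[str, float]

def make_user_model(tags_with_weights: Iterable[Tuple[str, float]]) -> UserModel:
    """Build User_i = (t_{i,1}:v_{i,1}, ..., t_{i,n}:v_{i,n}) from (tag, weight) pairs."""
    model: UserModel = {}
    for tag, weight in tags_with_weights:
        model[tag] = float(weight)
    return model

# Illustrative values only: positive weights stand for liked tags,
# negative weights for disliked ones.
user_i = make_user_model([("python", 0.8), ("jazz", 0.3), ("horror", -0.6)])
n = len(user_i)  # total number of tags in User_i
```

A dictionary keeps the tag-to-weight lookup constant-time, which is convenient when each candidate expansion word is later compared against every tag.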
The tag weight $v_{i,k}$ is calculated as follows:
[formula not reproduced in the source text]
In the above formula, $\alpha$ and $\beta$ are hyper-parameters, $\alpha \in [0,1]$; $F_{i,l}$, $F_{i,h}$ and $F_i$ are, in order, the low-score tag set, the high-score tag set and the no-score tag set labeled by user $i$; $L_{i,k}$, $H_{i,k}$ and $T_{i,k}$ are, in order, the low-score tag set, the high-score tag set and the no-score tag set for the $k$-th tag of user $i$; $|\cdot|$ denotes the number of elements in a set; and $\theta_{i,k}$ is the degree to which user $i$ likes or dislikes the tag, with $\theta_{i,k} \in [-1,1]$, where a larger value indicates that user $i$ likes the tag more.
$\theta_{i,k}$ is calculated as follows:
[formula not reproduced in the source text]
In the above formula, $r_{i,k}$ is the score given by user $i$ to a resource with tag $t_{i,k}$, and $avg_i$, $max_i$ and $min_i$ are, in order, the average, highest and lowest scores given by user $i$. When $r_{i,k} \ge max_i$ or $max_i > r_{i,k} \ge avg_i$, $\theta_{i,k} \in [0,1]$, indicating that user $i$ likes this resource and the corresponding tag $t_{i,k}$; when $avg_i > r_{i,k} > min_i$ or $r_{i,k} \le min_i$, $\theta_{i,k} \in [-1,0)$, indicating that user $i$ dislikes this resource and the corresponding tag $t_{i,k}$.
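The formula for the like/dislike degree is not reproduced in this text, so the following is only a hypothetical sketch: a piecewise-linear mapping chosen to satisfy the four score ranges and the value ranges stated above. The function name `theta` and the specific normalisation are assumptions, not the patent's formula.

```python
def theta(r: float, avg: float, max_: float, min_: float) -> float:
    """Map a score r to a like/dislike degree in [-1, 1].

    Assumed piecewise-linear form (NOT taken from the patent text):
      r >= max        ->  1.0
      max > r >= avg  ->  value in [0, 1)
      avg > r > min   ->  value in (-1, 0)
      r <= min        -> -1.0
    """
    if r >= max_:
        return 1.0
    if r >= avg:
        # Here max_ > r >= avg, so max_ - avg is strictly positive.
        return (r - avg) / (max_ - avg)
    if r > min_:
        # Here avg > r > min_, so avg - min_ is strictly positive.
        return (r - avg) / (avg - min_)
    return -1.0

# A user whose scores range from 1 to 5 with an average of 3.
degrees = [theta(r, avg=3.0, max_=5.0, min_=1.0) for r in (5, 4, 3, 2, 1)]
```

Scores at or above the user's own average map to non-negative values, so the same raw score can mean "liked" for a strict rater and "disliked" for a generous one.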
Step 2, training word vectors on a large corpus using a deep learning word vector training method to generate the corpus word vectors Corpus.
For example, the Skip-Gram model is a commonly used training model, and the invention can use (but is not limited to) it. Its training principle is to use the current word to predict the surrounding words. After training, each word is assigned a fixed-length vector representation whose length can be set freely, for example 300 dimensions. The larger the corpus, the better the training effect, so a very large corpus should be provided for training the word vectors.
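To illustrate the training principle, the sketch below generates the (current word, surrounding word) pairs that a Skip-Gram model learns to predict. It covers only the pair generation, not the neural network training; the function name `skip_gram_pairs` and the `window` parameter are illustrative assumptions.

```python
from typing import List, Tuple

def skip_gram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (center, context) pairs: each word paired with its neighbours
    within `window` positions on either side."""
    pairs: List[Tuple[str, str]] = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

pairs = skip_gram_pairs(["query", "term", "expansion"], window=1)
# pairs == [("query", "term"), ("term", "query"),
#           ("term", "expansion"), ("expansion", "term")]
```

In practice the vectors themselves would be fitted by an existing implementation, for example gensim's `Word2Vec` class with `sg=1`, rather than by hand.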
Step 3, the user initiates a query, and the query sentence is preprocessed to generate a query word vector.
In the preferred scheme of this step, the generated query word vector is as follows:
[formula not reproduced in the source text]
In the above formula, which defines the query word vector, $q_j$ represents one of the query words, $j = 1, 2, \ldots, m$, $m$ is the number of query words in the query word vector, and $w_j$ is the weight of $q_j$, calculated by the following formula:
[formula not reproduced in the source text]
In the above formula, $|q_j|$ is the number of times the query word $q_j$ occurs in the query word vector, and the other term is the sum of the occurrence counts of all query words in the query word vector.
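The weight formula is not reproduced in this text. Since it is defined only in terms of a word's occurrence count and the total count of all query words, the sketch below assumes the weight is their ratio (a normalised term frequency). That reading, and the function name `query_weights`, are assumptions.

```python
from collections import Counter
from typing import Dict, List

def query_weights(query_words: List[str]) -> Dict[str, float]:
    """Assumed weighting: w_j = count(q_j) / sum of counts of all query words."""
    counts = Counter(query_words)
    total = sum(counts.values())
    if total == 0:
        return {}  # empty query: no words, no weights
    return {word: count / total for word, count in counts.items()}

weights = query_weights(["deep", "learning", "deep", "model"])
# "deep" occurs 2 of 4 times -> 0.5; the other two words -> 0.25 each
```

Under this reading the weights of a query always sum to 1, so a repeated word simply takes a larger share of the query's influence.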
The preprocessing includes punctuation removal, word segmentation, stop word removal, stemming and the like.
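The preprocessing steps can be sketched as a small pipeline. This is a simplified illustration for English text: the stop word list and the suffix-stripping `naive_stem` are hypothetical stand-ins, and whitespace splitting replaces the word segmenter that a language such as Chinese would need.

```python
import re
from typing import List

# Hypothetical, deliberately tiny stop word list for illustration.
STOP_WORDS = {"a", "an", "the", "of", "for", "in", "on", "to", "and", "is"}

def naive_stem(word: str) -> str:
    """Very rough stemmer: strip one common suffix if a stem of 3+ letters remains."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(query: str) -> List[str]:
    """Punctuation removal -> segmentation -> stop word removal -> stemming."""
    text = re.sub(r"[^\w\s]", " ", query.lower())        # remove punctuation
    tokens = text.split()                                 # whitespace segmentation
    tokens = [t for t in tokens if t not in STOP_WORDS]   # remove stop words
    return [naive_stem(t) for t in tokens]                # stem what is left

tokens = preprocess("Training models for the ranking of queries!")
# tokens == ["train", "model", "rank", "queri"]
```

A production system would normally substitute an established stemmer and a curated stop word list; the order of the steps is what the sketch is meant to show.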
Step 4, calculating, in turn, the similarity between each word in Corpus and the query word vector, sorting the words in descending order of similarity, and selecting the first P words as the candidate expansion word set.
In the preferred scheme of this step, the similarity between the query word $q_j$ in the query word vector and the $t$-th word $Corpus_t$ in Corpus is calculated as follows:
[formula not reproduced in the source text]
In the above formula, $sim(\cdot)$ represents the similarity calculation function.
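Step 4 can be sketched as follows. The text leaves the similarity function open, so cosine similarity is assumed here, and each corpus word is scored by the weight-averaged similarity to the query words; both choices, the exclusion of the query words themselves, and the toy two-dimensional vectors are assumptions for illustration.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]

def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_p_candidates(query_weights: Dict[str, float],
                     corpus_vectors: Dict[str, Vector],
                     p: int) -> List[Tuple[str, float]]:
    """Score every corpus word against the weighted query and keep the best p."""
    scored: List[Tuple[str, float]] = []
    for word, vec in corpus_vectors.items():
        if word in query_weights:
            continue  # assumption: a query word is not its own expansion
        score = sum(w * cosine(corpus_vectors[q], vec)
                    for q, w in query_weights.items()
                    if q in corpus_vectors)  # skip out-of-vocabulary query words
        scored.append((word, score))
    scored.sort(key=lambda item: item[1], reverse=True)  # descending similarity
    return scored[:p]

corpus_vectors = {
    "car": [1.0, 0.0],
    "auto": [0.9, 0.1],
    "vehicle": [0.8, 0.3],
    "banana": [0.0, 1.0],
}
candidates = top_p_candidates({"car": 1.0}, corpus_vectors, p=2)
```

With real 300-dimensional vectors and a large vocabulary, the exhaustive loop would usually be replaced by a vectorised or approximate nearest-neighbour search.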
Step 5, for each word in the candidate expansion word set, calculating its similarity to the user personalized model constructed in step 1, and calculating the final ranking value of each candidate expansion word from that similarity.
In the preferred scheme of this step, the similarity between the $p$-th word $Term_p$ in the candidate expansion word set $Term$ and the personalized model $User_i$ of user $i$ is calculated as follows:
[formula not reproduced in the source text]
In the above formula, $sim(\cdot)$ represents the similarity calculation function.
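One way to compare a candidate word with the user model is sketched below. The formula is not reproduced in this text, so the aggregation used here, a sum over the user's tags of tag weight times cosine similarity, is an assumption, as are the names `user_similarity` and `tag_vectors`.

```python
import math
from typing import Dict, List

Vector = List[float]

def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def user_similarity(term_vec: Vector,
                    user_model: Dict[str, float],
                    tag_vectors: Dict[str, Vector]) -> float:
    """Assumed aggregation: sum of v_{i,k} * sim(Term_p, t_{i,k}) over the user's tags."""
    total = 0.0
    for tag, weight in user_model.items():
        if tag in tag_vectors:  # tags without a word vector are skipped
            total += weight * cosine(term_vec, tag_vectors[tag])
    return total

# Toy data: the user likes "cars" and dislikes "fruit".
user_model = {"cars": 0.5, "fruit": -0.5}
tag_vectors = {"cars": [1.0, 0.0], "fruit": [0.0, 1.0]}
score = user_similarity([1.0, 0.0], user_model, tag_vectors)
```

Because tag weights may be negative, a candidate close to a disliked tag is pushed down rather than merely ignored.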
The final ranking value $Rank_p$ of the $p$-th word $Term_p$ in the candidate expansion word set $Term$ is calculated as follows:
[formula not reproduced in the source text]
In the above formula, $\gamma$ is a hyper-parameter, $\gamma \in [0,1]$.
Step 6, sorting the candidate expansion words in descending order of final ranking value, and selecting the first Q words as the expansion words.
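Steps 5 and 6 together can be sketched as below. The ranking formula is not reproduced in this text; the sketch assumes a linear interpolation between the query similarity and the user-model similarity controlled by the hyper-parameter gamma in [0, 1]. That form, the function name `select_expansion_words` and the sample scores are assumptions.

```python
from typing import Dict, List

def select_expansion_words(query_sim: Dict[str, float],
                           user_sim: Dict[str, float],
                           gamma: float,
                           q: int) -> List[str]:
    """Rank candidates by gamma * query_sim + (1 - gamma) * user_sim, keep the top q."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    ranked = []
    for term, s_query in query_sim.items():
        s_user = user_sim.get(term, 0.0)  # no user signal counts as neutral
        ranked.append((term, gamma * s_query + (1.0 - gamma) * s_user))
    ranked.sort(key=lambda item: item[1], reverse=True)  # descending ranking value
    return [term for term, _ in ranked[:q]]

query_sim = {"auto": 0.9, "vehicle": 0.8, "truck": 0.7}
user_sim = {"auto": -0.5, "vehicle": 0.6, "truck": 0.2}
expansion = select_expansion_words(query_sim, user_sim, gamma=0.5, q=2)
# Ranking values: auto 0.2, vehicle 0.7, truck 0.45
```

In this example "auto" is the closest word to the query but falls out of the top two because the user model scores it negatively, which is the personalisation effect the method is after.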
The embodiments only illustrate the technical idea of the present invention and do not limit its scope of protection; any modification made on the basis of the technical scheme in accordance with the technical idea of the present invention falls within the scope of protection of the present invention.
Claims (6)
1. A query term expansion method based on a personalized model is characterized by comprising the following steps:
(1) constructing a user personalized model based on multiple labels, wherein the weight of each label can reflect the like or dislike of the user to the label;
the multi-tag based user personalization model is as follows:
$User_i = (t_{i,1}:v_{i,1},\ t_{i,2}:v_{i,2},\ \ldots,\ t_{i,n}:v_{i,n})$
in the above formula, $User_i$ is the personalized model of user $i$, $t_{i,k}$ is the $k$-th tag in $User_i$, $v_{i,k}$ is the weight of $t_{i,k}$, $k = 1, 2, \ldots, n$, and $n$ is the total number of tags in $User_i$;
the tag weight $v_{i,k}$ is calculated as follows:
[formula not reproduced in the source text]
in the above formula, $\alpha$ and $\beta$ are hyper-parameters, $\alpha \in [0,1]$; $F_{i,l}$, $F_{i,h}$ and $F_i$ are, in order, the low-score tag set, the high-score tag set and the no-score tag set labeled by user $i$; $L_{i,k}$, $H_{i,k}$ and $T_{i,k}$ are, in order, the low-score tag set, the high-score tag set and the no-score tag set for the $k$-th tag of user $i$; $|\cdot|$ denotes the number of elements in a set; and $\theta_{i,k}$ is the degree to which user $i$ likes or dislikes the tag, with $\theta_{i,k} \in [-1,1]$, where a larger value indicates that user $i$ likes the tag more;
(2) training word vectors on a large corpus using a deep learning word vector training method to generate the corpus word vectors Corpus;
(3) when a user initiates a query, preprocessing the query sentence to generate a query word vector;
(4) calculating, in turn, the similarity between each word in Corpus and the query word vector, sorting the words in descending order of similarity, and selecting the first P words as the candidate expansion word set;
(5) for each word in the candidate expansion word set, calculating its similarity to the user personalized model constructed in step (1), and calculating the final ranking value of each candidate expansion word from that similarity;
(6) sorting the candidate expansion words in descending order of final ranking value, and selecting the first Q words as the expansion words.
2. The personalized model based query term expansion method of claim 1, wherein $\theta_{i,k}$ is calculated as follows:
[formula not reproduced in the source text]
in the above formula, $r_{i,k}$ is the score given by user $i$ to a resource with tag $t_{i,k}$, and $avg_i$, $max_i$ and $min_i$ are, in order, the average, highest and lowest scores given by user $i$; when $r_{i,k} \ge max_i$ or $max_i > r_{i,k} \ge avg_i$, $\theta_{i,k} \in [0,1]$, indicating that user $i$ likes this resource and the corresponding tag $t_{i,k}$; when $avg_i > r_{i,k} > min_i$ or $r_{i,k} \le min_i$, $\theta_{i,k} \in [-1,0)$, indicating that user $i$ dislikes this resource and the corresponding tag $t_{i,k}$.
3. The personalized model based query term expansion method according to claim 1 or 2, wherein in step (3), the generated query word vector is as follows:
[formula not reproduced in the source text]
in the above formula, which defines the query word vector, $q_j$ represents one of the query words, $j = 1, 2, \ldots, m$, $m$ is the number of query words in the query word vector, and $w_j$ is the weight of $q_j$, calculated by the following formula:
[formula not reproduced in the source text]
4. The personalized model based query term expansion method of claim 3, wherein in step (4), the similarity between the query word $q_j$ in the query word vector and the $t$-th word $Corpus_t$ in Corpus is calculated as follows:
[formula not reproduced in the source text]
in step (5), the similarity between the $p$-th word $Term_p$ in the candidate expansion word set $Term$ and the personalized model $User_i$ of user $i$ is calculated as follows:
[formula not reproduced in the source text]
in the above formulas, $sim(\cdot)$ represents the similarity calculation function.
6. The personalized model based query term expansion method of claim 1, wherein in step (3), the preprocessing comprises punctuation removal, word segmentation, stop word removal and stemming.
Priority Applications (1)
Application Number | Publication Number | Priority Date | Filing Date | Title
---|---|---|---|---
CN201811073589.2A | CN109063203B (en) | 2018-09-14 | 2018-09-14 | Query term expansion method based on personalized model
Publications (2)
Publication Number | Publication Date
---|---
CN109063203A (en) | 2018-12-21
CN109063203B (en) | 2020-07-24
Family
ID=64761795
Family Applications (1)
Application Number | Status | Publication Number | Priority Date | Filing Date | Title
---|---|---|---|---|---
CN201811073589.2A | Active | CN109063203B (en) | 2018-09-14 | 2018-09-14 | Query term expansion method based on personalized model
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109063203B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102713905A (en) * | 2010-01-08 | 2012-10-03 | 瑞典爱立信有限公司 | A method and apparatus for social tagging of media files |
CN104866554A (en) * | 2015-05-15 | 2015-08-26 | 大连理工大学 | Personalized searching method and system on basis of social annotation |
CN105183803A (en) * | 2015-08-25 | 2015-12-23 | 天津大学 | Personalized search method and search apparatus thereof in social network platform |
CN106547864A (en) * | 2016-10-24 | 2017-03-29 | 湖南科技大学 | A kind of Personalized search based on query expansion |
CN108491462A (en) * | 2018-03-05 | 2018-09-04 | 昆明理工大学 | A kind of semantic query expansion method and device based on word2vec |
Also Published As
Publication number | Publication date |
---|---|
CN109063203A (en) | 2018-12-21 |
Legal Events
Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication |
 | SE01 | Entry into force of request for substantive examination |
 | GR01 | Patent grant |