CN109063203B - Query term expansion method based on personalized model - Google Patents

Publication number: CN109063203B (application CN201811073589.2A; other version: CN109063203A)
Authority: CN (China)
Legal status: Active (granted)
Inventors: 勾智楠, 韩立新, 袁宝华, 刘元珍, 张青
Assignee: Hohai University (HHU)
Filed 2018-09-14 by Hohai University (HHU); application granted as CN109063203B.
Classification: Information Retrieval, DB Structures and FS Structures Therefor
Abstract

The invention discloses a query term expansion method based on a personalized model. First, a personalized user model is generated from the tags a user attaches to resources and the scores the user gives those resources. Next, the words of a corpus are trained with a deep-learning method to obtain corpus word vectors, mapping each word to a high-dimensional vector. Using the trained word vectors, the P words most similar to each word in the query are computed as candidate expansion words. Finally, the candidates are ranked in descending order in combination with the user's personalized model, and the top Q words are taken as the expansion words. By incorporating the personalized model, the invention mines expansion words more effectively and improves retrieval precision.

Description

Query term expansion method based on personalized model
Technical Field
The invention belongs to the technical field of information retrieval and natural language understanding and processing, and particularly relates to a query term expansion method based on a personalized model.
Background
Even after decades of development in information retrieval, the query keywords supplied by the user remain the main basis for retrieval, but the small amount of information they carry makes retrieval results insufficiently accurate. This shows in two main ways: 1) when searching in a complex professional field, the query keywords are often ambiguous because the user's domain knowledge is limited; 2) because users express themselves in different ways, keyword-based retrieval often fails to find related content.
Currently, personalized search has become an important way for users to find the information they need. It requires building a personalized user model and continuously tracking the user's information needs. The personalized model can be built mainly for an individual user, or modeled for a collaborative group. Meanwhile, with the wide adoption of social annotation systems, users' tagging and scoring behavior can readily be used to construct the personalized model. Personalized search still faces the same problem, however: query terms carry little information, which makes retrieval results less accurate.
To address these problems, researchers at home and abroad have proposed a series of personalized result re-ranking methods and achieved some success. However effective re-ranking may be, though, it brings no qualitative improvement if the relevant results are not captured in the first round of search.
Query expansion can, to some extent, compensate for the limited information in the keywords the user provides: starting from the original keywords, various methods enrich the query, and retrieval is then performed with these more informative query terms. Researchers at home and abroad have carried out a series of effective studies on query expansion and proposed some inspiring methods and techniques, including expanding query terms with tag relationships, word co-occurrence relationships, and semantic matching based on topic models.
Each of these methods has drawbacks, however. Tags do not necessarily reflect resource content well, so query expansion based on tag relationships is not very effective. Word co-occurrence methods ignore the semantic relationships in the personalized model. And although topic-model methods based on semantic matching improve retrieval, the improvement is limited because no personalized model is considered.
Disclosure of Invention
To solve the above technical problems in the background art, the invention aims to provide a query term expansion method based on a personalized model that, by incorporating the personalized model, mines expansion words more effectively and improves retrieval precision.
In order to achieve the technical purpose, the technical scheme of the invention is as follows:
A query term expansion method based on a personalized model comprises the following steps:
(1) constructing a multi-tag user personalized model, in which the weight of each tag reflects how much the user likes or dislikes that tag;
(2) training word vectors on a large corpus with a deep-learning word-vector training method, generating the corpus word vectors, denoted Corpus;
(3) when a user initiates a query, preprocessing the query sentence to generate a query word vector;
(4) calculating, in turn, the similarity between each word in Corpus and the query word vector, sorting the words in descending order of similarity, and selecting the first P words as the candidate expansion word set;
(5) for each word in the candidate expansion word set, calculating its similarity to the user personalized model constructed in step (1), and from this similarity calculating each candidate expansion word's final ranking value;
(6) sorting the candidate expansion words in descending order of final ranking value and selecting the first Q words as the expansion words.
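The six steps above can be sketched end to end in code. This is a toy illustration only, not the patented implementation: the 2-d vectors, the cosine choice of sim(·), the vector stand-in for the user personalized model, and the linear γ blend in step (5) are all assumptions made for the sketch.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def expand_query(query_weights, corpus_vecs, user_vec, P=5, Q=2, gamma=0.5):
    """Toy end-to-end sketch of steps (3)-(6).

    query_weights: {query_word: weight}, corpus_vecs: {word: vector},
    user_vec: a vector standing in for the user personalized model.
    """
    # Step (4): score every corpus word against the weighted query words.
    scores = {}
    for word, vec in corpus_vecs.items():
        scores[word] = sum(w * cosine(corpus_vecs[q], vec)
                           for q, w in query_weights.items()
                           if q in corpus_vecs)
    candidates = sorted(scores, key=scores.get, reverse=True)[:P]
    # Steps (5)-(6): blend query similarity with user-model similarity
    # via a hyper-parameter gamma, then keep the top Q words.
    rank = {c: gamma * scores[c] + (1 - gamma) * cosine(corpus_vecs[c], user_vec)
            for c in candidates}
    return sorted(rank, key=rank.get, reverse=True)[:Q]
```

A real system would exclude the original query words from the candidates and use high-dimensional trained vectors; the sketch keeps them to stay short.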
Further, in step (1), the multi-tag user personalized model is as follows:
User_i = (t_{i,1}:v_{i,1}, t_{i,2}:v_{i,2}, ..., t_{i,n}:v_{i,n})
In the above formula, User_i is the personalized model of user i, t_{i,k} is the k-th tag in User_i, v_{i,k} is the weight of t_{i,k}, k = 1, 2, ..., n, and n is the total number of tags in User_i.
Further, the tag weight v_{i,k} is calculated as follows:
[Formula given as an image in the original; not reproduced here.]
In the above formula, α and β are hyper-parameters (α ∈ [0,1]); F_{i,l}, F_{i,h}, and F_i are, in order, user i's low-score tag set, high-score tag set, and unscored tag set; L_{i,k}, H_{i,k}, and T_{i,k} are, in order, the occurrences of the k-th tag of user i in the low-score, high-score, and unscored tag sets; |·| denotes the number of elements in a set; and θ_{i,k} ∈ [−1,1] is the degree to which user i likes or dislikes the tag, a larger value indicating a stronger preference.
Further, θ_{i,k} is calculated as follows:
[Formula given as an image in the original; not reproduced here.]
In the above formula, r_{i,k} is user i's score for a resource carrying tag t_{i,k}; avg_i, max_i, and min_i are, in order, the average, highest, and lowest scores given by user i. When r_{i,k} ≥ max_i or max_i > r_{i,k} ≥ avg_i, θ_{i,k} ∈ [0,1], indicating that user i likes this resource and the corresponding tag t_{i,k}; when avg_i > r_{i,k} > min_i or r_{i,k} ≤ min_i, θ_{i,k} ∈ [−1,0), indicating that user i dislikes this resource and the corresponding tag t_{i,k}.
Further, in step (3), the generated query word vector is as follows:
q⃗ = (q_1:w_1, q_2:w_2, ..., q_m:w_m)
In the above formula, q⃗ is the query word vector, q_j is one of the query words, j = 1, 2, ..., m, m is the number of query words in the query word vector, and w_j is the weight of q_j, calculated as follows:
w_j = |q_j| / Σ_{k=1}^{m} |q_k|
In the above formula, |q_j| is the number of times query word q_j occurs in the query word vector, and the denominator is the total number of occurrences of all query words in q⃗.
Further, in step (4), the similarity between query word q_j in the query word vector q⃗ and the t-th word Corpus_t in Corpus is calculated as follows:
[Formula given as an image in the original; not reproduced here.]
In step (5), the similarity between the p-th word Term_p in the candidate expansion word set Term and the personalized model User_i of user i is calculated as follows:
[Formula given as an image in the original; not reproduced here.]
In the above formulas, sim(·) denotes the similarity calculation function.
Further, in step (5), the final ranking value Rank_p of the p-th word Term_p in the candidate expansion word set Term is calculated as follows:
[Formula given as an image in the original; not reproduced here.]
In the above formula, γ is a hyper-parameter with γ ∈ [0,1].
Further, in step (3), the preprocessing includes punctuation removal, word segmentation, stop-word removal, and stemming.
The above technical scheme brings the following beneficial effects:
The method combines the personalized model with query expansion, so the expansion better matches the individual user's interests and improves query precision. The method adopts the deep-learning word-vector training approach proposed by Google, which aims to capture semantic relations among words and outperforms traditional statistical approaches such as word co-occurrence. The method is highly general: it suits query term expansion for all kinds of personalized search, extends well, and can be applied directly in a search-engine system.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The technical scheme of the invention is explained in detail below with reference to the accompanying drawings.
The invention provides a query term expansion method based on a personalized model; the specific steps are shown in FIG. 1.
Step 1, construct a multi-tag user personalized model, in which the weight of each tag reflects how much the user likes or dislikes that tag.
The preferred scheme of this step is that the multi-tag user personalized model is as follows:
User_i = (t_{i,1}:v_{i,1}, t_{i,2}:v_{i,2}, ..., t_{i,n}:v_{i,n})
In the above formula, User_i is the personalized model of user i, t_{i,k} is the k-th tag in User_i, v_{i,k} is the weight of t_{i,k}, k = 1, 2, ..., n, and n is the total number of tags in User_i.
The tag weight v_{i,k} is calculated as follows:
[Formula given as an image in the original; not reproduced here.]
In the above formula, α and β are hyper-parameters (α ∈ [0,1]); F_{i,l}, F_{i,h}, and F_i are, in order, user i's low-score tag set, high-score tag set, and unscored tag set; L_{i,k}, H_{i,k}, and T_{i,k} are, in order, the occurrences of the k-th tag of user i in the low-score, high-score, and unscored tag sets; |·| denotes the number of elements in a set; and θ_{i,k} ∈ [−1,1] is the degree to which user i likes or dislikes the tag, a larger value indicating a stronger preference.
θ_{i,k} is calculated as follows:
[Formula given as an image in the original; not reproduced here.]
In the above formula, r_{i,k} is user i's score for a resource carrying tag t_{i,k}; avg_i, max_i, and min_i are, in order, the average, highest, and lowest scores given by user i. When r_{i,k} ≥ max_i or max_i > r_{i,k} ≥ avg_i, θ_{i,k} ∈ [0,1], indicating that user i likes this resource and the corresponding tag t_{i,k}; when avg_i > r_{i,k} > min_i or r_{i,k} ≤ min_i, θ_{i,k} ∈ [−1,0), indicating that user i dislikes this resource and the corresponding tag t_{i,k}.
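The formula image for θ_{i,k} is not reproduced in this record. The sketch below is therefore a hypothetical reconstruction: a piecewise normalization that merely matches the stated behavior (scores at or above the user's average map into [0, 1], scores below it into [−1, 0)), not the patent's actual formula.

```python
def theta(r, avg, mx, mn):
    """Hypothetical reconstruction of the like/dislike degree theta_{i,k}.

    r: the user's score for the resource; avg/mx/mn: the user's average,
    highest, and lowest scores. Scores at or above the average normalize
    into [0, 1] (like); scores below it into [-1, 0) (dislike).
    """
    if r >= avg:
        return (r - avg) / (mx - avg) if mx > avg else 1.0
    return (r - avg) / (avg - mn) if avg > mn else -1.0
```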
Step 2, train word vectors on a large corpus with a deep-learning word-vector training method to generate the corpus word vectors, denoted Corpus.
For example, the Skip-Gram model is a commonly used training model that the invention can use (but is not limited to). Its training principle is to predict the surrounding words from the current word. After training, each word is assigned a fixed-length vector representation; the length can be chosen freely, for example 300 dimensions. The larger the corpus, the better the training effect, so a very large corpus should be provided for training the word vectors.
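To make the Skip-Gram principle concrete, the snippet below generates the (current word, surrounding word) training pairs that the model learns from; the window size is a free parameter. This only illustrates pair extraction, not the full training procedure.

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs as used by Skip-Gram:
    each current word is paired with the words around it within the
    window, and the model then learns to predict context from center."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs
```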
Step 3, when the user initiates a query, preprocess the query sentence to generate the query word vector.
The preferable scheme of this step is that the generated query word vector is as follows:
q⃗ = (q_1:w_1, q_2:w_2, ..., q_m:w_m)
In the above formula, q⃗ is the query word vector, q_j is one of the query words, j = 1, 2, ..., m, m is the number of query words in the query word vector, and w_j is the weight of q_j, calculated as follows:
w_j = |q_j| / Σ_{k=1}^{m} |q_k|
In the above formula, |q_j| is the number of times query word q_j occurs in the query word vector, and the denominator is the total number of occurrences of all query words in q⃗.
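The weight computation just described — each distinct query word weighted by its share of all query-word occurrences — is simple enough to state directly in code:

```python
from collections import Counter

def query_weights(tokens):
    """Weight each distinct query word q_j by its share of all
    query-word occurrences: w_j = |q_j| / sum_k |q_k|."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {q: c / total for q, c in counts.items()}
```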
The preprocessing includes punctuation removal, word segmentation, stop-word removal, stemming, and the like.
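A minimal sketch of this preprocessing chain for an English query follows. The stop-word list and the crude suffix-stripping stemmer are illustrative stand-ins: a real system would use a full stop-word list, a proper stemmer such as Porter's, and, for Chinese queries, a word-segmentation tool.

```python
import re

STOPWORDS = {"the", "a", "of", "in"}  # illustrative stop-word list

def preprocess(sentence):
    """Sketch of the preprocessing steps named in the text: punctuation
    removal, word segmentation (plain whitespace tokenization here),
    stop-word removal, and stemming (a crude '-ing' suffix strip
    standing in for a real stemmer)."""
    sentence = re.sub(r"[^\w\s]", " ", sentence.lower())
    tokens = [t for t in sentence.split() if t not in STOPWORDS]
    return [t[:-3] if t.endswith("ing") else t for t in tokens]
```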
Step 4, calculate in turn the similarity between each word in Corpus and the query word vector, sort the words in descending order of similarity, and select the first P words as the candidate expansion word set.
The preferred scheme of this step is that the similarity between query word q_j in the query word vector q⃗ and the t-th word Corpus_t in Corpus is calculated as follows:
[Formula given as an image in the original; not reproduced here.]
In the above formula, sim(·) denotes the similarity calculation function.
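In practice, sim(·) over word vectors is typically cosine similarity. The sketch below implements step 4's candidate selection under that assumption; since the patent's combination formula is only available as an image, the exact aggregation may differ.

```python
import numpy as np

def sim(a, b):
    """Cosine similarity, a common choice of sim(*) for word vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_p(query_vec, corpus_vecs, P):
    """Step 4 sketch: rank corpus words by similarity to the query
    vector and keep the first P as candidate expansion words."""
    ranked = sorted(corpus_vecs,
                    key=lambda w: sim(query_vec, corpus_vecs[w]),
                    reverse=True)
    return ranked[:P]
```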
Step 5, for each word in the candidate expansion word set, calculate its similarity to the user personalized model constructed in step 1, and from this similarity calculate each candidate expansion word's final ranking value.
The preferred scheme of this step is that the similarity between the p-th word Term_p in the candidate expansion word set Term and the personalized model User_i of user i is calculated as follows:
[Formula given as an image in the original; not reproduced here.]
In the above formula, sim(·) denotes the similarity calculation function.
The final ranking value Rank_p of the p-th word Term_p in the candidate expansion word set Term is calculated as follows:
[Formula given as an image in the original; not reproduced here.]
In the above formula, γ is a hyper-parameter with γ ∈ [0,1].
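The Rank_p formula itself is only available as an image. A plausible reading consistent with the text — a single hyper-parameter γ ∈ [0,1] blending query similarity with user-model similarity — is the linear interpolation below, offered as an assumption rather than the patent's exact formula.

```python
def final_rank(query_sim, user_sim, gamma=0.5):
    """Hypothetical reconstruction of Rank_p: linearly interpolate
    between the candidate's similarity to the query (query_sim) and
    its similarity to the user personalized model (user_sim), with
    the hyper-parameter gamma in [0, 1] controlling the trade-off."""
    return gamma * query_sim + (1 - gamma) * user_sim
```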
Step 6, sort the candidate expansion words in descending order of final ranking value and select the first Q words as the expansion words.
The embodiments are only intended to illustrate the technical idea of the invention and do not limit its scope of protection; any modification made on the basis of the technical scheme according to the technical idea of the invention falls within the scope of protection of the invention.

Claims (6)

1. A query term expansion method based on a personalized model, characterized by comprising the following steps:
(1) constructing a multi-tag user personalized model, in which the weight of each tag reflects how much the user likes or dislikes that tag;
the multi-tag user personalized model is as follows:
User_i = (t_{i,1}:v_{i,1}, t_{i,2}:v_{i,2}, ..., t_{i,n}:v_{i,n})
In the above formula, User_i is the personalized model of user i, t_{i,k} is the k-th tag in User_i, v_{i,k} is the weight of t_{i,k}, k = 1, 2, ..., n, and n is the total number of tags in User_i;
the tag weight v_{i,k} is calculated as follows:
[Formula given as an image in the original; not reproduced here.]
In the above formula, α and β are hyper-parameters (α ∈ [0,1]); F_{i,l}, F_{i,h}, and F_i are, in order, user i's low-score tag set, high-score tag set, and unscored tag set; L_{i,k}, H_{i,k}, and T_{i,k} are, in order, the occurrences of the k-th tag of user i in the low-score, high-score, and unscored tag sets; |·| denotes the number of elements in a set; and θ_{i,k} ∈ [−1,1] is the degree to which user i likes or dislikes the tag, a larger value indicating a stronger preference;
(2) training word vectors on a large corpus with a deep-learning word-vector training method, generating the corpus word vectors, denoted Corpus;
(3) when a user initiates a query, preprocessing the query sentence to generate a query word vector;
(4) calculating, in turn, the similarity between each word in Corpus and the query word vector, sorting the words in descending order of similarity, and selecting the first P words as the candidate expansion word set;
(5) for each word in the candidate expansion word set, calculating its similarity to the user personalized model constructed in step (1), and from this similarity calculating each candidate expansion word's final ranking value;
(6) sorting the candidate expansion words in descending order of final ranking value and selecting the first Q words as the expansion words.
2. The personalized-model-based query term expansion method of claim 1, wherein θ_{i,k} is calculated as follows:
[Formula given as an image in the original; not reproduced here.]
In the above formula, r_{i,k} is user i's score for a resource carrying tag t_{i,k}; avg_i, max_i, and min_i are, in order, the average, highest, and lowest scores given by user i; when r_{i,k} ≥ max_i or max_i > r_{i,k} ≥ avg_i, θ_{i,k} ∈ [0,1], indicating that user i likes this resource and the corresponding tag t_{i,k}; when avg_i > r_{i,k} > min_i or r_{i,k} ≤ min_i, θ_{i,k} ∈ [−1,0), indicating that user i dislikes this resource and the corresponding tag t_{i,k}.
3. The personalized-model-based query term expansion method of claim 1 or 2, wherein in step (3) the generated query word vector is as follows:
q⃗ = (q_1:w_1, q_2:w_2, ..., q_m:w_m)
In the above formula, q⃗ is the query word vector, q_j is one of the query words, j = 1, 2, ..., m, m is the number of query words in the query word vector, and w_j is the weight of q_j, calculated as follows:
w_j = |q_j| / Σ_{k=1}^{m} |q_k|
In the above formula, |q_j| is the number of times query word q_j occurs in the query word vector, and the denominator is the total number of occurrences of all query words in q⃗.
4. The personalized-model-based query term expansion method of claim 3, wherein in step (4) the similarity between query word q_j in the query word vector q⃗ and the t-th word Corpus_t in Corpus is calculated as follows:
[Formula given as an image in the original; not reproduced here.]
In step (5), the similarity between the p-th word Term_p in the candidate expansion word set Term and the personalized model User_i of user i is calculated as follows:
[Formula given as an image in the original; not reproduced here.]
In the above formulas, sim(·) denotes the similarity calculation function.
5. The personalized-model-based query term expansion method of claim 4, wherein in step (5) the final ranking value Rank_p of the p-th word Term_p in the candidate expansion word set Term is calculated as follows:
[Formula given as an image in the original; not reproduced here.]
In the above formula, γ is a hyper-parameter with γ ∈ [0,1].
6. The personalized-model-based query term expansion method of claim 1, wherein in step (3) the preprocessing comprises punctuation removal, word segmentation, stop-word removal, and stemming.
CN201811073589.2A 2018-09-14 2018-09-14 Query term expansion method based on personalized model Active CN109063203B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811073589.2A CN109063203B (en) 2018-09-14 2018-09-14 Query term expansion method based on personalized model


Publications (2)

Publication Number Publication Date
CN109063203A CN109063203A (en) 2018-12-21
CN109063203B true CN109063203B (en) 2020-07-24

Family

ID=64761795


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102713905A * 2010-01-08 2012-10-03 Ericsson (Sweden) A method and apparatus for social tagging of media files
CN104866554A (en) * 2015-05-15 2015-08-26 大连理工大学 Personalized searching method and system on basis of social annotation
CN105183803A (en) * 2015-08-25 2015-12-23 天津大学 Personalized search method and search apparatus thereof in social network platform
CN106547864A (en) * 2016-10-24 2017-03-29 湖南科技大学 A kind of Personalized search based on query expansion
CN108491462A (en) * 2018-03-05 2018-09-04 昆明理工大学 A kind of semantic query expansion method and device based on word2vec


Also Published As

Publication number Publication date
CN109063203A (en) 2018-12-21


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant