CN108491462A - Semantic query expansion method and device based on word2vec - Google Patents

Info

Publication number
CN108491462A
Authority
CN
China
Prior art keywords
word
expansion
inquiry
query
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810179478.3A
Other languages
Chinese (zh)
Other versions
CN108491462B (en)
Inventor
章露露
贾连印
李孟娟
丁家满
李晓武
陈文焰
吕晓伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN201810179478.3A
Publication of CN108491462A
Application granted
Publication of CN108491462B
Active
Anticipated expiration

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a semantic query expansion method and device based on word2vec, belonging to the technical field of information retrieval. The method comprises: a preprocessing step for the query given by the user, in which the query is segmented into words, stop words are removed, and stemming is applied; a candidate expansion word selection step, in which initial expansion words are chosen using the word2vec tool; an expansion vocabulary construction step, in which the candidate expansion word set is filtered to build the final expansion vocabulary; and a query expansion step, in which the user's query and its expansion words are matched against the index, and the relevant documents are returned in ranked order. The invention proposes an expansion-word-oriented query vector generation method to filter the candidate expansion words and build the expansion vocabulary, so that the relevance between each expansion word and the whole query is better captured, which in turn improves the effectiveness of query expansion.

Description

Semantic query expansion method and device based on word2vec
Technical field
The present invention relates to a semantic query expansion method and device based on word2vec, and belongs to the technical field of information retrieval.
Background
Query expansion is an important problem in information retrieval. In current information retrieval models and systems, information is stored in the form of characters, words, or phrases; after a user submits a query, a relevant document can be retrieved only if the query terms occur in that document. In natural language, however, the same concept often has many different expressions. For example, when searching for "automobile" without expansion, documents containing "car", "sedan", "Ford", and so on are highly relevant to the user's original query but cannot be retrieved because the words differ, so the user does not obtain satisfactory results. Because of this query term mismatch problem, users sometimes have to vary their query terms to find the information they need. To relieve users of this burden, the information retrieval system should automatically select other terms related to the query to assist retrieval; that is, query expansion is used to solve the term mismatch problem.
When a user submits a query, search engines usually include query expansion as an essential module in order to improve the user's satisfaction with the retrieval. The commonly used query expansion methods are mainly the following:
1. Query expansion based on a semantic knowledge dictionary:
Methods based on a semantic knowledge dictionary rely mainly on resources such as WordNet, HowNet, or other Chinese thesauri, and select words that have some semantic relation to the query terms for expansion, typically the hypernyms, hyponyms, and synonyms of the query terms. Such methods depend heavily on a complete semantic system and are independent of the corpus to be searched, so the selected expansion words often fail to reflect the characteristics of the corpus, and good retrieval results are hard to obtain.
2. Query expansion based on global analysis:
Global analysis first performs correlation analysis on the words or phrases in all documents and computes the degree of association between each pair of words; the words most strongly associated with the query terms are then added to the initial query to form a new query. The advantage of this method is that it explores the relations between words as fully as possible, and once the dictionary is built, query expansion can be carried out efficiently. Its drawback is that when the document collection is very large, building the full word-relation dictionary is often infeasible in both time and space, and the cost of updating it when the collection changes is even higher.
3. Query expansion based on local analysis:
Local analysis addresses expansion mainly through a two-pass retrieval: the given query is first used directly for retrieval, the n documents most relevant to the original query are taken as the source of expansion words, and the words in these n documents most relevant to the original query are added to the initial query to form a new query. The most popular local-analysis method at present is pseudo-relevance feedback, which developed from relevance feedback. The two differ in that relevance feedback requires the user to judge the results of the initial retrieval, and the documents the user considers relevant serve as the source of expansion words, whereas pseudo-relevance feedback needs no user interaction and simply treats the top n returned documents as relevant. Although local analysis is currently the most widely used query expansion method, its weakness is that when the top-ranked documents of the initial retrieval are only weakly relevant to the original query, many irrelevant words are easily added to the query, causing the "query drift" problem.
With the introduction of semantic models such as word2vec and GloVe, word embedding techniques have attracted the attention of many researchers across several areas of natural language processing in recent years. Word vectors trained with the models provided by word2vec or GloVe reflect semantic and syntactic relations in natural language, and the similarity between terms can be judged by computing the cosine between their word vectors, so they are well suited to query expansion.
There is already a considerable body of work on word2vec-based query expansion, but most of it has the following two main shortcomings:
(1) When building the expansion vocabulary, only words related to individual query terms are chosen as expansion words, without considering their relevance to the whole query.
(2) Work that does consider relevance to the whole query mostly assumes that the query vector is fixed for all expansion words, so the query vector is usually the simple sum or mean of the vectors of the query terms.
In general, however, for an expansion word of query term q, the influence of the other query terms on that expansion word should not be equal to the influence of q itself. The idea of generating different query vectors centered on different words of the query has been widely applied in other word-embedding-based areas of semantic information retrieval, such as disambiguation, with good results, but it has not yet been applied to query expansion.
Summary of the invention
The technical problem to be solved by the present invention is to provide a semantic query expansion method and device based on word2vec, with the aim of building an expansion vocabulary that is more relevant to the query, so that documents relevant to the user's query are returned more completely.
The technical solution of the present invention is a semantic query expansion method based on word2vec, comprising:
Query and document preprocessing step: segment the query submitted by the user into words, remove stop words, extract the keywords of the user's query, and apply stemming to form the query Q; apply the same preprocessing to the document collection to obtain the document collection D;
Candidate expansion word selection step: for the preprocessed query Q, use word vectors trained with the word2vec model to compute the n terms most similar to each query keyword, which form the candidate expansion word set C;
Expansion vocabulary construction step: for each term in C, compute its similarity to the whole query, and select the k expansion words with the highest similarity to construct the expansion vocabulary T;
Inverted index construction step: build an inverted index over the preprocessed document collection D;
Query expansion step: compute the relevance between the expanded query and the documents in the corresponding inverted lists, and rank the documents by relevance.
The query and document preprocessing step specifically comprises the following steps:
(1) segment the query submitted by the user into words at whitespace and punctuation characters;
(2) after segmentation, remove stop words, filtering out words that do not represent a concept;
(3) after stop-word removal, apply stemming to generate the query Q;
(4) apply the same preprocessing to the document collection to generate the new document collection D.
The candidate expansion word selection step specifically comprises the following steps:
(1) given a corpus, train word vectors with the training model provided by word2vec. A word vector is a multidimensional real-valued vector that reflects semantic and syntactic relations in natural language, so the similarity between terms can be judged by computing the cosine between their word vectors;
(2) after the word vectors are obtained, for each keyword $q_i$ in Q, use the cosine similarity of the word vectors to find the n words most similar to $q_i$, which form the candidate expansion word set of the query.
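The selection step above can be sketched in a few lines of Python. This is only a minimal illustration: the three-dimensional vectors are hand-made toy values rather than trained word2vec embeddings, and the function names `cosine` and `top_n_candidates` are illustrative and do not come from the patent.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_n_candidates(keyword, vectors, n):
    """Return the n vocabulary words most similar to `keyword`, excluding itself."""
    scored = [(word, cosine(vectors[keyword], vec))
              for word, vec in vectors.items() if word != keyword]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in scored[:n]]

# Toy 3-dimensional vectors (hypothetical values, not trained embeddings).
toy_vectors = {
    "aircraft":   [0.9, 0.1, 0.0],
    "airplane":   [0.8, 0.2, 0.1],
    "helicopter": [0.7, 0.3, 0.2],
    "banana":     [0.0, 0.1, 0.9],
}

candidates = top_n_candidates("aircraft", toy_vectors, n=2)
print(candidates)
```

In a real system the dictionary would hold the trained vectors of the whole vocabulary, and an approximate nearest-neighbour index might replace the linear scan.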
The expansion vocabulary construction step specifically comprises the following steps:
(1) for the query Q formed by the above processing, generate for each keyword $q_i$ in Q a query vector $\mathrm{vec}(Q_{q_i})$ of Q relative to $q_i$ as follows:
In the formula, $\mathrm{vec}(q_i)$ denotes the vector of query term $q_i$, and $\mathrm{sim}(q_i, q_j)$ denotes the similarity between $q_i$ and $q_j$.
(2) for each candidate expansion word t of $q_i$, compute the similarity between t and the query Q as follows:
$$\mathrm{sim}(t, Q) = \cos\big(\mathrm{vec}(t), \mathrm{vec}(Q_{q_i})\big)$$
For the candidate expansion words of different query terms, different query vectors $\mathrm{vec}(Q_{q_i})$ are used to compute the similarity between the expansion word and the query Q. The present invention therefore calls the method of generating $\mathrm{vec}(Q_{q_i})$ the expansion-word-oriented query vector generation method, and correspondingly calls $\mathrm{vec}(Q_{q_i})$ the expansion-word-oriented query vector;
(3) for the expansion words of each query term, compute the similarity to the whole query Q according to the above model, then re-rank the expansion words by similarity and return the k expansion words with the highest similarity as the final expansion word set T;
(4) generate the expanded query $Q_{exp} = Q \cup T$.
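A possible sketch of steps (1) to (3) follows. The query vector formula is not reproduced in this text, so the code assumes one plausible form: the sum of the query term vectors, each weighted by its cosine similarity to $q_i$. The patent's actual formula may differ. The candidate score cos(vec(t), vec(Q_qi)) is the one stated in claim 4. All vectors, candidate lists, and helper names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def query_vector(q_i, query, vectors):
    """ASSUMED form: sum of the query term vectors weighted by sim(q_i, q_j)."""
    result = [0.0] * len(vectors[q_i])
    for q_j in query:
        weight = cosine(vectors[q_i], vectors[q_j])  # equals 1 when q_j is q_i
        for d, value in enumerate(vectors[q_j]):
            result[d] += weight * value
    return result

def build_expansion_set(query, candidates, vectors, k):
    """Score each candidate t of q_i against vec(Q_qi) and keep the k best."""
    best = {}
    for q_i in query:
        q_vec = query_vector(q_i, query, vectors)
        for t in candidates[q_i]:
            if t in query:
                continue  # never re-add an original query term
            score = cosine(vectors[t], q_vec)
            best[t] = max(score, best.get(t, -1.0))
    return sorted(best, key=lambda t: best[t], reverse=True)[:k]

# Hypothetical 2-dimensional toy vectors and candidate lists.
vectors = {
    "speed": [1.0, 0.2], "aircraft": [0.2, 1.0],
    "velocity": [0.9, 0.4], "sprint": [1.0, -0.6],
    "airplane": [0.3, 0.9], "airport": [-0.5, 0.8],
}
candidates = {"speed": ["velocity", "sprint"], "aircraft": ["airplane", "airport"]}

expansion = build_expansion_set(["speed", "aircraft"], candidates, vectors, k=2)
print(expansion)
```

Because the query vector is recomputed for every $q_i$, a candidate is compared against a vector dominated by the query term it came from but still pulled toward the rest of the query, which is the behaviour the text describes.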
The inverted index construction step specifically comprises the following steps:
(1) for the preprocessed document collection D, collect all words of D and remove duplicates to generate the document vocabulary V;
(2) for each term v in V, construct an inverted list consisting of the ID ($d_{id}$) of every document d (where $d \in D$) that contains v, together with the number of occurrences $tf_{v,d}$ of v in d. Each entry of the list is a pair $\langle d_{id}, tf_{v,d} \rangle$, and the set of all inverted lists constitutes the inverted index I;
(3) for each term v, count the number m of documents in which it occurs, and compute the idf score of v according to the following formula:
where $|D|$ denotes the total number of documents in D.
The query expansion step specifically comprises the following steps:
(1) for each keyword in $Q_{exp}$, look it up in the inverted index I to obtain its inverted list, and denote the set of these inverted lists by $I_{Q_{exp}}$;
(2) for each document d appearing in $I_{Q_{exp}}$, accumulate its tf-idf scores over the lists in $I_{Q_{exp}}$ to obtain the relevance $R(Q_{exp}, d)$ between $Q_{exp}$ and document d, computed by the following formula:
In the formula, λ is a tuning parameter that controls the weights of the query terms and the expansion words when computing the relevance.
(3) rank these documents by relevance and return the N documents most relevant to the original query.
A semantic query expansion device based on word2vec, comprising:
a query and document collection preprocessing module, for applying word segmentation, stop-word removal, stemming, and similar processing to the document collection and the query submitted by the user, to form the query Q and the document collection D;
a candidate expansion word selection module, for computing, for each keyword in the query Q, the n terms most similar to that keyword using word vectors trained with the word2vec model, to form the candidate expansion word set C;
an expansion vocabulary construction module, for computing, for each term in the candidate expansion word set, its similarity to the whole query, and selecting the expansion words with higher similarity to construct the expansion vocabulary T;
a document collection inverted index module, for building an inverted index over the preprocessed document collection D;
a query expansion module, for computing the relevance between the expanded query and the documents in the corresponding inverted lists, to obtain the relevant documents.
The beneficial effects of the invention are as follows: a semantic query expansion method based on word2vec is proposed that considers the similarity of each expansion word to the whole query and introduces an expansion-word-oriented query vector generation method, which generates different query vectors for the expansion words of different query terms. This yields an expansion word set that is more relevant to the query and therefore a better query expansion effect.
Description of the drawings
Fig. 1 is a functional block diagram of the word2vec-based semantic query expansion of the present invention;
Fig. 2 shows the candidate expansion word set of each keyword in the query of the present invention;
Fig. 3 shows the inverted index of the present invention.
Detailed description
The invention is further described below with reference to the accompanying drawings and specific embodiments.
Embodiment 1: As shown in Figs. 1-3, a semantic query expansion method based on word2vec comprises:
Query and document preprocessing step:
(1) segment the query submitted by the user into words at whitespace and punctuation characters;
(2) after segmentation, remove stop words, filtering out words that do not represent a concept;
(3) after stop-word removal, apply stemming to generate the query Q;
(4) apply the same preprocessing to the document collection to generate the new document collection D.
Example 1: Query preprocessing. Assume the query submitted by the user is "problems associated with high speed aircraft".
(1) The query submitted by the user is first segmented; the segmented query is: {problems, associated, with, high, speed, aircraft};
(2) Stop words are removed, and the nouns in the query are then selected to form the final query: {problems, speed, aircraft};
(3) The keywords in the query are stemmed; "problems" is a plural noun, so after stemming the query keyword set is Q = {problem, speed, aircraft}.
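The three preprocessing steps of this example can be sketched as follows. The stop list and the plural-stripping rule are simplifying assumptions made only for illustration: the stop list also drops "associated" and "high", standing in for the noun selection of step (2), and a real system would use a proper part-of-speech tagger and stemmer. The names `STOP_WORDS`, `naive_stem`, and `preprocess` are hypothetical.

```python
import re

# Hypothetical stop list for this example. It also contains "associated" and
# "high" as a stand-in for the noun selection described in step (2).
STOP_WORDS = {"with", "associated", "high", "the", "a", "of", "is", "are", "and"}

def naive_stem(word):
    """Very rough plural stripping; not a full stemming algorithm."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())    # split on spaces and punctuation
    kept = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
    return [naive_stem(t) for t in kept]               # reduce each word to its stem

query = preprocess("problems associated with high speed aircraft")
print(query)
```

The same `preprocess` function could be applied to every document so that queries and documents share one vocabulary.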
Example 2: Document collection preprocessing. Assume a document collection consisting of the following four documents:
D0 = "The main problem limiting the high velocity performance of helicopter is resistance"
D1 = "high altitude and high speed flying aircraft are often more slender shape"
D2 = "There are many airplanes in the sky that make up a row"
D3 = "whether to fly today is a problem"
All words in each string are identified by splitting on spaces and separators, stop words are removed, and stemming is applied. The resulting new document collection is:
D0 = "problem, limit, velocity, performance, helicopter, resistance"
D1 = "altitude, speed, fly, aircraft, slender, shape"
D2 = "airplane, sky, row"
D3 = "fly, problem"
Candidate expansion word selection step:
(1) The Wikipedia corpus is selected, and a 200-dimensional word vector file is trained with the CBOW model provided by word2vec;
(2) After the word vectors are obtained, for each keyword in Q, the n most similar words are found by computing the cosine similarity of the word vectors and serve as the candidate expansion word set of the query.
For each keyword in the query Q = {problem, speed, aircraft}, the 10 semantically most related expansion words are selected using the trained word vectors; the resulting candidate expansion word set is shown in Fig. 2.
Expansion vocabulary T construction step:
(1) For each keyword $q_i$ in Q, generate a query vector $\mathrm{vec}(Q_{q_i})$ of Q relative to $q_i$ as follows:
In the formula, $\mathrm{vec}(q_i)$ denotes the vector of query term $q_i$, and $\mathrm{sim}(q_i, q_j)$ denotes the similarity between $q_i$ and $q_j$.
(2) For each candidate expansion word t of $q_i$, compute the similarity between t and the query Q as follows:
$$\mathrm{sim}(t, Q) = \cos\big(\mathrm{vec}(t), \mathrm{vec}(Q_{q_i})\big)$$
(3) For the expansion words of each query term, compute the similarity to the whole query Q according to the above model, then re-rank them by similarity and return the k expansion words with the highest similarity as the final expansion word set T;
(4) Generate the expanded query $Q_{exp} = Q \cup T$.
Example:
(1) The 200-dimensional word vector of each keyword in the query Q is first obtained from the trained word vectors:
vec(problem) = [0.29686138, 1.71120727, ..., -0.6585713, -1.86508703]
vec(speed) = [-2.00363445, 1.05960512, ..., -0.475373, -4.39991331]
vec(aircraft) = [-3.54158616, 3.28720021, ..., -2.34602952, -3.29022384]
The expansion-word-oriented query vector of each keyword in Q is then computed as follows:
(2) For the keyword aircraft in the query Q, i.e. $q_3$ = aircraft, compute the similarity between each expansion word t of $q_3$ and the query Q:
........
(3) Proceeding in the same way, the similarity between each expansion word in Fig. 2 and the original query Q is computed, and the expansion words in the candidate set are then ranked by similarity to obtain the k expansion words most similar to Q. Taking k = 4 as an example, the resulting expansion vocabulary T is:
T = {helicopter, airplane, velocity, altitude}
(4) The query terms and expansion words are merged to obtain the expanded query $Q_{exp}$:
$Q_{exp}$ = Q ∪ T
= {problem, speed, aircraft} ∪ {helicopter, airplane, velocity, altitude}
= {problem, speed, aircraft, helicopter, airplane, velocity, altitude}
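The merge in step (4) is a plain set union, but a ranking step that weights original terms and expansion words differently needs to know where each term came from. The sketch below keeps that origin as a flag; the function name `merge_query` and the pair representation are illustrative assumptions, not part of the patent.

```python
def merge_query(query_terms, expansion_terms):
    """Return Q ∪ T as (term, is_original) pairs, in first-seen order, without duplicates."""
    merged = []
    seen = set()
    for term in query_terms:
        if term not in seen:
            seen.add(term)
            merged.append((term, True))    # original query term
    for term in expansion_terms:
        if term not in seen:
            seen.add(term)
            merged.append((term, False))   # expansion word
    return merged

Q = ["problem", "speed", "aircraft"]
T = ["helicopter", "airplane", "velocity", "altitude"]
Q_exp = merge_query(Q, T)
print([term for term, _ in Q_exp])
```

A term that occurs in both Q and T is kept once and counted as an original query term.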
The construction of the document collection inverted index comprises the following steps:
(1) For the preprocessed document collection D, collect the distinct terms in D to generate the vocabulary V;
(2) For each term v in V, construct an inverted list consisting of the ID ($d_{id}$) of every document d (where $d \in D$) that contains v, together with the number of occurrences $tf_{v,d}$ of v in d. Each entry of the list is a pair $\langle d_{id}, tf_{v,d} \rangle$, and the set of all inverted lists constitutes the inverted index I;
(3) For each term v, count the number m of documents in which it occurs, and compute the idf score of v according to the following formula:
where $|D|$ denotes the total number of documents in D.
Example:
(1) After preprocessing such as word segmentation and stop-word removal, the following document collection D is obtained:
D0 = "problem, limit, velocity, performance, helicopter, resistance"
D1 = "altitude, speed, fly, aircraft, slender, shape"
D2 = "airplane, sky, row"
D3 = "fly, problem"
The distinct terms in D are collected to generate the vocabulary V:
V = {altitude, speed, fly, aircraft, slender, shape, problem, limit, velocity, performance, helicopter, resistance, airplane, sky, row}
(2) Taking the word velocity in V as an example, traversing the document collection D finds that the only document containing velocity is D0. Its ID, D0, is recorded, and velocity occurs once in D0, so the inverted list of velocity is <D0, 1>. The inverted lists of all terms in V are built in the same way, and together they constitute the inverted index I;
(3) For each word v in V, count the number m of documents in which it occurs (i.e. the length of the inverted list of v) and compute the idf score:
For example, for v = velocity the inverted list has length 1, i.e. only one document in the collection contains velocity, so m = 1, and the idf score of velocity is computed as:
The idf scores of all words are computed in this way and recorded in the index; the final inverted index I is shown in Fig. 3.
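The index construction for the four preprocessed documents can be sketched as below, using the document numbering of the listing above (D0 is the document containing velocity). The idf formula is not reproduced in this text, so the code assumes the common variant idf(v) = log(|D| / m) with a natural logarithm; the patent's own formula may differ. The function names are illustrative.

```python
import math
from collections import Counter

docs = {
    "D0": ["problem", "limit", "velocity", "performance", "helicopter", "resistance"],
    "D1": ["altitude", "speed", "fly", "aircraft", "slender", "shape"],
    "D2": ["airplane", "sky", "row"],
    "D3": ["fly", "problem"],
}

def build_inverted_index(docs):
    """Map each term to its inverted list [(doc_id, tf), ...]."""
    index = {}
    for doc_id, terms in docs.items():
        for term, tf in Counter(terms).items():
            index.setdefault(term, []).append((doc_id, tf))
    return index

def idf(term, index, num_docs):
    """ASSUMED idf variant: log(|D| / m), where m is the inverted list length."""
    return math.log(num_docs / len(index[term]))

index = build_inverted_index(docs)
print(index["velocity"])                 # inverted list of "velocity"
print(index["problem"])                  # inverted list of "problem"
print(idf("velocity", index, len(docs)))
```

Storing the idf value next to each inverted list, as the text suggests, avoids recomputing it at query time.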
Query expansion step:
(1) For each keyword in $Q_{exp}$, look it up in the inverted index I to obtain its inverted list, and denote the set of these inverted lists by $I_{Q_{exp}}$;
(2) For each document d appearing in $I_{Q_{exp}}$, accumulate its tf-idf scores over the lists in $I_{Q_{exp}}$ to obtain the relevance $R(Q_{exp}, d)$ between $Q_{exp}$ and document d, computed by the following formula:
In the formula, λ is a tuning parameter that controls the weights of the query terms and the expansion words when computing the relevance.
(3) Rank these documents by relevance and return the N documents most relevant to the original query.
Example:
(1) For the $Q_{exp}$ generated above, look up the inverted index of Fig. 3, obtain the inverted lists of all keywords in $Q_{exp}$, and take the union of their documents, $I_{Q_{exp}}$:
$I_{Q_{exp}}$ = I(problem) ∪ I(speed) ∪ ... ∪ I(airplane) ∪ I(altitude)
= {D0, D3} ∪ {D1} ∪ ... ∪ {D2} ∪ {D1}
= {D0, D1, D2, D3}
(2) For documents D0, D1, D2, and D3, compute the relevance $R(Q_{exp}, d)$ between $Q_{exp}$ and each document, with the tuning parameter λ = 0.6; the calculation is as follows:
(3) Rank these documents by relevance, giving D1 > D0 > D2 > D3; if N = 3, documents D1, D0, and D2 are returned.
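The ranking step can be sketched on the example data as follows. The relevance formula is not reproduced in this text, so the code assumes one plausible reading: tf-idf contributions of original query terms are weighted by λ, those of expansion words by 1 - λ, and idf(v) = log(|D| / m) with a natural logarithm. These choices and the function name `rank` are assumptions, not the patent's stated formula. Under them, the resulting order agrees with the one given in the example.

```python
import math
from collections import Counter

docs = {
    "D0": ["problem", "limit", "velocity", "performance", "helicopter", "resistance"],
    "D1": ["altitude", "speed", "fly", "aircraft", "slender", "shape"],
    "D2": ["airplane", "sky", "row"],
    "D3": ["fly", "problem"],
}
Q = ["problem", "speed", "aircraft"]
T = ["helicopter", "airplane", "velocity", "altitude"]

def rank(docs, query_terms, expansion_terms, lam, top_n):
    """Rank documents by an ASSUMED lambda-weighted tf-idf sum over Q and T."""
    num_docs = len(docs)
    tf = {doc_id: Counter(terms) for doc_id, terms in docs.items()}
    df = Counter(term for terms in docs.values() for term in set(terms))
    scores = {}
    for doc_id in docs:
        score = 0.0
        for term in query_terms + expansion_terms:
            if tf[doc_id][term] == 0:
                continue  # term does not occur in this document
            weight = lam if term in query_terms else 1.0 - lam
            score += weight * tf[doc_id][term] * math.log(num_docs / df[term])
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

ranking = rank(docs, Q, T, lam=0.6, top_n=3)
print(ranking)
```

Raising λ toward 1 makes the ranking depend almost entirely on the original query terms, while lowering it gives the expansion words more influence.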
Embodiment 2: A semantic query expansion device based on word2vec, comprising:
a query and document collection preprocessing module, for applying word segmentation, stop-word removal, stemming, and similar processing to the document collection and the query submitted by the user, to form the query Q and the document collection D;
a candidate expansion word selection module, for computing, for each keyword in the query Q, the n terms most similar to that keyword using word vectors trained with the word2vec model, to form the candidate expansion word set C;
an expansion vocabulary construction module, for computing, for each term in the candidate expansion word set, its similarity to the whole query, and selecting the expansion words with higher similarity to construct the expansion vocabulary T;
a document collection inverted index module, for building an inverted index over the preprocessed document collection D;
a query expansion module, for computing the relevance between the expanded query and the documents in the corresponding inverted lists, to obtain the relevant documents.
The specific embodiments of the present invention have been described in detail above with reference to the drawings, but the invention is not limited to these embodiments; within the knowledge of a person of ordinary skill in the art, various changes can be made without departing from the concept of the invention.

Claims (7)

1. A semantic query expansion method based on word2vec, characterized in that the method comprises the following steps:
(1) query and document preprocessing: segment the query submitted by the user into words, remove stop words, extract the keywords of the user's query, and apply stemming to form the query Q; apply the same preprocessing to the document collection to obtain the document collection D;
(2) candidate expansion word selection: for the preprocessed query Q, use word vectors trained with the word2vec model to compute the n terms most similar to each query keyword, which form the candidate expansion word set C;
(3) expansion vocabulary construction: for each term in C, compute its similarity to the whole query, and select the k expansion words with the highest similarity to construct the expansion vocabulary T;
(4) inverted index construction: build an inverted index over the preprocessed document collection D;
(5) query expansion: compute the relevance between the expanded query and the documents in the corresponding inverted lists, and rank the documents by relevance.
2. The semantic query expansion method based on word2vec according to claim 1, characterized in that the query and document preprocessing step specifically comprises the following steps:
(1) segment the query submitted by the user into words at whitespace and punctuation characters;
(2) after segmentation, remove stop words, filtering out words that do not represent a concept;
(3) after stop-word removal, apply stemming to generate the query Q;
(4) apply the same preprocessing to the document collection to generate the new document collection D.
3. The semantic query expansion method based on word2vec according to claim 1, characterized in that the candidate expansion word selection step specifically comprises the following steps:
(1) given a corpus, train word vectors with the training model provided by word2vec, a word vector being a multidimensional real-valued vector that reflects semantic and syntactic relations in natural language, so that the similarity between terms can be judged by computing the cosine between their word vectors;
(2) after the word vectors are obtained, for each keyword $q_i$ in Q, use the cosine similarity of the word vectors to find the n words most similar to $q_i$, which form the candidate expansion word set of the query.
4. The semantic query expansion method based on word2vec according to claim 1, characterized in that the expansion vocabulary construction step specifically comprises the following steps:
(1) for the query Q formed by the above processing, generate for each keyword $q_i$ in Q a query vector $\mathrm{vec}(Q_{q_i})$ of Q relative to $q_i$ as follows:
In the formula, $\mathrm{vec}(q_i)$ denotes the vector of query term $q_i$, and $\mathrm{sim}(q_i, q_j)$ denotes the similarity between $q_i$ and $q_j$.
(2) for each candidate expansion word t of $q_i$, compute the similarity between t and the query Q as follows:
$$\mathrm{sim}(t, Q) = \cos\big(\mathrm{vec}(t), \mathrm{vec}(Q_{q_i})\big)$$
For the candidate expansion words of different query terms, different query vectors $\mathrm{vec}(Q_{q_i})$ are used to compute the similarity between the expansion word and the query Q; the method of generating $\mathrm{vec}(Q_{q_i})$ is called the expansion-word-oriented query vector generation method, and correspondingly $\mathrm{vec}(Q_{q_i})$ is called the expansion-word-oriented query vector;
(3) for the expansion words of each query term, compute the similarity to the whole query Q according to the above model, then re-rank the expansion words by similarity and return the k expansion words with the highest similarity as the final expansion word set T;
(4) generate the expanded query $Q_{exp} = Q \cup T$.
5. The semantic query expansion method based on word2vec according to claim 1, characterized in that the inverted index construction specifically comprises the following steps:
(1) for the preprocessed document collection D, collect all words of D and remove duplicates to generate the document vocabulary V;
(2) for each term v in V, construct an inverted list consisting of the ID ($d_{id}$) of every document d (where $d \in D$) that contains v, together with the number of occurrences $tf_{v,d}$ of v in d; each entry of the list is a pair $\langle d_{id}, tf_{v,d} \rangle$, and the set of all inverted lists constitutes the inverted index I;
(3) for each term v, count the number m of documents in which it occurs, and compute the idf score of v according to the following formula:
where $|D|$ denotes the total number of documents in D.
6. The semantic query expansion method based on word2vec according to claim 1, characterized in that the query expansion specifically comprises the following steps:
(1) for each keyword in $Q_{exp}$, look it up in the inverted index I to obtain its inverted list, and denote the set of these inverted lists by $I_{Q_{exp}}$;
(2) for each document d appearing in $I_{Q_{exp}}$, accumulate its tf-idf scores over the lists in $I_{Q_{exp}}$ to obtain the relevance $R(Q_{exp}, d)$ between $Q_{exp}$ and document d, computed by the following formula:
In the formula, λ is a tuning parameter that controls the weights of the query terms and the expansion words when computing the relevance.
(3) rank these documents by relevance and return the N documents most relevant to the original query.
7. A word2vec-based semantic query expansion device, characterized by comprising:
A query and document set preprocessing module, for performing word segmentation, stop-word removal, stemming and similar processing on the document set and on the query submitted by the user, forming the query Q and the document set D;
An expansion word candidate set selection module, for computing, for each keyword in the query Q, the n terms most similar to that keyword using word vectors trained with the word2vec model, forming the expansion word candidate set C;
An expansion vocabulary construction module, for computing, for each term in the expansion word candidate set, its similarity to the query as a whole, and selecting the expansion words with higher similarity to construct the expansion vocabulary T;
A document set inverted index module, for building an inverted index over the preprocessed document set D;
A query expansion module, for calculating the relevance between the expanded query and the documents in the corresponding inverted index, and obtaining the relevant documents.
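The candidate set selection and expansion vocabulary modules can be sketched as follows. The claim does not state how similarity to the whole query is computed, so this sketch assumes cosine similarity against the mean of the query word vectors. The vectors are hand-written stand-ins for word2vec output, and the names `cosine` and `expansion_terms` and the parameter `k` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expansion_terms(query, vectors, n=2, k=2):
    """Return up to k expansion words for the query keywords.

    Step 1 (candidate set C): the n nearest neighbours of each keyword.
    Step 2 (expansion vocabulary T): the k candidates most similar to the
    whole query. ASSUMPTION: the whole query is represented by the mean of
    its keyword vectors.
    """
    candidates = set()
    for q in query:
        neighbours = sorted(
            (w for w in vectors if w not in query),
            key=lambda w: cosine(vectors[q], vectors[w]),
            reverse=True,
        )
        candidates.update(neighbours[:n])

    dim = len(next(iter(vectors.values())))
    centroid = [sum(vectors[q][i] for q in query) / len(query) for i in range(dim)]
    ranked = sorted(candidates, key=lambda w: cosine(centroid, vectors[w]), reverse=True)
    return ranked[:k]


# Toy 2-dimensional vectors standing in for trained word2vec embeddings.
vectors = {
    "car": [1.0, 0.0],
    "engine": [0.8, 0.6],
    "automobile": [0.9, 0.1],
    "motor": [0.7, 0.7],
    "banana": [0.0, 1.0],
}
table = expansion_terms(["car", "engine"], vectors, n=2, k=2)
```

Filtering candidates against the whole query rather than a single keyword is what keeps a word that is close to one keyword but unrelated to the rest of the query out of T.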
CN201810179478.3A 2018-03-05 2018-03-05 Semantic query expansion method and device based on word2vec Active CN108491462B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810179478.3A CN108491462B (en) 2018-03-05 2018-03-05 Semantic query expansion method and device based on word2vec

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810179478.3A CN108491462B (en) 2018-03-05 2018-03-05 Semantic query expansion method and device based on word2vec

Publications (2)

Publication Number Publication Date
CN108491462A true CN108491462A (en) 2018-09-04
CN108491462B CN108491462B (en) 2021-09-14

Family

ID=63341204

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810179478.3A Active CN108491462B (en) 2018-03-05 2018-03-05 Semantic query expansion method and device based on word2vec

Country Status (1)

Country Link
CN (1) CN108491462B (en)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109063203A (en) * 2018-09-14 2018-12-21 河海大学 A kind of query word extended method based on personalized model
CN109446399A (en) * 2018-10-16 2019-03-08 北京信息科技大学 A kind of video display entity search method
CN109885766A (en) * 2019-02-11 2019-06-14 武汉理工大学 A kind of books recommended method and system based on book review
CN110008407A (en) * 2019-04-09 2019-07-12 苏州浪潮智能科技有限公司 A kind of information retrieval method and device
CN110188204A (en) * 2019-06-11 2019-08-30 腾讯科技(深圳)有限公司 A kind of extension corpora mining method, apparatus, server and storage medium
CN110196977A (en) * 2019-05-31 2019-09-03 广西南宁市博睿通软件技术有限公司 A kind of intelligence alert inspection processing system and method
CN110489526A (en) * 2019-08-13 2019-11-22 上海市儿童医院 A kind of term extended method, device and storage medium for medical retrieval
CN110909116A (en) * 2019-11-28 2020-03-24 中国人民解放军军事科学院军事科学信息研究中心 Entity set expansion method and system for social media
WO2020062770A1 (en) * 2018-09-27 2020-04-02 深圳大学 Method and apparatus for constructing domain dictionary, and device and storage medium
CN111897928A (en) * 2020-08-04 2020-11-06 广西财经学院 Chinese query expansion method for embedding expansion words into query words and counting expansion word union
CN112199461A (en) * 2020-09-17 2021-01-08 暨南大学 Document retrieval method, device, medium and equipment based on block index structure
WO2021032824A1 (en) * 2019-08-20 2021-02-25 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e. V. Method and device for pre-selecting and determining similar documents
CN112836008A (en) * 2021-02-07 2021-05-25 中国科学院新疆理化技术研究所 Index establishing method based on decentralized storage data
CN112949304A (en) * 2021-03-24 2021-06-11 中新国际联合研究院 Construction case knowledge reuse query method and device
CN113033197A (en) * 2021-03-24 2021-06-25 中新国际联合研究院 Building construction contract rule query method and device
CN113486067A (en) * 2021-07-16 2021-10-08 用友网络科技股份有限公司 Information query method, system and readable storage medium
CN114661852A (en) * 2020-12-23 2022-06-24 深圳市万普拉斯科技有限公司 Text searching method, terminal and readable storage medium
CN114723008A (en) * 2022-04-01 2022-07-08 北京健康之家科技有限公司 Language representation model training method, device, equipment, medium and user response method
CN116340470A (en) * 2023-05-30 2023-06-27 环球数科集团有限公司 Keyword associated retrieval system based on AIGC

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104765769A (en) * 2015-03-06 2015-07-08 大连理工大学 Short text query expansion and indexing method based on word vector
CN104778161A (en) * 2015-04-30 2015-07-15 车智互联(北京)科技有限公司 Keyword extracting method based on Word2Vec and Query log
CN106156272A (en) * 2016-06-21 2016-11-23 北京工业大学 A kind of information retrieval method based on multi-source semantic analysis
US9798820B1 (en) * 2016-10-28 2017-10-24 Searchmetrics Gmbh Classification of keywords
CN107391671A (en) * 2017-07-21 2017-11-24 华中科技大学 A kind of document leakage detection method and system
US20180004815A1 (en) * 2015-12-01 2018-01-04 Huawei Technologies Co., Ltd. Stop word identification method and apparatus

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104765769A (en) * 2015-03-06 2015-07-08 大连理工大学 Short text query expansion and indexing method based on word vector
CN104778161A (en) * 2015-04-30 2015-07-15 车智互联(北京)科技有限公司 Keyword extracting method based on Word2Vec and Query log
US20180004815A1 (en) * 2015-12-01 2018-01-04 Huawei Technologies Co., Ltd. Stop word identification method and apparatus
CN106156272A (en) * 2016-06-21 2016-11-23 北京工业大学 A kind of information retrieval method based on multi-source semantic analysis
US9798820B1 (en) * 2016-10-28 2017-10-24 Searchmetrics Gmbh Classification of keywords
CN107391671A (en) * 2017-07-21 2017-11-24 华中科技大学 A kind of document leakage detection method and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ZHANG LIFENG等: "Behavior Targeting Based on Hierarchical Taxonomy Aggregation for Heterogeneous Online Shopping Applications", 《ZTE COMMUNICATIONS》 *
徐康: "基于用户兴趣模型的个性化搜索排序研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
许侃等: "专利查询扩展的词向量方法研究", 《计算机科学与探索》 *

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109063203A (en) * 2018-09-14 2018-12-21 河海大学 A kind of query word extended method based on personalized model
CN109063203B (en) * 2018-09-14 2020-07-24 河海大学 Query term expansion method based on personalized model
WO2020062770A1 (en) * 2018-09-27 2020-04-02 深圳大学 Method and apparatus for constructing domain dictionary, and device and storage medium
CN109446399A (en) * 2018-10-16 2019-03-08 北京信息科技大学 A kind of video display entity search method
CN109885766A (en) * 2019-02-11 2019-06-14 武汉理工大学 A kind of books recommended method and system based on book review
CN110008407A (en) * 2019-04-09 2019-07-12 苏州浪潮智能科技有限公司 A kind of information retrieval method and device
CN110008407B (en) * 2019-04-09 2021-05-04 苏州浪潮智能科技有限公司 Information retrieval method and device
CN110196977A (en) * 2019-05-31 2019-09-03 广西南宁市博睿通软件技术有限公司 A kind of intelligence alert inspection processing system and method
CN110196977B (en) * 2019-05-31 2023-06-09 广西南宁市博睿通软件技术有限公司 Intelligent warning condition supervision processing system and method
CN110188204A (en) * 2019-06-11 2019-08-30 腾讯科技(深圳)有限公司 A kind of extension corpora mining method, apparatus, server and storage medium
CN110188204B (en) * 2019-06-11 2022-10-04 腾讯科技(深圳)有限公司 Extended corpus mining method and device, server and storage medium
CN110489526A (en) * 2019-08-13 2019-11-22 上海市儿童医院 A kind of term extended method, device and storage medium for medical retrieval
WO2021032824A1 (en) * 2019-08-20 2021-02-25 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e. V. Method and device for pre-selecting and determining similar documents
CN110909116A (en) * 2019-11-28 2020-03-24 中国人民解放军军事科学院军事科学信息研究中心 Entity set expansion method and system for social media
CN110909116B (en) * 2019-11-28 2022-12-23 中国人民解放军军事科学院军事科学信息研究中心 Entity set expansion method and system for social media
CN111897928A (en) * 2020-08-04 2020-11-06 广西财经学院 Chinese query expansion method for embedding expansion words into query words and counting expansion word union
CN112199461B (en) * 2020-09-17 2022-05-31 暨南大学 Document retrieval method, device, medium and equipment based on block index structure
CN112199461A (en) * 2020-09-17 2021-01-08 暨南大学 Document retrieval method, device, medium and equipment based on block index structure
CN114661852A (en) * 2020-12-23 2022-06-24 深圳市万普拉斯科技有限公司 Text searching method, terminal and readable storage medium
CN112836008B (en) * 2021-02-07 2023-03-21 中国科学院新疆理化技术研究所 Index establishing method based on decentralized storage data
CN112836008A (en) * 2021-02-07 2021-05-25 中国科学院新疆理化技术研究所 Index establishing method based on decentralized storage data
CN113033197A (en) * 2021-03-24 2021-06-25 中新国际联合研究院 Building construction contract rule query method and device
CN112949304A (en) * 2021-03-24 2021-06-11 中新国际联合研究院 Construction case knowledge reuse query method and device
CN113486067A (en) * 2021-07-16 2021-10-08 用友网络科技股份有限公司 Information query method, system and readable storage medium
CN114723008A (en) * 2022-04-01 2022-07-08 北京健康之家科技有限公司 Language representation model training method, device, equipment, medium and user response method
CN116340470A (en) * 2023-05-30 2023-06-27 环球数科集团有限公司 Keyword associated retrieval system based on AIGC
CN116340470B (en) * 2023-05-30 2023-09-15 环球数科集团有限公司 Keyword associated retrieval system based on AIGC

Also Published As

Publication number Publication date
CN108491462B (en) 2021-09-14

Similar Documents

Publication Publication Date Title
CN108491462A (en) A kind of semantic query expansion method and device based on word2vec
Yin et al. Ranking relevance in yahoo search
Carpineto et al. A survey of automatic query expansion in information retrieval
Moawad et al. Bi-gram term collocations-based query expansion approach for improving Arabic information retrieval
Yusuf et al. Query expansion method for quran search using semantic search and lucene ranking
Madnani et al. Multiple alternative sentence compressions for automatic text summarization
El Mahdaouy et al. Semantically enhanced term frequency based on word embeddings for Arabic information retrieval
Grineva et al. Blognoon: Exploring a topic in the blogosphere
Kanwal et al. Adaptively intelligent meta-search engine with minimum edit distance
Pasca Open-domain fine-grained class extraction from web search queries
CN113642325A (en) Text keyword extraction method fusing text structure information and semantic information
Artese et al. What is this painting about? Experiments on Unsupervised Keyphrases Extraction algorithms
Gulati et al. Ontology driven query expansion for better image retrieval
Manjula et al. Semantic search engine
Wang et al. Exploiting semantic knowledge base for patent retrieval
CN106708808B (en) Information mining method and device
Liu et al. A query suggestion method based on random walk and topic concepts
Nwesri et al. Applying Arabic stemming using query expansion
CN114186075B (en) Semantic search method for knowledge graph in cultural domain
Martínez et al. Evaluation of MIRACLE approach results for CLEF 2003
Qin et al. Expansion model of semantic query based on ontology
Reddy et al. Cross lingual information retrieval using search engine and data mining
Khalid et al. BERT-embedding and citation network analysis based query expansion technique for scholarly search
Larson Experiments in classification clustering and thesaurus expansion for domain specific cross-language retrieval
Ramakrishna et al. Information retrieval in Telugu language using synset relationships

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant