CN108491462A - A kind of semantic query expansion method and device based on word2vec - Google Patents
- Publication number
- CN108491462A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a word2vec-based semantic query expansion method and device, belonging to the technical field of information retrieval. The method comprises: a preprocessing step for the user's query, in which the query is tokenized, stop words are removed, and stemming is applied; an expansion-word candidate selection step, in which initial expansion words are chosen using the word2vec tool; an expansion vocabulary construction step, in which the candidate set is filtered to build the final expansion vocabulary; and a query expansion step, in which the user's query and its expansion words are matched against the index, and the relevant documents are returned in ranked order. The invention proposes an expansion-word-oriented query vector generation method for filtering candidate expansion words and building the expansion vocabulary, so that the selected expansion words better reflect their relevance to the query as a whole, thereby improving the effectiveness of query expansion.
Description
Technical field
The present invention relates to a word2vec-based semantic query expansion method and device, and belongs to the technical field of information retrieval.
Background art
Query expansion is an important problem in the field of information retrieval. In current information retrieval models and systems, information is stored in the form of characters, words, or phrases. After a user submits a query, a relevant document can be retrieved only if the query terms appear in that document. In natural language, however, the same concept often has many different expressions. For example, when searching for automobile without expansion, documents containing car, sedan, Ford, and so on are highly relevant to the user's original query but cannot be retrieved because the words differ, leaving the user without satisfactory results. Because of this term mismatch problem, users sometimes have to change their query terms before they can find the information they need. To reduce this burden, the information retrieval system should automatically select other words related to the query to assist retrieval; that is, query expansion is used to address the term mismatch problem.
When a user submits a query, search engines usually include query expansion as an essential module in order to improve the user's satisfaction with the results. The commonly used query expansion methods fall mainly into the following categories:
1. Query expansion based on a semantic knowledge dictionary:
Methods of this kind rely on semantic knowledge dictionaries such as WordNet, HowNet, or other Chinese thesauri, and select words that have some semantic relation to the query terms for expansion. They are usually based on the hypernyms, hyponyms, and synonyms of the query terms. Such methods depend heavily on a complete semantic system and are independent of the corpus to be searched, so the selected expansion words generally fail to reflect the characteristics of the corpus, and good retrieval results are hard to obtain.
2. Query expansion based on global analysis:
Global analysis first performs a correlation analysis on the words or phrases in all documents and computes the degree of association between each pair of words. The words most strongly associated with the query terms are then added to the initial query to generate a new query. The advantage of this method is that it exploits the relationships between words to the fullest, and once the dictionary has been built, query expansion can be carried out efficiently. Unfortunately, when the document collection is very large, building the complete word-relationship dictionary is often infeasible in both time and space, and the cost of updating it when the collection changes is even greater.
3. Query expansion based on local analysis:
Local analysis addresses the expansion problem mainly through a two-pass retrieval. The initial query is first used directly for retrieval, and the n documents most relevant to the original query are taken as the source of expansion words. The words in these n documents that are most relevant to the original query are then added to the initial query to form a new query. The most popular local-analysis method at present is pseudo-relevance feedback, which developed from relevance feedback. The two differ in that relevance feedback requires the user to judge the results of the initial retrieval, and the documents the user deems relevant serve as the source of expansion words, whereas pseudo-relevance feedback requires no interaction with the user and simply treats the top n returned documents as relevant. Although local analysis is currently the most widely used query expansion approach, its drawback is that when the top-ranked documents from the initial retrieval have little relevance to the original query, many irrelevant words are easily added to the query, causing the "query drift" problem.
With the introduction of semantic models such as Word2Vec and GloVe, word embedding techniques have in recent years attracted the attention of many researchers across multiple areas of natural language processing. The word vectors trained with the models provided by word2vec and GloVe reflect semantic and syntactic relationships in natural language, and the similarity between terms can be judged by computing the cosine between their word vectors, so they are well suited to query expansion.
There is already a considerable body of research on Word2Vec-based query expansion, but most of this work has the following two main shortcomings:
(1) When building the expansion vocabulary, only words related to individual query terms are selected as expansion words, without considering their relevance to the query as a whole.
(2) Work that does consider relevance to the whole query mostly assumes that the query vector is fixed for all expansion words, so the query vector is usually a simple sum or mean of the query term vectors.
In general, however, for an expansion word of a query term q, the influence of the other query terms on that expansion word should not be equal to the influence of q itself. The idea of generating different query vectors centered on different words in the query has been widely applied in other word-embedding-based semantic tasks such as disambiguation and has achieved good results, but it has not yet been applied effectively to query expansion.
Summary of the invention
The technical problem to be solved by the present invention is to provide a word2vec-based semantic query expansion method and device, with the aim of building an expansion vocabulary that is more relevant to the query, so that documents relevant to the user's query are returned more comprehensively.
The technical solution of the present invention is a word2vec-based semantic query expansion method, comprising:
Query and document preprocessing step: the query submitted by the user is tokenized, stop words are removed, and the keywords of the user's query are extracted and stemmed to form the query Q; the same preprocessing is applied to the document collection to obtain the document set D;
Expansion-word candidate selection step: for the preprocessed query Q, word vectors trained with the word2vec model are used to compute the n most similar terms for each query keyword, which together form the expansion-word candidate set C;
Expansion vocabulary construction step: for each term in C, its similarity to the whole query is computed, and the k expansion words with the highest similarity are selected to construct the expansion vocabulary T;
Document inverted-index construction step: an inverted index is built over the preprocessed document set D;
Query expansion step: the relevance between the expanded query and the documents in the corresponding inverted lists is computed, and the documents are ranked by relevance.
The query and document preprocessing step specifically comprises the following steps:
(1) The query submitted by the user is tokenized on whitespace and punctuation;
(2) After tokenization, stop words are removed, filtering out words that do not represent a concept;
(3) After stop-word removal, stemming is applied to generate the query Q;
(4) The same preprocessing is applied to the document collection to generate the new document set D.
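The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the stop-word list and the plural-stripping `stem` function are hypothetical stand-ins (a real system would use a full stop list and a proper stemmer), and the noun selection used in the patent's worked example is not reproduced here.

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption for illustration).
STOP_WORDS = {"a", "an", "the", "with", "of", "is", "are", "and", "in", "to"}

def stem(word):
    """Naive stand-in for stemming: strip a trailing plural 's' only."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def preprocess(text):
    """Tokenize on non-alphanumeric characters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(token) for token in tokens if token not in STOP_WORDS]

query_terms = preprocess("problems associated with high speed aircraft")
```

With this simplified pipeline, words such as "associated" and "high" survive, because only stop-word removal and plural stripping are applied; the same `preprocess` function would be applied to every document in the collection.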
The expansion-word candidate selection step specifically comprises the following steps:
(1) Given a corpus, word vectors are trained with the training model provided by word2vec. A word vector is a multi-dimensional real-valued vector that reflects semantic and syntactic relationships in natural language, so the similarity between terms can be judged by computing the cosine between their word vectors;
(2) Once the word vectors have been obtained, for each keyword q_i in Q, the n words most similar to q_i are found by computing the cosine similarity of the word vectors, and these form the expansion-word candidate set of the query.
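As a rough sketch of the candidate selection just described, the following code picks the n nearest words by cosine similarity. The two-dimensional vectors are invented purely for illustration (real word2vec vectors would come from a trained model), and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_n_similar(word, vectors, n):
    """Return the n vocabulary words closest to `word` by cosine similarity."""
    scored = [(cosine(vectors[word], vec), other)
              for other, vec in vectors.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:n]]

def candidate_set(query, vectors, n):
    """Map each query keyword q_i to its n most similar words."""
    return {q: top_n_similar(q, vectors, n) for q in query if q in vectors}

# Toy vectors, invented for illustration only.
TOY_VECTORS = {
    "aircraft": [1.0, 0.0],
    "airplane": [0.9, 0.1],
    "speed": [0.0, 1.0],
    "velocity": [0.1, 0.9],
}

candidates = candidate_set(["aircraft", "speed"], TOY_VECTORS, n=1)
```

Keeping the candidates grouped by the query keyword that produced them matters for the next step, where each group is scored against a different query vector.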
The expansion vocabulary construction step specifically comprises the following steps:
(1) For the query Q formed by the above processing, for each keyword q_i in Q, a query vector vec(Q_{q_i}) of Q relative to q_i is generated according to a formula in which vec(q_i) denotes the vector of query term q_i and sim(q_i, q_j) denotes the similarity between q_i and q_j.
(2) For each candidate expansion word t of q_i, the similarity between t and the query Q is computed as follows:
sim(t, Q) = cos(vec(t), vec(Q_{q_i}))
For the candidate expansion words of different query terms, different query vectors vec(Q_{q_i}) are used to compute the similarity between the expansion word and the query Q. The present invention therefore calls the method of generating vec(Q_{q_i}) the expansion-word-oriented query vector generation method, and correspondingly vec(Q_{q_i}) is also called the expansion-word-oriented query vector;
(3) The similarity of each query term's expansion words to the whole query Q is computed according to the above model; the expansion words are then re-ranked by similarity, and the k expansion words with the highest similarity are returned as the final expansion word set T;
(4) The expanded query Q_exp = Q ∪ T is generated.
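The exact formula for the expansion-word-oriented query vector is not reproduced in this text, so the sketch below assumes one plausible form: each query term vector vec(q_j) is weighted by its similarity sim(q_i, q_j) to the keyword q_i and the weighted vectors are summed. That weighting, the rule for a candidate proposed by several keywords (keep its best score), and all function names are assumptions for illustration rather than the patent's definition. The scoring step sim(t, Q) = cos(vec(t), vec(Q_{q_i})) follows the text.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def query_vector(q_i, query, vectors):
    """ASSUMED form: vec(Q_{q_i}) = sum over j of sim(q_i, q_j) * vec(q_j)."""
    dim = len(vectors[q_i])
    result = [0.0] * dim
    for q_j in query:
        weight = cosine(vectors[q_i], vectors[q_j])
        for d in range(dim):
            result[d] += weight * vectors[q_j][d]
    return result

def build_expansion_table(query, candidates, vectors, k):
    """Score each candidate t of q_i by cos(vec(t), vec(Q_{q_i})); keep the top k."""
    best = {}
    for q_i, words in candidates.items():
        q_vec = query_vector(q_i, query, vectors)
        for t in words:
            if t in query:
                continue  # a query term is not its own expansion word
            score = cosine(vectors[t], q_vec)
            # Assumption: a candidate proposed by several keywords keeps its best score.
            best[t] = max(score, best.get(t, float("-inf")))
    return sorted(best, key=lambda t: best[t], reverse=True)[:k]

def expand_query(query, expansion_words):
    """Q_exp = Q ∪ T, preserving order and dropping duplicates."""
    return list(dict.fromkeys(list(query) + list(expansion_words)))
```

Because the query vector depends on q_i, the candidates of "aircraft" and the candidates of "speed" are compared against different vectors, which is the point of the expansion-word-oriented approach; a single summed or averaged query vector would score all candidates against the same target.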
The document inverted-index construction step specifically comprises the following steps:
(1) For the preprocessed document set D, all words in D are collected and deduplicated to generate the document vocabulary V;
(2) For each term v in V, an inverted list is constructed consisting of the ID d_id of every document d (d ∈ D) that contains v together with the number of occurrences tf_{v,d} of v in d. Each entry in the list is a pair <d_id, tf_{v,d}>, and the set of all inverted lists constitutes the inverted index I;
(3) For each term v, the number m of documents in which it occurs is counted, and the idf score of v is computed from m and |D|, where |D| denotes the total number of documents in D.
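The index construction above can be sketched as follows, using the four preprocessed documents as listed in the patent's Example 2. The idf formula itself is not reproduced in this text; the code assumes the common form idf(v) = log(|D| / m), which may differ from the variant the patent uses. Function names are hypothetical.

```python
import math
from collections import Counter

def build_inverted_index(docs):
    """Map each term v to its inverted list of (d_id, tf_{v,d}) pairs."""
    index = {}
    for doc_id, terms in docs.items():
        for term, tf in Counter(terms).items():
            index.setdefault(term, []).append((doc_id, tf))
    return index

def idf_scores(index, num_docs):
    """ASSUMED idf variant: log(|D| / m), where m is the inverted list length."""
    return {term: math.log(num_docs / len(postings))
            for term, postings in index.items()}

# Preprocessed documents as listed in Example 2 of the description.
DOCS = {
    "D0": ["problem", "limit", "velocity", "performance", "helicopter", "resistance"],
    "D1": ["altitude", "speed", "fly", "aircraft", "slender", "shape"],
    "D2": ["airplane", "sky", "row"],
    "D3": ["fly", "problem"],
}

index = build_inverted_index(DOCS)
idf = idf_scores(index, len(DOCS))
```

Storing the term frequency inside each posting means the ranking step never has to re-read the documents: a tf-idf contribution is just the posting's tf multiplied by the term's idf.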
The query expansion step specifically comprises the following steps:
(1) For each keyword in Q_exp, the inverted index I is queried to obtain the inverted list of that keyword; the set of these inverted lists is denoted I_{Q_exp};
(2) For each document d appearing in I_{Q_exp}, its tf-idf scores in the lists of I_{Q_exp} are accumulated to obtain the relevance R(Q_exp, d) between Q_exp and document d. In the formula for R(Q_exp, d), λ is a tuning parameter that controls the relative weights of query terms and expansion words when the relevance is computed.
(3) The documents are ranked by relevance, and the N documents most relevant to the original query are returned.
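The ranking step can be illustrated with the sketch below. The formula for R(Q_exp, d) is not reproduced in this text, so the code assumes a simple linear interpolation: tf-idf contributions from original query terms are weighted by λ and those from expansion words by 1 − λ. This weighting, the tie-breaking by document ID, and the toy index are assumptions for illustration.

```python
def rank_documents(index, idf, query, expansion, lam, top_n):
    """ASSUMED scoring:
    R(Q_exp, d) = lam * sum_{q in Q} tf*idf + (1 - lam) * sum_{t in T} tf*idf.
    """
    scores = {}
    for terms, weight in ((query, lam), (expansion, 1.0 - lam)):
        for term in terms:
            # Terms missing from the index simply contribute nothing.
            for doc_id, tf in index.get(term, []):
                scores[doc_id] = scores.get(doc_id, 0.0) + weight * tf * idf[term]
    # Highest relevance first; ties broken by document ID for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]

# Toy index and idf values, invented for illustration only.
TOY_INDEX = {"speed": [("A", 2)], "velocity": [("B", 1)], "sky": [("C", 1)]}
TOY_IDF = {"speed": 1.0, "velocity": 1.0, "sky": 1.0}

top = rank_documents(TOY_INDEX, TOY_IDF, ["speed"], ["velocity"], lam=0.6, top_n=2)
```

Only documents that appear in at least one inverted list of Q_exp are scored, which mirrors taking the union of the inverted lists before ranking; document "C" in the toy index is never touched.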
A word2vec-based semantic query expansion device, comprising:
A query and document preprocessing module, configured to apply tokenization, stop-word removal, stemming, and similar processing to the document collection and to the query submitted by the user, forming the query Q and the document set D;
An expansion-word candidate selection module, configured to compute, for each keyword in the query Q, the n most similar terms using word vectors trained with the word2vec model, forming the expansion-word candidate set C;
An expansion vocabulary construction module, configured to compute, for each term in the expansion-word candidate set, its similarity to the whole query, and to select a number of expansion words with higher similarity to construct the expansion vocabulary T;
A document inverted-index module, configured to build an inverted index over the preprocessed document set D;
A query expansion module, configured to compute the relevance between the expanded query and the documents in the corresponding inverted lists and to obtain the relevant documents.
The beneficial effects of the present invention are as follows: a word2vec-based semantic query expansion method is proposed that takes into account the similarity of each expansion word to the whole query, and an expansion-word-oriented query vector generation method is introduced that generates different query vectors for the expansion words of different query terms. This yields an expansion word set that is more relevant to the query and hence better query expansion results.
Description of the drawings
Fig. 1 is a functional block diagram of the word2vec-based semantic query expansion of the present invention;
Fig. 2 shows the expansion-word candidate set of each keyword in the query of the present invention;
Fig. 3 shows the inverted index of the present invention.
Detailed description
The invention is further described below with reference to the accompanying drawings and specific embodiments.
Embodiment 1: As shown in Figs. 1-3, a word2vec-based semantic query expansion method comprises:
Query and document preprocessing step:
(1) The query submitted by the user is tokenized on whitespace and punctuation;
(2) After tokenization, stop words are removed, filtering out words that do not represent a concept;
(3) After stop-word removal, stemming is applied to generate the query Q;
(4) The same preprocessing is applied to the document collection to generate the new document set D.
Example 1: Query preprocessing. Suppose the query submitted by the user is "problems associated with high speed aircraft".
(1) The query is first tokenized; after tokenization it is: {problems, associated, with, high, speed, aircraft};
(2) Stop words are removed, and the nouns in the query are then selected to form the final query: {problems, speed, aircraft};
(3) The keywords in the query are stemmed; problems is a plural noun, so after stemming the query keyword set is Q = {problem, speed, aircraft}.
Example 2: Document collection preprocessing. Suppose the document collection consists of the following four documents:
D0 = "The main problem limiting the high velocity performance of helicopter is resistance"
D1 = "high altitude and high speed flying aircraft are often more slender shape"
D2 = "There are many airplanes in the sky that make up a row"
D3 = "whether to fly today is a problem"
All words in each string are extracted by splitting on spaces and separators, stop words are removed, and stemming is applied. The resulting document set is:
D0=" problem, limit, velocity, performance, helicopter, resistance "
D1=" altitude, speed, fly, aircraft, slender, shape "
D2=" airplane, sky, row "
D3=" fly, problem "
Expansion-word candidate selection step:
(1) The Wikipedia corpus is selected, and a file of 200-dimensional word vectors is trained with the CBOW model provided by word2vec;
(2) Once the word vectors have been obtained, for each keyword in Q the n most similar words are found by computing the cosine similarity of the word vectors, and these serve as the expansion-word candidate set of the query.
For each keyword in the query Q = {problem, speed, aircraft}, the 10 most semantically related expansion words are selected using the trained word vectors; the resulting expansion-word candidate set is shown in Fig. 2.
Expansion vocabulary T construction step:
(1) For each keyword q_i in Q, a query vector vec(Q_{q_i}) of Q relative to q_i is generated according to a formula in which vec(q_i) denotes the vector of query term q_i and sim(q_i, q_j) denotes the similarity between q_i and q_j.
(2) For each candidate expansion word t of q_i, the similarity between t and the query Q is computed as sim(t, Q) = cos(vec(t), vec(Q_{q_i})).
(3) The similarity of each query term's expansion words to the whole query Q is computed according to the above model; the expansion words are then re-ranked by similarity, and the k expansion words with the highest similarity are returned as the final expansion word set T;
(4) The expanded query Q_exp = Q ∪ T is generated.
Example:
(1) The 200-dimensional word vector of each keyword in the query Q is first obtained from the trained word vectors:
vec(problem) = [0.29686138, 1.71120727, ..., -0.6585713, -1.86508703]
vec(speed) = [-2.00363445, 1.05960512, ..., -0.475373, -4.39991331]
vec(aircraft) = [-3.54158616, 3.28720021, ..., -2.34602952, -3.29022384]
The expansion-word-oriented query vector of each keyword in Q is then computed.
(2) For the keyword aircraft in the query Q, i.e. q_3 = aircraft, the similarity between each expansion word t of q_3 and the query Q is computed:
........
(3) In the same way, the similarity between each expansion word in Fig. 2 and the original query Q is computed. The expansion words in the candidate set are then ranked by similarity to obtain the k expansion words most similar to Q. Taking k = 4 as an example, the resulting expansion vocabulary T is:
T = {helicopter, airplane, velocity, altitude}
(4) The query terms and expansion words are merged to obtain the expanded query Q_exp:
Q_exp = Q ∪ T
= {problem, speed, aircraft} ∪ {helicopter, airplane, velocity, altitude}
= {problem, speed, aircraft, helicopter, airplane, velocity, altitude}
The construction of the document inverted index comprises the following steps:
(1) For the preprocessed document set D, the distinct terms in D are collected to generate the vocabulary V;
(2) For each term v in V, an inverted list is constructed consisting of the ID d_id of every document d (d ∈ D) that contains v together with the number of occurrences tf_{v,d} of v in d. Each entry in the list is a pair <d_id, tf_{v,d}>, and the set of all inverted lists constitutes the inverted index I;
(3) For each term v, the number m of documents in which it occurs is counted, and the idf score of v is computed from m and |D|, where |D| denotes the total number of documents in D.
Example:
(1) After tokenization, stop-word removal, and the other preprocessing steps, the following document set D is obtained:
D0=" problem, limit, velocity, performance, helicopter, resistance "
D1=" altitude, speed, fly, aircraft, slender, shape "
D2=" airplane, sky, row "
D3=" fly, problem "
The distinct terms in D are collected to generate the vocabulary V:
V = {altitude, speed, fly, aircraft, slender, shape, problem, limit, velocity, performance, helicopter, resistance, airplane, sky, row}
(2) Taking the word velocity in V as an example, the document set D is traversed to find the document containing velocity, which is D1. Its ID = D1 is recorded, and the number of occurrences of the word in document D1 is counted as 1, so the inverted list of velocity is represented as <D1, 1>. The inverted lists of all terms in V are computed in the same way, and their set constitutes the inverted index I;
(3) For each word v in V, the number m of documents in which it occurs (i.e. the length of the inverted list of v) is counted and the idf score is computed.
For example, for v = velocity the inverted list length is 1, i.e. only 1 document in the collection contains velocity, so m = 1, and the idf score of the word velocity is computed accordingly.
The idf scores of all words are computed in this way and recorded in the index. The final inverted index I is shown in Fig. 3.
Query expansion step:
(1) For each keyword in Q_exp, the inverted index I is queried to obtain the inverted list of that keyword; the set of these inverted lists is denoted I_{Q_exp};
(2) For each document d appearing in I_{Q_exp}, its tf-idf scores in the lists of I_{Q_exp} are accumulated to obtain the relevance R(Q_exp, d) between Q_exp and document d. In the formula for R(Q_exp, d), λ is a tuning parameter that controls the relative weights of query terms and expansion words when the relevance is computed.
(3) The documents are ranked by relevance, and the N documents most relevant to the original query are returned.
Example:
(1) For the Q_exp generated above, the inverted index of Fig. 3 is queried to obtain the inverted lists of all keywords in Q_exp, and their union I_{Q_exp} is taken:
I_{Q_exp} = I(problem) ∪ I(speed) ∪ ... ∪ I(airplane) ∪ I(altitude)
= {D1, D3} ∪ {D0} ∪ ... ∪ {D2} ∪ {D0}
= {D0, D1, D2, D3}
(2) For documents D0, D1, D2, and D3, the relevance R(Q_exp, d) to Q_exp is computed, with the tuning parameter set to λ = 0.6 here.
(3) The documents are ranked by relevance, giving D1 > D0 > D2 > D3; if N = 3, documents D1, D0, and D2 are returned.
Embodiment 2: A word2vec-based semantic query expansion device, comprising:
A query and document preprocessing module, configured to apply tokenization, stop-word removal, stemming, and similar processing to the document collection and to the query submitted by the user, forming the query Q and the document set D;
An expansion-word candidate selection module, configured to compute, for each keyword in the query Q, the n most similar terms using word vectors trained with the word2vec model, forming the expansion-word candidate set C;
An expansion vocabulary construction module, configured to compute, for each term in the expansion-word candidate set, its similarity to the whole query, and to select a number of expansion words with higher similarity to construct the expansion vocabulary T;
A document inverted-index module, configured to build an inverted index over the preprocessed document set D;
A query expansion module, configured to compute the relevance between the expanded query and the documents in the corresponding inverted lists and to obtain the relevant documents.
The specific embodiments of the present invention have been explained in detail above with reference to the accompanying drawings, but the present invention is not limited to the above embodiments. Within the knowledge of a person of ordinary skill in the art, various changes may also be made without departing from the concept of the present invention.
Claims (7)
1. A word2vec-based semantic query expansion method, characterized in that the method comprises the following steps:
(1) Query and document preprocessing: the query submitted by the user is tokenized, stop words are removed, and the keywords of the user's query are extracted and stemmed to form the query Q; the same preprocessing is applied to the document collection to obtain the document set D;
(2) Expansion-word candidate selection: for the preprocessed query Q, word vectors trained with the word2vec model are used to compute the n most similar terms for each query keyword, forming the expansion-word candidate set C;
(3) Expansion vocabulary construction: for each term in C, its similarity to the whole query is computed, and the k expansion words with the highest similarity are selected to construct the expansion vocabulary T;
(4) Document inverted-index construction: an inverted index is built over the preprocessed document set D;
(5) Query expansion: the relevance between the expanded query and the documents in the corresponding inverted lists is computed, and the documents are ranked by relevance.
2. The word2vec-based semantic query expansion method according to claim 1, characterized in that the query and document preprocessing step specifically comprises the following steps:
(1) The query submitted by the user is tokenized on whitespace and punctuation;
(2) After tokenization, stop words are removed, filtering out words that do not represent a concept;
(3) After stop-word removal, stemming is applied to generate the query Q;
(4) The same preprocessing is applied to the document collection to generate the new document set D.
3. The word2vec-based semantic query expansion method according to claim 1, characterized in that the expansion-word candidate selection step specifically comprises the following steps:
(1) Given a corpus, word vectors are trained with the training model provided by word2vec; a word vector is a multi-dimensional real-valued vector that reflects semantic and syntactic relationships in natural language, so the similarity between terms can be judged by computing the cosine between their word vectors;
(2) Once the word vectors have been obtained, for each keyword q_i in Q, the n words most similar to q_i are found by computing the cosine similarity of the word vectors, and these form the expansion-word candidate set of the query.
4. The word2vec-based semantic query expansion method according to claim 1, characterized in that the expansion vocabulary construction step specifically comprises the following steps:
(1) For the query Q formed by the above processing, for each keyword q_i in Q, a query vector vec(Q_{q_i}) of Q relative to q_i is generated according to a formula in which vec(q_i) denotes the vector of query term q_i and sim(q_i, q_j) denotes the similarity between q_i and q_j.
(2) For each candidate expansion word t of q_i, the similarity between t and the query Q is computed as follows:
sim(t, Q) = cos(vec(t), vec(Q_{q_i}))
For the candidate expansion words of different query terms, different query vectors vec(Q_{q_i}) are used to compute the similarity between the expansion word and the query Q; the method of generating vec(Q_{q_i}) is called the expansion-word-oriented query vector generation method, and correspondingly vec(Q_{q_i}) is also called the expansion-word-oriented query vector;
(3) The similarity of each query term's expansion words to the whole query Q is computed according to the above model; the expansion words are then re-ranked by similarity, and the k expansion words with the highest similarity are returned as the final expansion word set T;
(4) The expanded query Q_exp = Q ∪ T is generated.
5. The word2vec-based semantic query expansion method according to claim 1, characterized in that constructing the document inverted index specifically comprises the following steps:
(1) For the preprocessed document set D, all words in D are collected and deduplicated to generate the document vocabulary V;
(2) For each term v in V, an inverted list is constructed consisting of the ID d_id of every document d (d ∈ D) that contains v together with the number of occurrences tf_{v,d} of v in d; each entry in the list is a pair <d_id, tf_{v,d}>, and the set of all inverted lists constitutes the inverted index I;
(3) For each term v, the number m of documents in which it occurs is counted, and the idf score of v is computed from m and |D|, where |D| denotes the total number of documents in D.
6. The word2vec-based semantic query expansion method according to claim 1, characterized in that the query expansion specifically comprises the following steps:
(1) For each keyword in Q_exp, the inverted index I is queried to obtain the inverted list of that keyword; the set of these inverted lists is denoted I_{Q_exp};
(2) For each document d appearing in I_{Q_exp}, its tf-idf scores in the lists of I_{Q_exp} are accumulated to obtain the relevance R(Q_exp, d) between Q_exp and document d; in the formula for R(Q_exp, d), λ is a tuning parameter that controls the relative weights of query terms and expansion words when the relevance is computed.
(3) The documents are ranked by relevance, and the N documents most relevant to the original query are returned.
7. A word2vec-based semantic query expansion device, characterized by comprising:
a query and document set preprocessing module, configured to perform word segmentation, stop-word removal, stemming, and similar processing on the document set and on the query submitted by the user, forming the query Q and the document set D;
an expansion word candidate set selection module, configured to, for each keyword in the query Q, use word vectors trained with the word2vec model to compute the n most similar terms of that keyword, which together constitute the expansion word candidate set C;
an expansion vocabulary construction module, configured to compute, for each term in the expansion word candidate set, its similarity with the whole query, and to select the expansion words with the highest similarity to construct the expansion vocabulary T;
a document set inverted index module, configured to build an inverted index over the preprocessed document set D;
a query expansion module, configured to compute the relevance between the expanded query and the documents in the corresponding inverted index, and to obtain the relevant documents.
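The candidate set selection and expansion vocabulary construction modules of claim 7 can be sketched as follows. The hand-written vectors stand in for word vectors trained with word2vec, and the function name `expand_query` is hypothetical. The text does not define how similarity with the whole query is computed, so the sketch assumes the mean cosine similarity to all query keywords.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_query(query, vectors, n=2, k=2):
    """Return (T, Q_exp) for a query whose keywords all have vectors."""
    # Candidate set C: the n most similar terms of each query keyword.
    candidates = set()
    for q in query:
        neighbours = sorted(
            (t for t in vectors if t not in query),
            key=lambda t: (-cosine(vectors[q], vectors[t]), t),
        )
        candidates.update(neighbours[:n])

    # ASSUMPTION: similarity with the whole query is the mean cosine
    # similarity between the candidate and every query keyword.
    def query_similarity(t):
        return sum(cosine(vectors[q], vectors[t]) for q in query) / len(query)

    # Re-rank the candidates and keep the k most similar ones as T.
    ranked = sorted(candidates, key=lambda t: (-query_similarity(t), t))
    expansion = ranked[:k]
    return expansion, list(query) + expansion


# Toy vectors standing in for trained word2vec embeddings.
vectors = {
    "car": [1.0, 0.0, 0.0],
    "engine": [0.0, 1.0, 0.0],
    "automobile": [0.9, 0.1, 0.0],
    "motor": [0.2, 0.9, 0.0],
    "vehicle": [0.7, 0.6, 0.0],
    "banana": [0.0, 0.0, 1.0],
}
expansion, expanded_query = expand_query(["car", "engine"], vectors, n=2, k=2)
```

With these toy vectors, "automobile" is the nearest neighbour of "car" alone, but "vehicle" and "motor" score higher against the whole query and are therefore the ones kept in T.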
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810179478.3A CN108491462B (en) | 2018-03-05 | 2018-03-05 | Semantic query expansion method and device based on word2vec |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108491462A (en) | 2018-09-04 |
CN108491462B (en) | 2021-09-14 |
Family
ID=63341204
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810179478.3A Active CN108491462B (en) | Semantic query expansion method and device based on word2vec | 2018-03-05 | 2018-03-05 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108491462B (en) |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109063203A (en) * | 2018-09-14 | 2018-12-21 | 河海大学 | A kind of query word extended method based on personalized model |
CN109446399A (en) * | 2018-10-16 | 2019-03-08 | 北京信息科技大学 | A kind of video display entity search method |
CN109885766A (en) * | 2019-02-11 | 2019-06-14 | 武汉理工大学 | A kind of books recommended method and system based on book review |
CN110008407A (en) * | 2019-04-09 | 2019-07-12 | 苏州浪潮智能科技有限公司 | A kind of information retrieval method and device |
CN110188204A (en) * | 2019-06-11 | 2019-08-30 | 腾讯科技(深圳)有限公司 | A kind of extension corpora mining method, apparatus, server and storage medium |
CN110196977A (en) * | 2019-05-31 | 2019-09-03 | 广西南宁市博睿通软件技术有限公司 | A kind of intelligence alert inspection processing system and method |
CN110489526A (en) * | 2019-08-13 | 2019-11-22 | 上海市儿童医院 | A kind of term extended method, device and storage medium for medical retrieval |
CN110909116A (en) * | 2019-11-28 | 2020-03-24 | 中国人民解放军军事科学院军事科学信息研究中心 | Entity set expansion method and system for social media |
WO2020062770A1 (en) * | 2018-09-27 | 2020-04-02 | 深圳大学 | Method and apparatus for constructing domain dictionary, and device and storage medium |
CN111897928A (en) * | 2020-08-04 | 2020-11-06 | 广西财经学院 | Chinese query expansion method for embedding expansion words into query words and counting expansion word union |
CN112199461A (en) * | 2020-09-17 | 2021-01-08 | 暨南大学 | Document retrieval method, device, medium and equipment based on block index structure |
WO2021032824A1 (en) * | 2019-08-20 | 2021-02-25 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e. V. | Method and device for pre-selecting and determining similar documents |
CN112836008A (en) * | 2021-02-07 | 2021-05-25 | 中国科学院新疆理化技术研究所 | Index establishing method based on decentralized storage data |
CN112949304A (en) * | 2021-03-24 | 2021-06-11 | 中新国际联合研究院 | Construction case knowledge reuse query method and device |
CN113033197A (en) * | 2021-03-24 | 2021-06-25 | 中新国际联合研究院 | Building construction contract rule query method and device |
CN113486067A (en) * | 2021-07-16 | 2021-10-08 | 用友网络科技股份有限公司 | Information query method, system and readable storage medium |
CN114661852A (en) * | 2020-12-23 | 2022-06-24 | 深圳市万普拉斯科技有限公司 | Text searching method, terminal and readable storage medium |
CN114723008A (en) * | 2022-04-01 | 2022-07-08 | 北京健康之家科技有限公司 | Language representation model training method, device, equipment, medium and user response method |
CN116340470A (en) * | 2023-05-30 | 2023-06-27 | 环球数科集团有限公司 | Keyword associated retrieval system based on AIGC |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104765769A (en) * | 2015-03-06 | 2015-07-08 | 大连理工大学 | Short text query expansion and indexing method based on word vector |
CN104778161A (en) * | 2015-04-30 | 2015-07-15 | 车智互联(北京)科技有限公司 | Keyword extracting method based on Word2Vec and Query log |
CN106156272A (en) * | 2016-06-21 | 2016-11-23 | 北京工业大学 | A kind of information retrieval method based on multi-source semantic analysis |
US9798820B1 (en) * | 2016-10-28 | 2017-10-24 | Searchmetrics Gmbh | Classification of keywords |
CN107391671A (en) * | 2017-07-21 | 2017-11-24 | 华中科技大学 | A kind of document leakage detection method and system |
US20180004815A1 (en) * | 2015-12-01 | 2018-01-04 | Huawei Technologies Co., Ltd. | Stop word identification method and apparatus |
- 2018-03-05 CN CN201810179478.3A patent/CN108491462B/en active Active
Non-Patent Citations (3)
Title |
---|
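ZHANG LIFENG et al.: "Behavior Targeting Based on Hierarchical Taxonomy Aggregation for Heterogeneous Online Shopping Applications", ZTE COMMUNICATIONS * |
XU KANG: "Research on Personalized Search Ranking Based on a User Interest Model", China Master's Theses Full-text Database, Information Science and Technology * |
XU KAN et al.: "Research on Word Vector Methods for Patent Query Expansion", Journal of Frontiers of Computer Science and Technology * |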
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109063203A (en) * | 2018-09-14 | 2018-12-21 | 河海大学 | A kind of query word extended method based on personalized model |
CN109063203B (en) * | 2018-09-14 | 2020-07-24 | 河海大学 | Query term expansion method based on personalized model |
WO2020062770A1 (en) * | 2018-09-27 | 2020-04-02 | 深圳大学 | Method and apparatus for constructing domain dictionary, and device and storage medium |
CN109446399A (en) * | 2018-10-16 | 2019-03-08 | 北京信息科技大学 | A kind of video display entity search method |
CN109885766A (en) * | 2019-02-11 | 2019-06-14 | 武汉理工大学 | A kind of books recommended method and system based on book review |
CN110008407A (en) * | 2019-04-09 | 2019-07-12 | 苏州浪潮智能科技有限公司 | A kind of information retrieval method and device |
CN110008407B (en) * | 2019-04-09 | 2021-05-04 | 苏州浪潮智能科技有限公司 | Information retrieval method and device |
CN110196977A (en) * | 2019-05-31 | 2019-09-03 | 广西南宁市博睿通软件技术有限公司 | A kind of intelligence alert inspection processing system and method |
CN110196977B (en) * | 2019-05-31 | 2023-06-09 | 广西南宁市博睿通软件技术有限公司 | Intelligent warning condition supervision processing system and method |
CN110188204A (en) * | 2019-06-11 | 2019-08-30 | 腾讯科技(深圳)有限公司 | A kind of extension corpora mining method, apparatus, server and storage medium |
CN110188204B (en) * | 2019-06-11 | 2022-10-04 | 腾讯科技(深圳)有限公司 | Extended corpus mining method and device, server and storage medium |
CN110489526A (en) * | 2019-08-13 | 2019-11-22 | 上海市儿童医院 | A kind of term extended method, device and storage medium for medical retrieval |
WO2021032824A1 (en) * | 2019-08-20 | 2021-02-25 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e. V. | Method and device for pre-selecting and determining similar documents |
CN110909116A (en) * | 2019-11-28 | 2020-03-24 | 中国人民解放军军事科学院军事科学信息研究中心 | Entity set expansion method and system for social media |
CN110909116B (en) * | 2019-11-28 | 2022-12-23 | 中国人民解放军军事科学院军事科学信息研究中心 | Entity set expansion method and system for social media |
CN111897928A (en) * | 2020-08-04 | 2020-11-06 | 广西财经学院 | Chinese query expansion method for embedding expansion words into query words and counting expansion word union |
CN112199461B (en) * | 2020-09-17 | 2022-05-31 | 暨南大学 | Document retrieval method, device, medium and equipment based on block index structure |
CN112199461A (en) * | 2020-09-17 | 2021-01-08 | 暨南大学 | Document retrieval method, device, medium and equipment based on block index structure |
CN114661852A (en) * | 2020-12-23 | 2022-06-24 | 深圳市万普拉斯科技有限公司 | Text searching method, terminal and readable storage medium |
CN112836008B (en) * | 2021-02-07 | 2023-03-21 | 中国科学院新疆理化技术研究所 | Index establishing method based on decentralized storage data |
CN112836008A (en) * | 2021-02-07 | 2021-05-25 | 中国科学院新疆理化技术研究所 | Index establishing method based on decentralized storage data |
CN113033197A (en) * | 2021-03-24 | 2021-06-25 | 中新国际联合研究院 | Building construction contract rule query method and device |
CN112949304A (en) * | 2021-03-24 | 2021-06-11 | 中新国际联合研究院 | Construction case knowledge reuse query method and device |
CN113486067A (en) * | 2021-07-16 | 2021-10-08 | 用友网络科技股份有限公司 | Information query method, system and readable storage medium |
CN114723008A (en) * | 2022-04-01 | 2022-07-08 | 北京健康之家科技有限公司 | Language representation model training method, device, equipment, medium and user response method |
CN116340470A (en) * | 2023-05-30 | 2023-06-27 | 环球数科集团有限公司 | Keyword associated retrieval system based on AIGC |
CN116340470B (en) * | 2023-05-30 | 2023-09-15 | 环球数科集团有限公司 | Keyword associated retrieval system based on AIGC |
Also Published As
Publication number | Publication date |
---|---|
CN108491462B (en) | 2021-09-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108491462A (en) | A kind of semantic query expansion method and device based on word2vec | |
Yin et al. | Ranking relevance in yahoo search | |
Carpineto et al. | A survey of automatic query expansion in information retrieval | |
Moawad et al. | Bi-gram term collocations-based query expansion approach for improving Arabic information retrieval | |
Yusuf et al. | Query expansion method for quran search using semantic search and lucene ranking | |
Madnani et al. | Multiple alternative sentence compressions for automatic text summarization | |
El Mahdaouy et al. | Semantically enhanced term frequency based on word embeddings for Arabic information retrieval | |
Grineva et al. | Blognoon: Exploring a topic in the blogosphere | |
Kanwal et al. | Adaptively intelligent meta-search engine with minimum edit distance | |
Pasca | Open-domain fine-grained class extraction from web search queries | |
CN113642325A (en) | Text keyword extraction method fusing text structure information and semantic information | |
Artese et al. | What is this painting about? Experiments on Unsupervised Keyphrases Extraction algorithms | |
Gulati et al. | Ontology driven query expansion for better image retrieval | |
Manjula et al. | Semantic search engine | |
Wang et al. | Exploiting semantic knowledge base for patent retrieval | |
CN106708808B (en) | Information mining method and device | |
Liu et al. | A query suggestion method based on random walk and topic concepts | |
Nwesri et al. | Applying Arabic stemming using query expansion | |
CN114186075B (en) | Semantic search method for knowledge graph in cultural domain | |
Martínez et al. | Evaluation of MIRACLE approach results for CLEF 2003 | |
Qin et al. | Expansion model of semantic query based on ontology | |
Reddy et al. | Cross lingual information retrieval using search engine and data mining | |
Khalid et al. | BERT-embedding and citation network analysis based query expansion technique for scholarly search | |
Larson | Experiments in classification clustering and thesaurus expansion for domain specific cross-language retrieval | |
Ramakrishna et al. | Information retrieval in Telugu language using synset relationships |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||