CN103064840A - Indexing equipment, indexing method, search device, search method and search system - Google Patents

Indexing equipment, indexing method, search device, search method and search system

Info

Publication number
CN103064840A
Authority
CN
China
Prior art keywords
word
high frequency
frequency words
document
index
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011103195489A
Other languages
Chinese (zh)
Inventor
许欢庆
吴尉林
夏亮
郭永福
陈沛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongsou Network Technology Co ltd
Original Assignee
Beijing Zhongsou Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongsou Network Technology Co ltd filed Critical Beijing Zhongsou Network Technology Co ltd
Priority to CN2011103195489A priority Critical patent/CN103064840A/en
Publication of CN103064840A publication Critical patent/CN103064840A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an indexing device comprising a high-frequency word processing module and an index building module. When the current word in a document is a high-frequency word, the high-frequency word processing module expands the current word based on the front-side word and/or rear-side word adjacent to it, and the index building module builds an index from the new words obtained by the expansion and the document. Because the high-frequency words among the document's keywords are expanded, the number of high-frequency keywords is reduced, which avoids the high retrieval volume and long retrieval time caused by building an index from a large number of high-frequency words. The invention further provides an indexing method, a search device, a search method and a search system.

Description

Indexing device, indexing method, search device, search method and search system
Technical field
The present invention relates to the field of computer technology, and in particular to an indexing device, an indexing method, a search device, a search method and a search system.
Background technology
At present, search engines have become the main entrance to the Internet; people query and locate Internet information resources through them. To help users find the information they need quickly and accurately, search engines provide multiple retrieval modes. Among these, exact phrase retrieval (PhraseQuery) and proximity retrieval (ProximityQuery) comprehensively evaluate the positions, order and frequency with which the query string occurs in a document, and thereby effectively improve the relevance of search results. A user's query usually contains several words (statistics show more than 2.5 words), and the order of the words carries strong semantic relevance. For exact phrase retrieval, the user requires that the returned documents contain the complete query string. For proximity retrieval, the result set gives priority to documents in which the words occur in the same order as in the query string. It follows that whether the user's query string occurs in a document, and attributes such as how often it occurs, are key factors in evaluating document relevance.
Retrieval modes such as exact phrase retrieval and proximity retrieval effectively improve relevance, but they require matching calculations on keyword positions within documents during retrieval, which greatly reduces retrieval speed. At present, a search engine processes an exact phrase request submitted by a user as follows: it first performs an AND retrieval on the keywords of the query string, then checks positions in the documents returned by the AND retrieval, determines and counts how often the complete query string occurs, and finally calculates relevance.
Search engine indexes generally use an inverted index structure that maps keywords to document information: each keyword points to a document linked list containing every document in which it occurs. Words that occur often across all documents are called "high-frequency words", that is, words with a relatively high frequency of occurrence, such as the possessive particle "of" and the pronoun "I". Such keywords not only occur in many documents but also occur many times in each of them. Because the document linked list records the positions at which a keyword occurs in each document, so that relevance can be calculated accurately later, the document linked lists that such keywords point to in the inverted index are quite large.
For example, when a user queries "my university", the word segmentation module of the search engine turns the request into the keyword string "I / of / university", and an AND operation is performed on the inverted index lists of "I", "of" and "university". For each document containing all three words, the positions of the three words in the document are read and the corresponding statistics and checks are carried out. Because the search keywords "I" and "of" both occur with high frequency in Chinese text, their inverted index lists are very long; they also occur very often within each document, so their position lists are long as well. The whole query therefore requires a huge amount of computation, which seriously affects query speed; in extreme cases a query takes a second or more.
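The conventional processing described above (AND the posting lists, then check positions) can be sketched as follows. This is a minimal illustration, not the implementation of any particular search engine: the function names, the dictionary-based index layout and the English stand-in tokens "I", "of" and "university" are all assumptions made for the example.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each keyword to {doc_id: [positions]}; docs hold pre-segmented tokens."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for pos, tok in enumerate(tokens):
            index[tok][doc_id].append(pos)
    return index

def phrase_query(index, query_tokens):
    """Return {doc_id: phrase_count} for a non-empty exact phrase query."""
    postings = [index.get(t, {}) for t in query_tokens]
    # Step 1: AND retrieval, keep documents that contain every keyword.
    candidates = set(postings[0])
    for p in postings[1:]:
        candidates &= set(p)
    # Step 2: position check, count starts where the keywords follow in order.
    hits = {}
    for doc_id in candidates:
        count = sum(
            all(start + i in postings[i][doc_id]
                for i in range(1, len(query_tokens)))
            for start in postings[0][doc_id]
        )
        if count:
            hits[doc_id] = count
    return hits

# Hypothetical corpus; "of" stands in for the high-frequency particle.
docs = {
    1: ["I", "of", "university"],
    2: ["university", "of", "I"],
    3: ["I", "of", "motherland"],
}
index = build_positional_index(docs)
result = phrase_query(index, ["I", "of", "university"])
print(result)
```

Even in this tiny corpus the posting list for "of" covers every document, so the position check runs over far more candidates than finally match, which is the cost the background section describes.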
In the traditional technical scheme, when a query string contains high-frequency words, those words occur so frequently that they appear in nearly every document, so relevance has to be calculated over essentially the whole document collection for such a query string. For the hundreds of millions of documents held by a search engine, the amount of computation is enormous and a single search is very time-consuming, which harms the user experience.
Therefore, there is a need for a new indexing and search method that overcomes the shortcomings of the prior art, uses computer hardware resources effectively while preserving relevance accuracy, improves search efficiency and improves the user experience.
Summary of the invention
The technical problem to be solved by the present invention is to provide a new indexing and search method that overcomes the shortcomings of the prior art, uses computer hardware resources effectively while preserving relevance accuracy, improves search efficiency and improves the user experience.
In view of this, the present invention proposes an indexing device comprising: a high-frequency word processing module which, when the current word in a document is a high-frequency word, expands the current word according to the front-side word and/or rear-side word adjacent to it; and an index building module which builds an index from the new words obtained by the expansion and the document. In this technical scheme, expanding the high-frequency words among the document's keywords reduces the number of high-frequency keywords and avoids the excessive retrieval volume and long retrieval time caused by building an index from a large number of high-frequency words.
In the above technical scheme, preferably, when the front-side word and/or rear-side word is also a high-frequency word, the high-frequency word processing module combines the front-side word and/or rear-side word with the current word to form the new word. For example, in "my motherland" (segmented as "I / of / motherland"), the particle "of" is a high-frequency word; when it is expanded, the front-side keyword "I" is likewise a high-frequency word, so "of" is combined with "I" to give "I of" (that is, "my"), which is used as a new keyword for building the index.
In the above technical scheme, preferably, when the front-side word and/or rear-side word is a non-high-frequency word, the high-frequency word processing module combines at least the last character of the front-side word with the current word, and/or combines at least the first character of the rear-side word with the current word, to form the new word. For example, in "mouse pad on the desk" (segmented as "on the desk / of / mouse pad"), take the particle "of" as the high-frequency word; the front-side keyword "on the desk" and the rear-side keyword "mouse pad" are both non-high-frequency words. Combining with the front-side word means taking at least the last character of the front-side keyword, so the expansion covers at least "on" and may cover more of "on the desk". Combining with the rear-side word means taking at least the first character of the rear-side keyword, so the expansion covers at least the first character of "mouse pad" and may cover more, such as "mouse". How many characters are used can be set flexibly as required; the new keywords obtained after expansion are then used to build the index.
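The two expansion rules above can be sketched as a single function. This is a hedged illustration rather than the patent's own code: the function name, the `n_chars` parameter and the high-frequency word list are hypothetical, and the Chinese tokens are assumed stand-ins for the translated examples "my motherland" and "mouse pad on the desk", with 的 assumed to be the high-frequency particle between the two words.

```python
# Assumed high-frequency word list (the particle and the pronoun "I").
HIGH_FREQ = {"的", "我"}

def expand_high_freq(tokens, i, high_freq=HIGH_FREQ, n_chars=1):
    """Return the new keywords formed from tokens[i] and its neighbours.

    A high-frequency neighbour is combined whole; a non-high-frequency
    neighbour contributes only its last (front side) or first (rear side)
    n_chars characters.
    """
    cur = tokens[i]
    if cur not in high_freq:
        return [cur]  # ordinary keywords are left unchanged
    new_words = []
    if i > 0:
        front = tokens[i - 1]
        part = front if front in high_freq else front[-n_chars:]
        new_words.append(part + cur)
    if i + 1 < len(tokens):
        rear = tokens[i + 1]
        part = rear if rear in high_freq else rear[:n_chars]
        new_words.append(cur + part)
    return new_words

motherland = expand_high_freq(["我", "的", "祖国"], 1)
desk = expand_high_freq(["桌子上", "的", "鼠标垫"], 1)
desk_two_chars = expand_high_freq(["桌子上", "的", "鼠标垫"], 1, n_chars=2)
print(motherland, desk, desk_two_chars)
```

The `n_chars` argument reflects the statement that the number of characters taken from a non-high-frequency neighbour can be set flexibly; whether to use the front side, the rear side or both is likewise a configuration choice, and this sketch simply returns both.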
The present invention also proposes an indexing method comprising: step 202, in which, when the current word in a document is a high-frequency word, the high-frequency word processing module expands the current word according to the front-side word and/or rear-side word adjacent to it; and step 204, in which the index building module builds an index from the new words obtained by the expansion and the document. In this technical scheme, expanding the high-frequency words among the document's keywords reduces the number of high-frequency keywords and avoids the excessive retrieval volume and long retrieval time caused by building an index from a large number of high-frequency words.
In this technical scheme, preferably, step 202 specifically comprises: when the front-side word and/or rear-side word is also a high-frequency word, the high-frequency word processing module combines the front-side word and/or rear-side word with the current word to form the new word. For example, in "my motherland" (segmented as "I / of / motherland"), the particle "of" is a high-frequency word; when it is expanded, the front-side keyword "I" is likewise a high-frequency word, so "of" is combined with "I" to give "I of" (that is, "my"), which is used as a new keyword for building the index.
In this technical scheme, preferably, step 202 specifically comprises: when the front-side word and/or rear-side word is a non-high-frequency word, the high-frequency word processing module combines at least the last character of the front-side word with the current word, and/or combines at least the first character of the rear-side word with the current word, to form the new word. For example, in "mouse pad on the desk" (segmented as "on the desk / of / mouse pad"), take the particle "of" as the high-frequency word; the front-side keyword "on the desk" and the rear-side keyword "mouse pad" are both non-high-frequency words. Combining with the front-side word means taking at least the last character of the front-side keyword, so the expansion covers at least "on" and may cover more of "on the desk". Combining with the rear-side word means taking at least the first character of the rear-side keyword, so the expansion covers at least the first character of "mouse pad" and may cover more, such as "mouse". How many characters are used can be set flexibly as required; the new keywords obtained after expansion are then used to build the index.
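Steps 202 and 204 together can be sketched as an index builder in which high-frequency words never become index keys on their own. This is a minimal sketch under stated assumptions: the names `index_keys` and `build_index`, the set-based posting lists and the three-document corpus are hypothetical, and the Chinese tokens are assumed originals of the translated examples.

```python
from collections import defaultdict

HIGH_FREQ = {"的", "我"}  # assumed high-frequency word list

def index_keys(tokens, n_chars=1):
    """Step 202 (sketch): yield index keys, expanding high-frequency words."""
    for i, tok in enumerate(tokens):
        if tok not in HIGH_FREQ:
            yield tok
            continue
        # A high-frequency word with no neighbour at all yields no key.
        if i > 0:
            front = tokens[i - 1]
            yield (front if front in HIGH_FREQ else front[-n_chars:]) + tok
        if i + 1 < len(tokens):
            rear = tokens[i + 1]
            yield tok + (rear if rear in HIGH_FREQ else rear[:n_chars])

def build_index(docs):
    """Step 204 (sketch): inverted index from keyword to the set of doc ids."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for key in index_keys(tokens):
            index[key].add(doc_id)
    return index

docs = {
    1: ["我", "的", "祖国"],
    2: ["桌子上", "的", "鼠标垫"],
    3: ["我", "的", "大学"],
}
index = build_index(docs)
for key in sorted(index):
    print(key, sorted(index[key]))
```

In this sketch the particle alone would have pointed at all three documents, whereas each expanded key points at one or two; a production index would also store positions, which are omitted here to keep the example short.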
The present invention also proposes a search device comprising: a high-frequency word processing module which, when the current word in a query string is a high-frequency word, expands the current word according to the front-side word and/or rear-side word adjacent to it; and a retrieval module which searches a pre-built index using the new words obtained by the expansion. In this technical scheme, expanding the high-frequency words among the query string's keywords reduces the number of high-frequency keywords and avoids the excessive retrieval volume and long retrieval time caused by searching the index with a large number of high-frequency words.
In the above technical scheme, preferably, the search device further comprises the indexing device described above, which pre-builds the index. Combining the search with an index obtained through the above scheme further optimizes retrieval.
In the above technical scheme, preferably, the high-frequency word processing module also adds marks on both sides of each new word; the retrieval module obtains the new words according to the marks, counts the number of times the new words occur in order in a document, uses that count to calculate the document's relevance, and selects documents as search results according to the relevance obtained. Using exact substring queries in this way ensures retrieval accuracy.
The present invention also proposes a search method comprising: step 402, in which, when the current word in a query string is a high-frequency word, the high-frequency word processing module expands the current word according to the front-side word and/or rear-side word adjacent to it; and step 404, in which the retrieval module searches a pre-built index using the new words obtained by the expansion. In this technical scheme, expanding the high-frequency words among the query string's keywords reduces the number of high-frequency keywords and avoids the excessive retrieval volume and long retrieval time caused by searching the index with a large number of high-frequency words.
In the above technical scheme, preferably, before step 404 the method further comprises pre-building the index by the indexing method described above. Combining the search with an index obtained through the above scheme further optimizes retrieval.
In the above technical scheme, preferably, step 402 further comprises: the high-frequency word processing module adds marks on both sides of each new word. Step 404 specifically comprises: the retrieval module obtains the new words according to the marks, counts the number of times the new words occur in order in a document, uses that count to calculate the document's relevance, and selects documents as search results according to the relevance obtained. Using exact substring queries in this way ensures retrieval accuracy.
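One way to picture steps 402 and 404 is sketched below. The source does not specify the mark characters, the relevance formula or the matching procedure, so everything here is an assumption: square brackets serve as marks, relevance is simply the ordered-occurrence count, and documents are re-expanded on the fly instead of being read from a stored index. The Chinese tokens are assumed stand-ins for the translated "my university" example.

```python
import re

HIGH_FREQ = {"的", "我"}    # assumed high-frequency word list
MARK_L, MARK_R = "[", "]"   # assumed mark characters

def keys(tokens, n_chars=1):
    """Return (keyword, position, is_new_word) triples for a token list."""
    out = []
    for i, tok in enumerate(tokens):
        if tok not in HIGH_FREQ:
            out.append((tok, i, False))
            continue
        if i > 0:
            f = tokens[i - 1]
            out.append(((f if f in HIGH_FREQ else f[-n_chars:]) + tok, i, True))
        if i + 1 < len(tokens):
            r = tokens[i + 1]
            out.append((tok + (r if r in HIGH_FREQ else r[:n_chars]), i, True))
    return out

def mark_query(tokens):
    """Step 402 (sketch): rewrite the query, wrapping new words in marks."""
    unique = dict.fromkeys((k, new) for k, _, new in keys(tokens))
    return " ".join(MARK_L + k + MARK_R if new else k for k, new in unique)

def marked_new_words(query):
    """Step 404 (sketch): recover the new words from the marks."""
    return re.findall(re.escape(MARK_L) + r"(.+?)" + re.escape(MARK_R), query)

def ordered_count(doc_tokens, query_tokens):
    """Count offsets at which every query key occurs at its expected position."""
    q = keys(query_tokens)
    if not q:
        return 0  # a lone high-frequency word yields no keys to match
    doc = {(k, p) for k, p, _ in keys(doc_tokens)}
    shifts = range(len(doc_tokens) - len(query_tokens) + 1)
    return sum(all((k, p + s) in doc for k, p, _ in q) for s in shifts)

docs = {
    1: ["这", "是", "我", "的", "大学", "我", "的", "大学"],
    2: ["我", "的", "祖国"],
    3: ["大学", "的", "我"],
}
query = ["我", "的", "大学"]
marked = mark_query(query)
scores = {d: ordered_count(t, query) for d, t in docs.items()}
ranked = [d for d, s in sorted(scores.items(), key=lambda kv: -kv[1]) if s > 0]
print(marked, marked_new_words(marked), scores, ranked)
```

Expanding both sides of every high-frequency word on the document side means a query that starts or ends with a high-frequency word still finds matching keys, at the cost of a few redundant keys; that trade-off is a design choice of this sketch, not something the source prescribes.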
The present invention also proposes a search system comprising the indexing device described above and the search device described above, where the search device uses the new words it generates to search the index built by the indexing device. In this technical scheme, an index built by expanding high-frequency words into new keywords is paired with a retrieval process that uses the same expansion, forming a complete information retrieval system. Running on existing computer hardware, the system uses hardware resources effectively while maintaining relevance accuracy, and improves the user experience.
Description of drawings
Fig. 1 is a block diagram of an indexing device according to an embodiment of the present invention;
Fig. 2 is a flowchart of an indexing method according to an embodiment of the present invention;
Fig. 3 is a block diagram of a search device according to an embodiment of the present invention;
Fig. 4 is a flowchart of a search method according to an embodiment of the present invention;
Fig. 5 is a block diagram of a search system according to an embodiment of the present invention;
Fig. 6 is a flowchart of high-frequency word processing in an indexing method according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of high-frequency word processing in an indexing method according to an embodiment of the present invention;
Fig. 8 is a schematic flowchart of an indexing method according to an embodiment of the present invention;
Fig. 9 is a schematic flowchart of a search method according to an embodiment of the present invention;
Fig. 10 is a schematic diagram of a data structure used in existing search engines;
Fig. 11 is a schematic flowchart of an indexing method according to an embodiment of the present invention;
Fig. 12 is a schematic flowchart of an indexing method according to an embodiment of the present invention.
Embodiment
In order that the above objects, features and advantages of the present invention may be understood more clearly, the invention is described in further detail below with reference to the drawings and specific embodiments.
Many specific details are set forth in the following description to provide a full understanding of the invention; however, the invention may also be implemented in ways other than those described here, and is therefore not limited to the specific embodiments disclosed below.
First, the principle by which the invention reduces the size of document linked lists and improves retrieval efficiency is explained.
Users rarely search for a single high-frequency word on its own, and such a search would be meaningless; queries involving high-frequency words generally combine them with other words. On this premise, the method proposed by the invention combines each high-frequency word in a document with the word or character after or before it into a non-high-frequency keyword and indexes that keyword. At query time, for a query string containing high-frequency words, the high-frequency words are likewise combined with their non-high-frequency neighbours during word segmentation, which reduces the number of documents hit and improves computational efficiency.
Combining high-frequency words reduces the size of the document linked list that a keyword points to in the inverted index (in this application, the words obtained by segmenting a document or a query string are called keywords). The reasoning is as follows. Suppose there is a document set $U$ containing $N_u$ documents, of which $N_1$ documents ($0 \le N_1 \le N_u$) contain the high-frequency word $W_1$. The probability $F_{w_1}$ that $W_1$ occurs is:

$$F_{w_1} = N_1 / N_u$$

Let another keyword $W_2$ (whether high-frequency or not) occur in $N_2$ documents ($0 \le N_2 \le N_u$). The probability $F_{w_2}$ that $W_2$ occurs is:

$$F_{w_2} = N_2 / N_u$$

If $W_1$ and $W_2$ are now combined into a single keyword $W_1W_2$ (or $W_2W_1$), the probability that this combined keyword occurs in a document is the probability of finding, among the documents that contain $W_1$, those that also contain $W_2$. This probability $F_{w_1 w_2}$ is:

$$F_{w_1 w_2} = F_{w_1} \cdot F_{w_2} = N_1 N_2 / N_u^2$$

If $W_2$ is a non-high-frequency word, then $N_2$ certainly does not equal $N_u$; that is, not every document contains $W_2$. The frequency $F_{w_2}$ is therefore certainly less than 1, so $F_{w_1 w_2}$ is certainly less than $F_{w_1}$. In other words, the document linked list pointed to by the inverted index entry for the combined string $W_1W_2$ is smaller.
If $W_2$ is a high-frequency word, $N_2$ may equal $N_u$; that is, every document may contain $W_2$. If the number of documents $N_1$ containing $W_1$ also equals $N_u$, then ignoring position the probability $F_{w_1 w_2}$ is 1. Once position is taken into account, however, $W_1$ and $W_2$ must occur next to each other, so the probability $F_{w_1 w_2}$ is certainly lower than $F_{w_1}$.
Therefore, the above analysis shows that when a search keyword is a high-frequency word $W_1$ and it is combined with the word $W_2$ on its front or rear side, then regardless of whether $W_2$ is a high-frequency word, the document linked list pointed to by the inverted index entry for the new keyword $W_1W_2$ (or $W_2W_1$) is reduced.
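As a worked illustration of the analysis above, take hypothetical counts chosen only for the arithmetic (they do not come from the source): $N_u = 1000$ documents, $N_1 = 900$ of which contain the high-frequency word $W_1$, and $N_2 = 50$ of which contain a non-high-frequency word $W_2$. The product rule used in the analysis, which treats the two occurrences as independent, then gives:

```latex
\begin{aligned}
F_{w_1} &= N_1 / N_u = 900 / 1000 = 0.9 \\
F_{w_2} &= N_2 / N_u = 50 / 1000 = 0.05 \\
F_{w_1 w_2} &= F_{w_1} \cdot F_{w_2} = 0.9 \times 0.05 = 0.045
\end{aligned}
```

Under these assumed numbers the combined keyword would be expected in about 45 of the 1000 documents, compared with 900 for $W_1$ alone, which is the sense in which its document linked list shrinks.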
Fig. 1 is a block diagram of an indexing device according to an embodiment of the present invention.
As shown in Fig. 1, the present invention proposes an indexing device 100 comprising: a high-frequency word processing module 102 which, when the current word in a document is a high-frequency word, expands the current word according to the front-side word and/or rear-side word adjacent to it; and an index building module 104 which builds an index from the new words obtained by the expansion and the document. In this technical scheme, expanding the high-frequency words among the document's keywords reduces the number of high-frequency keywords and avoids the excessive retrieval volume and long retrieval time caused by building an index from a large number of high-frequency words.
In the above technical scheme, when the front-side word and/or rear-side word is also a high-frequency word, the high-frequency word processing module 102 combines the front-side word and/or rear-side word with the current word to form the new word. For example, in "my motherland" (segmented as "I / of / motherland"), the particle "of" is a high-frequency word; when it is expanded, the front-side keyword "I" is likewise a high-frequency word, so "of" is combined with "I" to give "I of" (that is, "my"), which is used as a new keyword for building the index.
In the above technical scheme, when the front-side word and/or rear-side word is a non-high-frequency word, the high-frequency word processing module 102 combines at least the last character of the front-side word with the current word, and/or combines at least the first character of the rear-side word with the current word, to form the new word. For example, in "mouse pad on the desk" (segmented as "on the desk / of / mouse pad"), take the particle "of" as the high-frequency word; the front-side keyword "on the desk" and the rear-side keyword "mouse pad" are both non-high-frequency words. Combining with the front-side word means taking at least the last character of the front-side keyword, so the expansion covers at least "on" and may cover more of "on the desk". Combining with the rear-side word means taking at least the first character of the rear-side keyword, so the expansion covers at least the first character of "mouse pad" and may cover more, such as "mouse". How many characters are used can be set flexibly as required; the new keywords obtained after expansion are then used to build the index.
Fig. 2 is a flowchart of an indexing method according to an embodiment of the present invention.
As shown in Fig. 2, the present invention also proposes an indexing method comprising: step 202, in which, when the current word in a document is a high-frequency word, the high-frequency word processing module expands the current word according to the front-side word and/or rear-side word adjacent to it; and step 204, in which the index building module builds an index from the new words obtained by the expansion and the document. In this technical scheme, expanding the high-frequency words among the document's keywords reduces the number of high-frequency keywords and avoids the excessive retrieval volume and long retrieval time caused by building an index from a large number of high-frequency words.
In this technical scheme, step 202 specifically comprises: when the front-side word and/or rear-side word is also a high-frequency word, the high-frequency word processing module combines the front-side word and/or rear-side word with the current word to form the new word. For example, in "my motherland" (segmented as "I / of / motherland"), the particle "of" is a high-frequency word; when it is expanded, the front-side keyword "I" is likewise a high-frequency word, so "of" is combined with "I" to give "I of" (that is, "my"), which is used as a new keyword for building the index.
In this technical scheme, step 202 specifically comprises: when the front-side word and/or rear-side word is a non-high-frequency word, the high-frequency word processing module combines at least the last character of the front-side word with the current word, and/or combines at least the first character of the rear-side word with the current word, to form the new word. For example, in "mouse pad on the desk" (segmented as "on the desk / of / mouse pad"), take the particle "of" as the high-frequency word; the front-side keyword "on the desk" and the rear-side keyword "mouse pad" are both non-high-frequency words. Combining with the front-side word means taking at least the last character of the front-side keyword, so the expansion covers at least "on" and may cover more of "on the desk". Combining with the rear-side word means taking at least the first character of the rear-side keyword, so the expansion covers at least the first character of "mouse pad" and may cover more, such as "mouse". How many characters are used can be set flexibly as required; the new keywords obtained after expansion are then used to build the index.
Fig. 3 is a block diagram of a search device according to an embodiment of the present invention.
As shown in Fig. 3, the present invention also proposes a search device 300 comprising: a high-frequency word processing module 302 which, when the current word in a query string is a high-frequency word, expands the current word according to the front-side word and/or rear-side word adjacent to it; and a retrieval module 304 which searches a pre-built index using the new words obtained by the expansion. In this technical scheme, expanding the high-frequency words among the query string's keywords reduces the number of high-frequency keywords and avoids the excessive retrieval volume and long retrieval time caused by searching the index with a large number of high-frequency words.
In the above technical scheme, the search device further comprises the indexing device 100 described above, which pre-builds the index. Combining the search with an index obtained through the above scheme further optimizes retrieval.
In the above technical scheme, the high-frequency word processing module 302 also adds marks on both sides of each new word; the retrieval module 304 obtains the new words according to the marks, counts the number of times the new words occur in order in a document, uses that count to calculate the document's relevance, and selects documents as search results according to the relevance obtained. Using exact substring queries in this way ensures retrieval accuracy.
Fig. 4 is a flowchart of a search method according to an embodiment of the present invention.
As shown in Fig. 4, the present invention also proposes a search method comprising: step 402, in which, when the current word in a query string is a high-frequency word, the high-frequency word processing module expands the current word according to the front-side word and/or rear-side word adjacent to it; and step 404, in which the retrieval module searches a pre-built index using the new words obtained by the expansion. In this technical scheme, expanding the high-frequency words among the query string's keywords reduces the number of high-frequency keywords and avoids the excessive retrieval volume and long retrieval time caused by searching the index with a large number of high-frequency words.
In the above technical scheme, before step 404 the method further comprises pre-building the index by the indexing method described above. Combining the search with an index obtained through the above scheme further optimizes retrieval.
In the above technical scheme, step 402 further comprises: the high-frequency word processing module adds marks on both sides of each new word. Step 404 specifically comprises: the retrieval module obtains the new words according to the marks, counts the number of times the new words occur in order in a document, uses that count to calculate the document's relevance, and selects documents as search results according to the relevance obtained. Using exact substring queries in this way ensures retrieval accuracy.
Fig. 5 is a block diagram of a search system according to an embodiment of the present invention.
As shown in Fig. 5, the present invention also proposes a search system 500 comprising the indexing device 100 described above and the search device 300 described above, where the search device 300 uses the new words it generates to search the index built by the indexing device 100. In this technical scheme, an index built by expanding high-frequency words into new keywords is paired with a retrieval process that uses the same expansion, forming a complete information retrieval system. Running on existing computer hardware, the system uses hardware resources effectively while maintaining relevance accuracy, and improves the user experience.
Below describe technical scheme of the present invention in detail.
When users search with a retrieval string, a high-frequency word is generally searched in combination with other words, because searching for a high-frequency word alone is meaningless. For example, a user who wants documents containing the string "my China" (literally "I", followed by the possessive particle, rendered here as "of", followed by "China") searches with the retrieval string "my China"; the user does not search for the single word "of" and then scan the results by eye for documents containing "my China". A high-frequency word in a retrieval string is closely tied to the keywords immediately before and after it. If a retrieval string contains a high-frequency word, the user certainly wants results that exactly match its position relative to those two neighboring words; if no positional match were needed, the high-frequency word could simply be removed. If the user searches for "my China", the user wants documents that contain the four characters of "my China" as a string; if the user searches for "I China" without the particle, the user wants documents that contain the two words "I" and "China". The purpose of a retrieval string containing a high-frequency word is therefore to join the results for the words on its left and right and to fix the positional relationship of those two keywords in the document (they must appear immediately to the left and right of the high-frequency word). In view of this, in the technical scheme of the embodiments of the invention, a high-frequency word is expanded during indexing by joining it with the words before and after it in the document to form new keywords.
Indexing the combinations of a high-frequency word with its two adjacent whole words in the document would make retrieval faster, but the technical scheme proposed in the embodiments of the invention instead indexes the combinations of a high-frequency word with single characters taken from its two adjacent words. The reason is that the number of keywords is very large and new words appear every day, so the number of new keywords formed from a high-frequency word and a whole word is also very large. If there are N high-frequency words and M words in total (non-high-frequency and high-frequency), the number of combined keywords can reach N*M. For retrieval speed, the keyword list of an index is generally kept in memory, so the size of memory limits the size of the keyword list; after combination the keyword list would grow by a factor of N, and memory would probably be unable to hold it. By contrast, when a high-frequency word is combined with a single character, the number of distinct single characters is limited: the number of keywords formed from a high-frequency word and a single character is at most twice the number of distinct characters (one combination for each side), which fits in memory.
The high-frequency word combination process is shown in Figure 6. At least one of the two keywords to be combined must be a high-frequency word; two non-high-frequency words need not be combined. The process is as follows:
Step 602: confirm that at least one of the adjacent words W1 and W2 is a high-frequency word.
Step 604: determine whether keyword W1 and keyword W2 are both high-frequency words. If so, go to step 606; if not, go to step 608.
Step 606: join keyword W1 and keyword W2 into a new keyword; the combination process ends.
Step 608: determine whether W1 is a high-frequency word. If so, go to step 610; if not, go to step 612.
Step 610: combine W1 with the first Chinese character or character of W2 to form a new keyword (if the first character of W2 is a Chinese character, combine with that Chinese character; if it is a non-Chinese character, combine with that single character); the combination process ends.
Step 612: combine W2 with the Chinese character or character of W1 that adjoins it, that is, the last character of W1, to form a new keyword (if that character is a Chinese character, combine with the Chinese character; if it is a non-Chinese character, combine with that single character); the combination process ends.
For example, as shown in Figure 7, the character string "a1a2a3a4a5b1b2b3b4b5c1c2c3c4c5" is segmented into word W1 "a1a2a3a4a5", word W2 "b1b2b3b4b5" and word W3 "c1c2c3c4c5". If W2 is a high-frequency word, W2 forms one new keyword with W1 and another with W3. If W1 is a high-frequency word, the new keyword combining W1 and W2 is "a1a2a3a4a5b1b2b3b4b5"; if W1 is a non-high-frequency word, the new keyword is "a5b1b2b3b4b5", where a5 occupies 2 bytes if it is a Chinese character (a Chinese character in GBK encoding occupies two bytes) and 1 byte if it is not. If W3 is a high-frequency word, the new keyword combining W2 and W3 is "b1b2b3b4b5c1c2c3c4c5"; if W3 is a non-high-frequency word, the new keyword is "b1b2b3b4b5c1", where c1 likewise occupies 2 bytes if it is a Chinese character and 1 byte if it is not.
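The Figure 6 decision flow can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the function name `combine_pair` and the `high_freq` set are hypothetical, words are assumed to be non-empty Python strings (so one Chinese character is one string element, independent of the GBK byte width), and a non-high-frequency front word is assumed to contribute its last character, as in the "a5" case of the Figure 7 example.

```python
from typing import Optional, Set


def combine_pair(w1: str, w2: str, high_freq: Set[str]) -> Optional[str]:
    """Combine two adjacent words following the Figure 6 flow.

    Returns None when neither word is high-frequency (nothing to combine).
    """
    hf1, hf2 = w1 in high_freq, w2 in high_freq
    if not (hf1 or hf2):
        return None            # step 602 fails: no high-frequency word
    if hf1 and hf2:
        return w1 + w2         # step 606: join both whole words
    if hf1:
        return w1 + w2[0]      # step 610: W1 plus first character of W2
    return w1[-1] + w2         # step 612: last character of W1 plus W2
```

With the hypothetical words W1 = "abcde" and W2 = "fghij", where only W2 is in `high_freq`, the call `combine_pair("abcde", "fghij", {"fghij"})` gives "efghij", which mirrors the "a5b1b2b3b4b5" case.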
Unlike a traditional high-frequency word index, the technical scheme of this embodiment also indexes the combined words formed from high-frequency words, so both the indexing process and the retrieval process differ from the traditional approach. The indexing process is shown in Figure 8:
Step 802: segment the new document data into words. After segmentation, the high-frequency words in the document data are standalone words that have not yet been combined.
Step 804: combine the high-frequency words in the document in the manner shown in Figure 7, then compute the word frequency and position information of the document's keywords after combination.
Step 806: add the document information to the inverted index database keyword by keyword, until the document has been fully loaded.
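Step 806 can be illustrated with a small in-memory inverted index. The sketch below assumes the document has already been segmented and combined into (keyword, position) pairs; the names `new_index`, `add_document` and `term_frequency`, the nested-dictionary layout and the sample keywords used with it are all hypothetical.

```python
from collections import defaultdict


def new_index():
    """Create an empty inverted index: keyword -> doc_id -> list of positions."""
    return defaultdict(lambda: defaultdict(list))


def add_document(index, doc_id, forward_list):
    """Add one document, given as (keyword, position) pairs, keyword by keyword."""
    for keyword, position in forward_list:
        index[keyword][doc_id].append(position)


def term_frequency(index, keyword, doc_id):
    """Word frequency of a keyword in a document, derived from its position list."""
    postings = index.get(keyword)  # .get avoids creating an entry for unknown keywords
    return len(postings.get(doc_id, [])) if postings else 0
```

Keeping the positions in the posting makes the word frequency a by-product of the position list, so the two statistics named in step 804 are stored together.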
The corresponding retrieval process is shown in Figure 9:
Step 902: receive the retrieval string containing high-frequency words that the user inputs.
Step 904: segment the retrieval string into words.
Step 906: analyze the segmented data. The user's retrieval string contains high-frequency words, possibly a standalone high-frequency word (the special case in which the user queries only a high-frequency word is not excluded); combine each high-frequency word in the retrieval string with the words before and after it. The combined keywords may overlap one another, so a high-frequency word and its combinations with the front and rear words do not occupy successive positions.
Step 908: search the index database with the keywords obtained from the segmented retrieval string, and use the position information to eliminate overlapping positions.
Step 910: compute relevance and output the TopN most relevant results.
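The TopN selection of step 910 might look like the following sketch. It assumes a relevance score per document has already been computed (for instance from the ordered occurrence count); the function name `top_n` and the tie-breaking rule (smaller document id first) are assumptions added for illustration.

```python
import heapq


def top_n(scores, n):
    """Return up to n (doc_id, score) pairs, highest score first.

    Ties are broken by the smaller doc_id so the output is deterministic.
    """
    return heapq.nsmallest(n, scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Using a heap-based selection avoids sorting every candidate document when only the first N results are shown to the user.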
For example, when querying with a retrieval string containing a high-frequency word, each new keyword produced by combination is marked to show that it is a new keyword formed from two words. Generally, "#" is appended to the combined keyword as the mark. When the character string "a1a2a3a4a5b1b2b3b4b5c1c2c3c4c5" of Figure 7 is used as the query string, it is segmented into three keywords W1, W2 and W3. Assuming W2 is a high-frequency word, the new keywords after combination are:
"a1a2a3a4a5b1b2b3b4b5#" if W1 is a high-frequency word,
"a5b1b2b3b4b5#" if W1 is a non-high-frequency word,
"b1b2b3b4b5c1c2c3c4c5#" if W3 is a high-frequency word,
"b1b2b3b4b5c1#" if W3 is a non-high-frequency word.
Every new word is followed by the mark symbol "#" to distinguish it from ordinary non-high-frequency keywords.
The technical scheme of the invention is further described below.
At present, mainstream search engines rely mainly on three data structures, the lexicon file, the inverted list file and the position list file, to implement their retrieval logic, as shown in Figure 10. The lexicon file records each word and the offset of the word's posting list within the inverted list file. The inverted list file records the posting list data of all words. The position list file records the positions at which all words occur in documents. Because high-frequency words occur frequently across the document set (some words appear in 70% of documents) and also occur very often within a single document, the inverted list and the position list of such a word are both very long. The embodiments of the invention propose that, during index building, a high-frequency word is combined in a defined manner with the words appearing before and after it to form high-frequency cascade words, and the position of a high-frequency cascade word is set to the position of the original word. At retrieval time, the retrieval request string submitted by the user undergoes the same processing, and the query for the high-frequency word is replaced by a query for the high-frequency cascade words. Because a high-frequency cascade word occurs far less often than the original high-frequency word, both across the document collection and within a single document, the scale of the AND operations and position calculations is greatly reduced, which effectively improves the speed of string retrieval without losing query correctness.
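The relationship between the three structures can be sketched with in-memory lists standing in for the files. The layout below is an assumption made for illustration: the text only states that the lexicon records the offset of each word's posting list, so the `(offset, length)` pairs, the posting tuple format and the class name `TinyIndex` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TinyIndex:
    # word -> (offset, length) of its postings inside `postings` (the lexicon)
    lexicon: dict = field(default_factory=dict)
    # flat list of (doc_id, position_offset, position_count) (the inverted list)
    postings: list = field(default_factory=list)
    # flat list of word positions (the position list)
    positions: list = field(default_factory=list)

    def add_word(self, word, doc_positions):
        """Append one word's data; doc_positions maps doc_id -> list of positions."""
        start = len(self.postings)
        for doc_id in sorted(doc_positions):
            pos = doc_positions[doc_id]
            self.postings.append((doc_id, len(self.positions), len(pos)))
            self.positions.extend(pos)
        self.lexicon[word] = (start, len(self.postings) - start)

    def lookup(self, word):
        """Follow lexicon -> inverted list -> position list for one word."""
        start, length = self.lexicon.get(word, (0, 0))
        return {
            doc_id: self.positions[off:off + cnt]
            for doc_id, off, cnt in self.postings[start:start + length]
        }
```

In this layout a word that occurs in many documents, and many times per document, produces long runs in both `postings` and `positions`, which is the cost the cascade words are meant to reduce.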
The overall flow for building the index is shown in Figure 11 and comprises:
Step 1102: determine whether there is a document to be indexed.
Step 1104: read the document to be indexed.
Step 1106: perform word segmentation and position marking on the document.
Step 1108: perform cascade processing on the high-frequency words.
Step 1110: add the generated index to the index database.
The processing logic for high-frequency cascade words is as follows. The input text is first segmented and position-marked to generate a forward index; the remaining steps are:
Step 1202: read the words of the forward index one by one.
Step 1204: determine whether any word remains. If so, go to step 1206; otherwise end the operation.
Step 1206: check each word of the forward index against a high-frequency word list generated in advance to determine whether it is a high-frequency word. If it is a high-frequency word, go to step 1208; otherwise return to step 1204.
Step 1208: examine the word before this word. If a front word exists, go to step 1210; if not, go to step 1214.
Step 1210: determine whether the front word is a high-frequency word. If so, go to step 1214; otherwise go to step 1212.
Step 1212: combine the last Chinese character of the front word (for an English word, its last letter) with this word to form a new word.
Step 1214: examine the word after this word. If a rear word exists, go to step 1216; otherwise return to step 1204.
Step 1216: determine whether the rear word is a high-frequency word. If so, go to step 1218; otherwise go to step 1220.
Step 1218: combine this word with the rear word to generate a high-frequency cascade word, and record the position of the high-frequency cascade word as the position of the current word.
Step 1220: if the rear word exists and is a non-high-frequency word, combine this word with the first Chinese character of the rear word (for an English word, its first letter).
Step 1222: add the cascade mark symbol to the new word to generate a high-frequency cascade word.
Step 1224: record the position of the high-frequency cascade word as the current position, and insert it into the forward index.
The cascade mark symbol can be any symbol that does not take part in indexing and retrieval. For ease of presentation, this system uses "#" as the cascade mark symbol. Because the cascade mark symbol does not take part in indexing and retrieval, the normal segmentation results produced by the word segmentation module cannot contain a high-frequency cascade word, so no conflict arises.
For example, during index building, the document "my university is very beautiful" is segmented and position-marked as:
(I, 1)/(of, 2)/(university, 3)/(very, 4)/(beautiful, 5)
Here "of" stands for the possessive particle that follows "I" in the original text. Looking up the high-frequency word list shows that "I" and "of" are high-frequency words. After the high-frequency cascade word processing logic is applied, the forward document to be indexed is as follows, where "large" is the first character of the word for "university":
(I, 1)/(#I of#, 1)/(of, 2)/(#of large#, 2)/(university, 3)/(very, 4)/(beautiful, 5)
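The cascade processing over the forward index can be sketched as below. This is a simplified reading of the steps above, not the authors' code: the function name `cascade` is hypothetical, marks are placed on both sides of each cascade word as in the example, a non-high-frequency front word is assumed to contribute its last character, and English stand-in tokens replace the original Chinese words, so the first character of "university" is "u" rather than the character glossed as "large".

```python
MARK = "#"


def cascade(forward, high_freq):
    """Insert marked cascade words into a forward index.

    forward: list of (word, position) pairs in document order.
    Each cascade word takes the position of the current high-frequency word.
    """
    out = []
    for i, (word, pos) in enumerate(forward):
        out.append((word, pos))
        if word not in high_freq:
            continue
        if i > 0:
            front = forward[i - 1][0]
            # A high-frequency front word already produced this pair
            # when it was the current word, so only handle the other case.
            if front not in high_freq:
                out.append((MARK + front[-1] + word + MARK, pos))
        if i + 1 < len(forward):
            rear = forward[i + 1][0]
            tail = rear if rear in high_freq else rear[0]
            out.append((MARK + word + tail + MARK, pos))
    return out
```

Applied to the stand-in tokens `[("I", 1), ("of", 2), ("university", 3), ("very", 4), ("beautiful", 5)]` with `{"I", "of"}` as the high-frequency list, the function inserts `("#Iof#", 1)` after "I" and `("#ofu#", 2)` after "of", matching the structure of the processed forward document.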
Inverted entries are built for the above forward document according to the normal processing logic. (#I of#, 1) and (#of large#, 2) are the high-frequency cascade word index entries. The string query process applies the same processing logic. For example, the string query request "my university" is segmented and position-marked as:
(I, 1)/(of, 2)/(university, 3)
Applying the high-frequency cascade word processing logic yields the final set of retrieval terms:
(I, 1)/(#I of#, 1)/(of, 2)/(#of large#, 2)/(university, 3)
In the retrieval phase, the string retrieval logic only needs to query the set:
(#I of#, 1)/(#of large#, 2)/(university, 3)
To retrieve, the inverted lists of "#I of#", "#of large#" and "university" are read first and an AND retrieval is performed. For each document that contains all three words, the position lists of the three words are read and a suitable method is used to determine whether the words appear in order. Only when the three words appear in order is the occurrence counted, and the count is used as a factor in evaluating relevance. Clearly, "#I of#" and "#of large#" occur far less often than "I" and "of", and their inverted lists and position lists are far shorter than those of the latter, which greatly reduces the amount of computation and improves retrieval speed.
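One possible form of the AND step and the ordered-occurrence count is sketched below. The text leaves the ordering test open ("a suitable method"), so the approach here, shifting each term's positions back to a common start and intersecting the sets, is an assumption, as are the function name `ordered_occurrences` and the nested-dictionary index layout.

```python
def ordered_occurrences(index, query_terms):
    """Count in-order occurrences of the query terms per document.

    index: {term: {doc_id: [positions]}}
    query_terms: [(term, query_position)] as produced for the retrieval string.
    Returns {doc_id: count} for documents with at least one ordered match.
    """
    if not query_terms:
        return {}
    postings = [index.get(term, {}) for term, _ in query_terms]
    # AND step: start from the shortest posting list and intersect doc ids.
    candidates = set(min(postings, key=len))
    for p in postings:
        candidates.intersection_update(p)
    base = query_terms[0][1]
    result = {}
    for doc in candidates:
        starts = None
        for (term, qpos), p in zip(query_terms, postings):
            # Shift each position back to where the first term would have to be.
            shifted = {pos - (qpos - base) for pos in p[doc]}
            starts = shifted if starts is None else starts & shifted
        if starts:
            result[doc] = len(starts)
    return result
```

Because the cascade words have short posting and position lists, both the document intersection and the per-document set intersections operate on small inputs.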
Adding high-frequency cascade words increases the size of the search engine's vocabulary, and also increases the size of the inverted list file and the position file. For query efficiency, a search engine usually loads the lexicon file into memory while running. In theory, the high-frequency cascade word generation module may produce 2*n*n+2*n*m new words, where n is the number of high-frequency words and m is the number of Chinese characters and English letters; the factor of 2 accounts for the two cascade directions. However, because documents usually follow grammatical and pragmatic rules, the number of new words actually produced during text indexing is far smaller than the theoretical value. By controlling the values of n and m, the number of newly generated words can be kept within the range that the memory of current run-time retrieval servers can accept. The inverted list file and the position file are both stored on disk, so their growth causes no unacceptable performance loss. The high-frequency cascade word processing strategy thus trades space for time, significantly improving the efficiency of string lookup and frequency computation in retrieval and increasing retrieval speed.
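The theoretical bound can be checked with a short calculation. The values of n and m below are hypothetical, chosen only to show the order of magnitude; the function name `max_cascade_words` is likewise an assumption.

```python
def max_cascade_words(n, m):
    """Upper bound on new cascade words: 2*n*n + 2*n*m.

    n: number of high-frequency words
    m: number of distinct Chinese characters and English letters
    """
    return 2 * n * n + 2 * n * m


# Hypothetical sizes: 100 high-frequency words, 7000 distinct characters.
bound = max_cascade_words(100, 7000)  # 20,000 + 1,400,000 = 1,420,000
```

The second term dominates when m is much larger than n, so under these assumed sizes the bound grows roughly linearly with the length of the high-frequency word list.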
In summary, the technical scheme of the invention provides an indexing device, an indexing method, a search device, a search method and a search system. During document indexing and retrieval in a search engine, a high-frequency word is combined in a defined manner with the words before and after it to form high-frequency cascade words, which are indexed; in the retrieval phase, the high-frequency cascade words take the place of the original high-frequency word. Because the inverted list and position list of a high-frequency cascade word are much shorter than those of the original high-frequency word, the computation required for string lookup and string frequency counting during retrieval is greatly reduced, and retrieval efficiency improves significantly while retrieval accuracy is preserved. Working within the existing computer hardware environment and the traditional index structure, the invention addresses the heavy computation and low speed of string retrieval involving high-frequency words, trading limited space for faster queries and improving the user experience.
The technical scheme of the invention has been described above with reference to the accompanying drawings. High-frequency words are expanded and combined into new keywords, and these combined keywords are used to build the index database and to retrieve, so that, under the existing computer hardware environment and on the premise of guaranteeing relevance accuracy, limited space resources yield an effective improvement in retrieval efficiency and a better user experience.
The above are only preferred embodiments of the invention and are not intended to limit the invention; for those skilled in the art, the invention may have various modifications and variations. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within the protection scope of the invention.

Claims (13)

1. An indexing device, characterized by comprising:
a high-frequency word processing module which, when a current word in a document is a high-frequency word, expands the current word according to the front word and/or rear word adjacent to the current word; and
an index building module which builds an index according to the new word obtained by the expansion and the document.
2. The indexing device according to claim 1, characterized in that, when the front word and/or the rear word is also a high-frequency word, the high-frequency word processing module combines the front word and/or the rear word with the current word to form the new word.
3. The indexing device according to claim 1, characterized in that, when the front word and/or the rear word is a non-high-frequency word, the high-frequency word processing module combines at least one last Chinese character or character of the front word with the current word, and/or combines at least one first Chinese character or character of the rear word with the current word, to form the new word.
4. An indexing method, characterized by comprising:
step 202, when a current word in a document is a high-frequency word, a high-frequency word processing module expands the current word according to the front word and/or rear word adjacent to the current word;
step 204, an index building module builds an index according to the new word obtained by the expansion and the document.
5. The indexing method according to claim 4, characterized in that step 202 specifically comprises:
when the front word and/or the rear word is also a high-frequency word, the high-frequency word processing module combines the front word and/or the rear word with the current word to form the new word.
6. The indexing method according to claim 4, characterized in that step 202 specifically comprises:
when the front word and/or the rear word is a non-high-frequency word, the high-frequency word processing module combines at least one last Chinese character or character of the front word with the current word, and/or combines at least one first Chinese character or character of the rear word with the current word, to form the new word.
7. A search device, characterized by comprising:
a high-frequency word processing module which, when a current word in a retrieval string is a high-frequency word, expands the current word according to the front word and/or rear word adjacent to the current word; and
a retrieval module which uses the new word obtained by the expansion to retrieve in a pre-established index.
8. The search device according to claim 7, characterized by further comprising:
the indexing device according to any one of claims 1 to 4, which pre-establishes the index.
9. The search device according to claim 8, characterized in that the high-frequency word processing module also adds marks on both sides of the new word;
the retrieval module obtains the new words according to the marks, counts the number of times the new words appear in order in the document, uses the count to compute relevance for the document, and selects documents as retrieval results according to the relevance obtained.
10. A search method, characterized by comprising:
step 402, when a current word in a retrieval string is a high-frequency word, a high-frequency word processing module expands the current word according to the front word and/or rear word adjacent to the current word;
step 404, a retrieval module retrieves in a pre-established index according to the new word obtained by the expansion.
11. The search method according to claim 10, characterized in that, before step 404, the method further comprises:
pre-establishing the index by the indexing method according to any one of claims 4 to 6.
12. The search method according to claim 11, characterized in that step 402 further comprises:
the high-frequency word processing module adds marks on both sides of the new word;
and step 404 specifically comprises: the retrieval module obtains the new words according to the marks, counts the number of times the new words appear in order in the document, uses the count to compute relevance for the document, and selects documents as retrieval results according to the relevance obtained.
13. A search system, characterized by comprising:
the indexing device according to any one of claims 1 to 3; and
the search device according to any one of claims 7 to 9, wherein the search device uses the new words it generates to retrieve in the index built by the indexing device.
CN2011103195489A 2011-10-20 2011-10-20 Indexing equipment, indexing method, search device, search method and search system Pending CN103064840A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011103195489A CN103064840A (en) 2011-10-20 2011-10-20 Indexing equipment, indexing method, search device, search method and search system

Publications (1)

Publication Number Publication Date
CN103064840A true CN103064840A (en) 2013-04-24

Family

ID=48107470

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011103195489A Pending CN103064840A (en) 2011-10-20 2011-10-20 Indexing equipment, indexing method, search device, search method and search system

Country Status (1)

Country Link
CN (1) CN103064840A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104572678A (en) * 2013-10-16 2015-04-29 北大方正集团有限公司 Index establishment method and device
CN104834736A (en) * 2015-05-19 2015-08-12 深圳证券信息有限公司 Method and device for establishing index database and retrieval method, device and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070136274A1 (en) * 2005-12-02 2007-06-14 Daisuke Takuma System of effectively searching text for keyword, and method thereof
CN101055580A (en) * 2006-04-13 2007-10-17 Lg电子株式会社 System, method and user interface for retrieving documents
CN101088082A (en) * 2004-10-25 2007-12-12 英孚威尔公司 Full text query and search systems and methods of use
CN101963965A (en) * 2009-07-23 2011-02-02 阿里巴巴集团控股有限公司 Document indexing method, data query method and server based on search engine

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20130424