CN103064840A

CN103064840A - Indexing equipment, indexing method, search device, search method and search system

Info

Publication number: CN103064840A
Application number: CN2011103195489A
Authority: CN
Inventors: 许欢庆; 吴尉林; 夏亮; 郭永福; 陈沛
Original assignee: Beijing Zhongsou Network Technology Co ltd
Current assignee: Beijing Zhongsou Network Technology Co ltd
Priority date: 2011-10-20
Filing date: 2011-10-20
Publication date: 2013-04-24

Abstract

The invention provides indexing equipment which comprises a high-frequency word processing module and an indexing establishment module, wherein when a present word in a document is a high-frequency word, the high-frequency word processing module is used for expanding the present word based on a front word and/or a back word adjacent to the present word and the indexing establishment module is used for establishing an index based on a new word obtained by being expanded and the document. Due to the fact that the technical scheme includes that the high-frequency words in keywords in the document are expanded and processed, the indexing equipment has the advantages of reducing the number of the high-frequency words in the keywords and avoiding high search volume and long search time caused by using a large number of the high-frequency words to establish the index. The invention further provides an indexing method, a search device, a search method and a search system.

Description

Indexing device, indexing method, retrieval device, retrieval method and retrieval system

Technical field

The present invention relates to the field of computer technology, and in particular to an indexing device, an indexing method, a retrieval device, a retrieval method and a retrieval system.

Background technology

At present, search engines have become the main entrance to the Internet, and people use them to query and locate Internet information resources. To help users find the information they need quickly and accurately, search engines provide several retrieval modes. Among them, exact-string retrieval (PhraseQuery) and proximity retrieval (ProximityQuery) evaluate the positions, order and frequency with which the query string occurs in a document, and thereby effectively improve the relevance of search results. A user's query usually contains several words (statistics show more than 2.5 words), and the order of the words carries strong semantic relevance. For exact-string retrieval, the user requires that every returned document contain the complete retrieval string. For proximity retrieval, the result set gives priority to documents in which the words occur in the same order as in the retrieval string. It follows that whether the user's query string occurs in a document, and how often, are key factors in evaluating document relevance.

Retrieval modes such as exact-string retrieval and proximity retrieval effectively improve relevance, but they require the positions of the keywords in each document to be matched and computed during retrieval, which greatly reduces retrieval speed. At present, a search engine processes an exact-string retrieval request submitted by a user as follows: it first performs an "AND" retrieval on the keywords of the request; for each document in the "AND" result, it then performs position judgment, determining and counting how often the complete retrieval string occurs; finally, it calculates relevance.

Search engine indexes generally use an inverted index structure that maps keywords to document information: each keyword points to a document list containing every document in which it occurs. Words that occur frequently across all documents are called "high-frequency words"; as the name suggests, they are words with a comparatively high frequency of occurrence, such as "", "" and "I". Such keywords not only have a high document frequency but also occur many times in each document in which they appear. Because the document list records the positions at which a keyword occurs in each document, so that relevance can be calculated accurately later, the document lists that such keywords point to in the inverted list are very large.

As a further example, suppose a user queries "my university" ("我的大学"). The word segmentation module of the search engine splits the request into the keyword string "我/的/大学" ("I", the particle "的", "university"), and an AND operation is performed on the inverted index lists of "我", "的" and "大学". For each document containing all three words, their positions in the document are read, and the corresponding statistics and judgments are carried out. Because the search keywords "我" and "的" both occur with high frequency in Chinese text, their inverted index lists are very long; they also occur very often within each document, so their position lists are very long as well. The whole query therefore requires a huge amount of computation, which seriously affects query speed; in extreme cases a query takes on the order of seconds or more.
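The conventional processing logic described above ("AND" retrieval followed by position judgment) can be sketched as follows. This is a minimal illustration, not the implementation of any particular search engine: the function names `build_positional_index` and `phrase_query`, the three toy documents and their segmentation are all assumptions made for the example.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each token to {doc_id: [positions]} (a plain positional inverted index)."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, tok in enumerate(tokens):
            index[tok].setdefault(doc_id, []).append(pos)
    return index


def phrase_query(index, query_tokens):
    """Return {doc_id: phrase_count} for a non-empty list of query tokens."""
    postings = [index.get(tok, {}) for tok in query_tokens]
    # Step 1: "AND" retrieval, keep documents that contain every query token.
    candidates = set(postings[0])
    for posting in postings[1:]:
        candidates &= set(posting)
    # Step 2: position judgment, count the places where the tokens are consecutive.
    hits = {}
    for doc_id in candidates:
        later = [set(posting[doc_id]) for posting in postings[1:]]
        count = sum(
            all(start + i + 1 in positions for i, positions in enumerate(later))
            for start in postings[0][doc_id]
        )
        if count:
            hits[doc_id] = count
    return hits


# Hypothetical, already segmented documents.
docs = {
    1: ["我", "的", "大学"],
    2: ["我", "在", "大学", "的", "图书馆"],
    3: ["他", "的", "大学"],
}
index = build_positional_index(docs)
result = phrase_query(index, ["我", "的", "大学"])
```

In this toy collection, document 2 survives the "AND" step because it contains all three tokens, and is only rejected by the position check. With real high-frequency words, almost every document survives the "AND" step, which is the cost the background section describes.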

In traditional technical schemes, when the retrieval string contains a high-frequency word, that word occurs in almost every document because of its high frequency. For such a retrieval string, relevance therefore has to be computed over essentially the whole document collection. For a search engine holding hundreds of millions of documents, the amount of computation is enormous and a single search is very time-consuming, which harms the user experience.

Therefore, there is a need for a new indexing and retrieval method that overcomes the shortcomings of the prior art, makes effective use of computer hardware resources while maintaining relevance accuracy, improves search efficiency and improves the user experience.

Summary of the invention

The technical problem to be solved by the present invention is to provide a new indexing and retrieval method that overcomes the shortcomings of the prior art, makes effective use of computer hardware resources while maintaining relevance accuracy, improves search efficiency and improves the user experience.

In view of this, the present invention proposes an indexing device, comprising: a high-frequency word processing module which, when a current word in a document is a high-frequency word, expands the current word according to the preceding word and/or the following word adjacent to it; and an index building module which builds an index from the new words obtained by the expansion and the document. In this technical scheme, the high-frequency words among the document keywords are expanded, which reduces the number of high-frequency words among the keywords and avoids the excessive retrieval volume and long retrieval time caused by building an index from a large number of high-frequency words.

In the above technical scheme, preferably, when the preceding word and/or the following word is also a high-frequency word, the high-frequency word processing module combines the preceding word and/or the following word with the current word to form the new word. For example, in "my motherland" ("我的祖国"), "的" is a high-frequency word; when it is expanded, the preceding keyword "我" ("I") is likewise a high-frequency word, so "的" is combined with "我" to form "我的" ("my"), which is used as a new keyword for building the index.

In the above technical scheme, preferably, when the preceding word and/or the following word is a non-high-frequency word, the high-frequency word processing module combines at least the last character of the preceding word with the current word, and/or at least the first character of the following word with the current word, to form the new word. For example, in "mouse pad on the desk" ("桌子上的鼠标垫"), if "的" is taken as the high-frequency word, the preceding "桌子上" ("on the desk") and the following "鼠标垫" ("mouse pad") are both non-high-frequency words. Combining with the preceding non-high-frequency word means taking at least the last character of the preceding keyword, giving at least "上的", although "子上的" or a longer combination is also possible. Combining with the following non-high-frequency word means taking at least the first character of the following keyword, giving at least "的鼠", although "的鼠标" or a longer combination is also possible. The number of characters used for the expansion can be set flexibly as required, and the new keywords obtained after expansion are then used to build the index.

The present invention also proposes an indexing method, comprising: step 202, in which, when a current word in a document is a high-frequency word, the high-frequency word processing module expands the current word according to the preceding word and/or the following word adjacent to it; and step 204, in which the index building module builds an index from the new words obtained by the expansion and the document. In this technical scheme, the high-frequency words among the document keywords are expanded, which reduces the number of high-frequency words among the keywords and avoids the excessive retrieval volume and long retrieval time caused by building an index from a large number of high-frequency words.

In this technical scheme, preferably, step 202 specifically comprises: when the preceding word and/or the following word is also a high-frequency word, the high-frequency word processing module combines the preceding word and/or the following word with the current word to form the new word. For example, in "my motherland" ("我的祖国"), "的" is a high-frequency word; when it is expanded, the preceding keyword "我" ("I") is likewise a high-frequency word, so "的" is combined with "我" to form "我的" ("my"), which is used as a new keyword for building the index.

In this technical scheme, preferably, step 202 specifically comprises: when the preceding word and/or the following word is a non-high-frequency word, the high-frequency word processing module combines at least the last character of the preceding word with the current word, and/or at least the first character of the following word with the current word, to form the new word. For example, in "mouse pad on the desk" ("桌子上的鼠标垫"), if "的" is taken as the high-frequency word, the preceding "桌子上" ("on the desk") and the following "鼠标垫" ("mouse pad") are both non-high-frequency words. Combining with the preceding non-high-frequency word means taking at least the last character of the preceding keyword, giving at least "上的", although "子上的" or a longer combination is also possible. Combining with the following non-high-frequency word means taking at least the first character of the following keyword, giving at least "的鼠", although "的鼠标" or a longer combination is also possible. The number of characters used for the expansion can be set flexibly as required, and the new keywords obtained after expansion are then used to build the index.
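The two expansion rules of step 202 can be sketched as a single pass over adjacent token pairs. This is an illustrative sketch only: the function name `expand_pairs`, the parameter `n_chars` (how many characters to borrow from a non-high-frequency neighbour) and the Chinese tokens, which are an assumed reading of the translated examples, are not taken from the patent text.

```python
def expand_pairs(tokens, high_freq, n_chars=1):
    """Form new words from adjacent pairs in which at least one token is high-frequency.

    A high-frequency neighbour is joined whole; from a non-high-frequency
    neighbour only the n_chars characters nearest to the high-frequency
    word are joined.
    """
    new_words = []
    for left, right in zip(tokens, tokens[1:]):
        left_hf = left in high_freq
        right_hf = right in high_freq
        if not (left_hf or right_hf):
            continue  # two non-high-frequency words are not combined
        left_part = left if left_hf else left[-n_chars:]
        right_part = right if right_hf else right[:n_chars]
        new_words.append(left_part + right_part)
    return new_words


# Hypothetical segmentations of the two examples in the text.
motherland = expand_pairs(["我", "的", "祖国"], {"我", "的"})
desk_short = expand_pairs(["桌子上", "的", "鼠标垫"], {"的"})
desk_long = expand_pairs(["桌子上", "的", "鼠标垫"], {"的"}, n_chars=2)
```

With `n_chars=1` the desk example yields the minimal expansions, and `n_chars=2` yields the longer ones, which mirrors the statement that the number of borrowed characters can be set as required.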

The present invention also proposes a retrieval device, comprising: a high-frequency word processing module which, when a current word in a retrieval string is a high-frequency word, expands the current word according to the preceding word and/or the following word adjacent to it; and a retrieval module which uses the new words obtained by the expansion to search a pre-built index. In this technical scheme, the high-frequency words among the keywords of the retrieval string are expanded, which reduces the number of high-frequency words among the keywords and avoids the excessive retrieval volume and long retrieval time caused by searching the index with a large number of high-frequency words.

In the above technical scheme, preferably, the retrieval device further comprises the indexing device described above, which builds the index in advance. Combining retrieval with an index obtained by the above technical scheme further optimizes retrieval.

In the above technical scheme, preferably, the high-frequency word processing module also adds marks on both sides of the new word; the retrieval module obtains the new word according to the marks, counts the number of times the new words occur in order in a document in order to calculate the relevance of that document, and selects documents as retrieval results according to the relevance obtained. By using exact-string sub-queries in this way, the accuracy of retrieval can be ensured.

The present invention also proposes a retrieval method, comprising: step 402, in which, when a current word in a retrieval string is a high-frequency word, the high-frequency word processing module expands the current word according to the preceding word and/or the following word adjacent to it; and step 404, in which the retrieval module uses the new words obtained by the expansion to search a pre-built index. In this technical scheme, the high-frequency words among the keywords of the retrieval string are expanded, which reduces the number of high-frequency words among the keywords and avoids the excessive retrieval volume and long retrieval time caused by searching the index with a large number of high-frequency words.

In the above technical scheme, preferably, before step 404 the method further comprises: building the index in advance by the indexing method described above. Combining retrieval with an index obtained by the above technical scheme further optimizes retrieval.

In the above technical scheme, preferably, step 402 further comprises: the high-frequency word processing module adds marks on both sides of the new word; and step 404 specifically comprises: the retrieval module obtains the new word according to the marks, counts the number of times the new words occur in order in a document in order to calculate the relevance of that document, and selects documents as retrieval results according to the relevance obtained. By using exact-string sub-queries in this way, the accuracy of retrieval can be ensured.

The present invention also proposes a retrieval system, comprising: the indexing device described above; and the retrieval device described above, which uses the new words it generates to search the index built by the indexing device. In this technical scheme, an index built from new keywords generated by high-frequency word expansion is paired with a retrieval process that uses the same expansion, forming a complete information retrieval system. In operation, the whole system can make effective use of the hardware resources of the computer under the existing hardware environment while maintaining relevance accuracy, and thereby improve the user experience.

Description of drawings

Fig. 1 is a block diagram of an indexing device according to an embodiment of the present invention;

Fig. 2 is a flowchart of an indexing method according to an embodiment of the present invention;

Fig. 3 is a block diagram of a retrieval device according to an embodiment of the present invention;

Fig. 4 is a flowchart of a retrieval method according to an embodiment of the present invention;

Fig. 5 is a block diagram of a retrieval system according to an embodiment of the present invention;

Fig. 6 is a flowchart of high-frequency word processing in an indexing method according to an embodiment of the present invention;

Fig. 7 is a schematic diagram of high-frequency word processing in an indexing method according to an embodiment of the present invention;

Fig. 8 is a schematic flowchart of an indexing method according to an embodiment of the present invention;

Fig. 9 is a schematic flowchart of a retrieval method according to an embodiment of the present invention;

Fig. 10 is a schematic diagram of a data structure used in existing search engines;

Fig. 11 is a schematic flowchart of an indexing method according to an embodiment of the present invention;

Fig. 12 is a schematic flowchart of an indexing method according to an embodiment of the present invention.

Embodiment

In order that the above objects, features and advantages of the present invention may be understood more clearly, the present invention is described in further detail below with reference to the drawings and specific embodiments.

Many specific details are set forth in the following description to provide a thorough understanding of the present invention. However, the present invention may also be implemented in ways other than those described here, and is therefore not limited to the specific embodiments disclosed below.

First, the principle by which the present invention reduces the size of document lists and improves retrieval efficiency is explained.

Users rarely search for a single high-frequency word, and doing so is of little value; a high-frequency word generally appears in a query in combination with other words. On this premise, the method proposed by the present invention combines each high-frequency word in a document with the character before or after it into a non-high-frequency keyword for indexing. At query time, a query string containing a high-frequency word is segmented so that the high-frequency word is combined with its non-high-frequency neighbour in the same way, which reduces the number of non-matching documents that are examined and improves computational efficiency.

Combining high-frequency words reduces the size of the document list that a keyword points to in the inverted index as follows (in this application, the words obtained after word segmentation of a document or a retrieval string are called keywords). Suppose there is a document collection U containing N_u documents, and that N_1 documents in the collection contain the high-frequency word W_1 (0 <= N_1 <= N_u). The probability F_{w_1} that the high-frequency word W_1 occurs is then:

F_{w_1} = N_1 / N_u

Let another keyword W_2 (whether it is a high-frequency word or not) occur in N_2 documents (0 <= N_2 <= N_u). The probability F_{w_2} that the keyword W_2 occurs is:

F_{w_2} = N_2 / N_u

If W_1 and W_2 are now combined into a single keyword W_1W_2 (or W_2W_1), the probability that the combined keyword occurs in a document is the probability of finding the keyword W_2 among the documents in which the keyword W_1 occurs. This probability F_{w_1 w_2} is:

F_{w_1 w_2} = F_{w_1} * F_{w_2} = (N_1 * N_2) / N_u^2

If W_2 is a non-high-frequency word, N_2 certainly does not equal N_u; that is, not every document contains the keyword W_2. The frequency F_{w_2} is therefore certainly less than 1, so F_{w_1 w_2} is certainly less than F_{w_1}; that is, the document list that the combined string W_1W_2 points to in the inverted index is smaller.

If W_2 is a high-frequency word, N_2 may equal N_u; that is, every document may contain the keyword W_2. If the number of documents N_1 in which the high-frequency word W_1 occurs also equals N_u, then, ignoring position, the probability that W_1W_2 occurs is 1. When position is taken into account, however, W_1 and W_2 must occur next to each other, so the probability that W_1W_2 occurs is certainly lower than the probability that W_1 occurs.

From the above analysis it follows that, when the search keywords include a high-frequency word W_1, combining it with the word W_2 before or after it reduces the document list pointed to by the inverted index entry of the new keyword W_1W_2 (or W_2W_1), regardless of whether W_2 is a high-frequency word.
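A small worked example may make the estimate concrete. The sketch below evaluates F_{w_1}, F_{w_2} and F_{w_1 w_2} for made-up document counts; the numbers and the function names are hypothetical, and the product formula is used exactly as stated in the text, which in effect treats the occurrences of W_1 and W_2 as independent.

```python
def doc_frequency(n_docs_with_word, n_docs_total):
    """F_w = N_w / N_u: fraction of documents that contain the word."""
    return n_docs_with_word / n_docs_total


def combined_frequency(n_1, n_2, n_u):
    """F_{w1 w2} = F_{w1} * F_{w2} = (N_1 * N_2) / N_u^2, as in the text."""
    return doc_frequency(n_1, n_u) * doc_frequency(n_2, n_u)


# Hypothetical counts: a collection of one million documents.
n_u = 1_000_000
n_1 = 900_000  # documents containing the high-frequency word W_1
n_2 = 20_000   # documents containing the non-high-frequency word W_2

f_w1 = doc_frequency(n_1, n_u)
f_w1w2 = combined_frequency(n_1, n_2, n_u)

# Expected document-list lengths before and after combination.
list_len_w1 = f_w1 * n_u
list_len_w1w2 = f_w1w2 * n_u
```

Under these assumed counts the list for W_1 alone covers about 900,000 documents, while the estimate for the combined keyword W_1W_2 is about 18,000, a reduction by the factor F_{w_2}.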

Fig. 1 is a block diagram of an indexing device according to an embodiment of the present invention.

As shown in Fig. 1, the present invention proposes an indexing device 100, comprising: a high-frequency word processing module 102 which, when a current word in a document is a high-frequency word, expands the current word according to the preceding word and/or the following word adjacent to it; and an index building module 104 which builds an index from the new words obtained by the expansion and the document. In this technical scheme, the high-frequency words among the document keywords are expanded, which reduces the number of high-frequency words among the keywords and avoids the excessive retrieval volume and long retrieval time caused by building an index from a large number of high-frequency words.

In the above technical scheme, when the preceding word and/or the following word is also a high-frequency word, the high-frequency word processing module 102 combines the preceding word and/or the following word with the current word to form the new word. For example, in "my motherland" ("我的祖国"), "的" is a high-frequency word; when it is expanded, the preceding keyword "我" ("I") is likewise a high-frequency word, so "的" is combined with "我" to form "我的" ("my"), which is used as a new keyword for building the index.

In the above technical scheme, when the preceding word and/or the following word is a non-high-frequency word, the high-frequency word processing module 102 combines at least the last character of the preceding word with the current word, and/or at least the first character of the following word with the current word, to form the new word. For example, in "mouse pad on the desk" ("桌子上的鼠标垫"), if "的" is taken as the high-frequency word, the preceding "桌子上" ("on the desk") and the following "鼠标垫" ("mouse pad") are both non-high-frequency words. Combining with the preceding non-high-frequency word means taking at least the last character of the preceding keyword, giving at least "上的", although "子上的" or a longer combination is also possible. Combining with the following non-high-frequency word means taking at least the first character of the following keyword, giving at least "的鼠", although "的鼠标" or a longer combination is also possible. The number of characters used for the expansion can be set flexibly as required, and the new keywords obtained after expansion are then used to build the index.

Fig. 2 is a flowchart of an indexing method according to an embodiment of the present invention.

As shown in Fig. 2, the present invention also proposes an indexing method, comprising: step 202, in which, when a current word in a document is a high-frequency word, the high-frequency word processing module expands the current word according to the preceding word and/or the following word adjacent to it; and step 204, in which the index building module builds an index from the new words obtained by the expansion and the document. In this technical scheme, the high-frequency words among the document keywords are expanded, which reduces the number of high-frequency words among the keywords and avoids the excessive retrieval volume and long retrieval time caused by building an index from a large number of high-frequency words.

In this technical scheme, step 202 specifically comprises: when the preceding word and/or the following word is also a high-frequency word, the high-frequency word processing module combines the preceding word and/or the following word with the current word to form the new word. For example, in "my motherland" ("我的祖国"), "的" is a high-frequency word; when it is expanded, the preceding keyword "我" ("I") is likewise a high-frequency word, so "的" is combined with "我" to form "我的" ("my"), which is used as a new keyword for building the index.

In this technical scheme, step 202 specifically comprises: when the preceding word and/or the following word is a non-high-frequency word, the high-frequency word processing module combines at least the last character of the preceding word with the current word, and/or at least the first character of the following word with the current word, to form the new word. For example, in "mouse pad on the desk" ("桌子上的鼠标垫"), if "的" is taken as the high-frequency word, the preceding "桌子上" ("on the desk") and the following "鼠标垫" ("mouse pad") are both non-high-frequency words. Combining with the preceding non-high-frequency word means taking at least the last character of the preceding keyword, giving at least "上的", although "子上的" or a longer combination is also possible. Combining with the following non-high-frequency word means taking at least the first character of the following keyword, giving at least "的鼠", although "的鼠标" or a longer combination is also possible. The number of characters used for the expansion can be set flexibly as required, and the new keywords obtained after expansion are then used to build the index.

Fig. 3 is a block diagram of a retrieval device according to an embodiment of the present invention.

As shown in Fig. 3, the present invention also proposes a retrieval device 300, comprising: a high-frequency word processing module 302 which, when a current word in a retrieval string is a high-frequency word, expands the current word according to the preceding word and/or the following word adjacent to it; and a retrieval module 304 which uses the new words obtained by the expansion to search a pre-built index. In this technical scheme, the high-frequency words among the keywords of the retrieval string are expanded, which reduces the number of high-frequency words among the keywords and avoids the excessive retrieval volume and long retrieval time caused by searching the index with a large number of high-frequency words.

In the above technical scheme, the retrieval device further comprises the indexing device 100 described above, which builds the index in advance. Combining retrieval with an index obtained by the above technical scheme further optimizes retrieval.

In the above technical scheme, the high-frequency word processing module 302 also adds marks on both sides of the new word; the retrieval module 304 obtains the new word according to the marks, counts the number of times the new words occur in order in a document in order to calculate the relevance of that document, and selects documents as retrieval results according to the relevance obtained. By using exact-string sub-queries in this way, the accuracy of retrieval can be ensured.

Fig. 4 is a flowchart of a retrieval method according to an embodiment of the present invention.

As shown in Fig. 4, the present invention also proposes a retrieval method, comprising: step 402, in which, when a current word in a retrieval string is a high-frequency word, the high-frequency word processing module expands the current word according to the preceding word and/or the following word adjacent to it; and step 404, in which the retrieval module uses the new words obtained by the expansion to search a pre-built index. In this technical scheme, the high-frequency words among the keywords of the retrieval string are expanded, which reduces the number of high-frequency words among the keywords and avoids the excessive retrieval volume and long retrieval time caused by searching the index with a large number of high-frequency words.

In the above technical scheme, before step 404 the method further comprises: building the index in advance by the indexing method described above. Combining retrieval with an index obtained by the above technical scheme further optimizes retrieval.

In the above technical scheme, step 402 further comprises: the high-frequency word processing module adds marks on both sides of the new word; and step 404 specifically comprises: the retrieval module obtains the new word according to the marks, counts the number of times the new words occur in order in a document in order to calculate the relevance of that document, and selects documents as retrieval results according to the relevance obtained. By using exact-string sub-queries in this way, the accuracy of retrieval can be ensured.

Fig. 5 is a block diagram of a retrieval system according to an embodiment of the present invention.

As shown in Fig. 5, the present invention also proposes a retrieval system 500, comprising: the indexing device 100 described above; and the retrieval device 300 described above, which uses the new words it generates to search the index built by the indexing device 100. In this technical scheme, an index built from new keywords generated by high-frequency word expansion is paired with a retrieval process that uses the same expansion, forming a complete information retrieval system. In operation, the whole system can make effective use of the hardware resources of the computer under the existing hardware environment while maintaining relevance accuracy, and thereby improve the user experience.
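How the indexing side and the retrieval side fit together can be illustrated with a toy end-to-end sketch in which both sides apply the same expansion. Everything here is an assumption made for illustration: the `HIGH_FREQ` list, the function names, the choice to drop bare high-frequency tokens from the index, and the relevance score (a simple sum of keyword counts) are not specified by the patent text, and the marking of new words is omitted.

```python
from collections import defaultdict

HIGH_FREQ = {"我", "的"}  # hypothetical high-frequency word list


def keywords(tokens, high_freq=HIGH_FREQ):
    """Keep non-high-frequency tokens and add combined keywords for each
    adjacent pair that involves a high-frequency token."""
    out = [tok for tok in tokens if tok not in high_freq]
    for left, right in zip(tokens, tokens[1:]):
        if left in high_freq or right in high_freq:
            left_part = left if left in high_freq else left[-1:]
            right_part = right if right in high_freq else right[:1]
            out.append(left_part + right_part)
    return out


def build_index(docs):
    """Indexing side: keyword -> {doc_id: occurrence count}."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, tokens in docs.items():
        for kw in keywords(tokens):
            index[kw][doc_id] += 1
    return index


def search(index, query_tokens):
    """Retrieval side: expand the query the same way, intersect, then rank."""
    kws = keywords(query_tokens)
    if not kws:
        return []
    candidates = set(index.get(kws[0], {}))
    for kw in kws[1:]:
        candidates &= set(index.get(kw, {}))
    scores = {d: sum(index[kw][d] for kw in kws) for d in candidates}
    return sorted(scores, key=lambda d: (-scores[d], d))


docs = {
    1: ["我", "的", "大学"],
    2: ["我", "在", "大学", "的", "图书馆"],
    3: ["他", "的", "大学"],
}
index = build_index(docs)
hits = search(index, ["我", "的", "大学"])
```

In this toy collection the combined keyword "我的" points only to document 1, so document 2 is never considered, although it contains all three of the original tokens and would have to be position-checked in a plain positional index.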

Below describe technical scheme of the present invention in detail.

When the user uses search string search information, high frequency words generally with the laggard line search of other word combination, because the roving commission high frequency words is without any meaning, such as the user thinks that search is with the document of " my China " character string, the user takes search string " my China " to go retrieval, rather than search for one " " word, then in the result, go traversal whether to contain the document of " my China " character string with human eye.High frequency words and its former and later two keyword in the retrieval string are closely related, in retrieval string if there is high frequency words, that user certainly be want with retrieval string in the positional information of its former and later two word result of mating completely, if do not need the positional information coupling, then can remove high frequency words fully, if user search " my China ", that user is the document of wanting to have in the document " my China " these four words, search " my China ", that user is the document of wanting to comprise in the document " I " and " China " these two words, so user search is with the retrieval string of high frequency words, purpose is to connect two results about it, the position relationship (both must appear at the left and right sides two ends of current high frequency words) of two keywords in document about determining.In view of the situation, in the technical scheme of embodiments of the invention, when indexing, with the high frequency words expansion, connect to form new keyword with its former and later two word in document.

Indexing keywords formed by combining a high-frequency word with its two adjacent whole words in the document would make retrieval faster still, but the technical solution proposed in the embodiments of the present invention instead indexes keywords formed by combining a high-frequency word with its two adjacent single characters in the document. The reason is that the number of keywords is very large and new words appear every day, so the number of new keywords formed from high-frequency words and whole words would also be very large. If there are N high-frequency words and M words in total (both high-frequency and non-high-frequency), the number of combined keywords can reach N*M. For retrieval speed, the keyword list of an index is generally held in memory, and memory size limits the size of the keyword list; after combination the keyword list would grow N-fold, and memory would very likely be unable to hold it. When a high-frequency word is combined with a single character, by contrast, the number of distinct characters is limited: the number of keywords combining a high-frequency word with a character is at most twice the number of distinct single characters, which fits in memory.

The high-frequency word combination process is shown in Figure 6. At least one of the two keywords to be combined must be a high-frequency word; there is no need to combine two non-high-frequency words. The process is as follows:

Step 602: confirm that at least one of the adjacent words W1 and W2 is a high-frequency word.

Step 604: determine whether keyword W1 and keyword W2 are both high-frequency words. If so, go to step 606; if not, go to step 608.

Step 606: concatenate keyword W1 and keyword W2 into a new keyword; the combination process ends.

Step 608: determine whether W1 is a high-frequency word. If so, go to step 610; if not, go to step 612.

Step 610: combine W1 with the first Chinese character or character of W2 to form a new keyword (if the first character of W2 is a Chinese character, that Chinese character is used; otherwise the first character is used); the combination process ends.

Step 612: combine the last Chinese character or character of W1 with W2 to form a new keyword (if the character of W1 adjoining W2 is a Chinese character, that Chinese character is used; otherwise the adjoining character is used), consistent with the Figure 7 example, in which the keyword formed is "a5" followed by W2; the combination process ends.

For example, as shown in Figure 7, the character string "a1a2a3a4a5b1b2b3b4b5c1c2c3c4c5" is segmented into the word W1 "a1a2a3a4a5", the word W2 "b1b2b3b4b5" and the word W3 "c1c2c3c4c5". If W2 is a high-frequency word, W2 forms one new keyword with W1 and another with W3. If W1 is a high-frequency word, the new keyword combining W1 and W2 is "a1a2a3a4a5b1b2b3b4b5"; if W1 is a non-high-frequency word, the new keyword is "a5b1b2b3b4b5", where a5 occupies 2 bytes if it is a Chinese character (a GBK-encoded Chinese character occupies two bytes) and 1 byte otherwise. Likewise, if W3 is a high-frequency word, the new keyword combining W2 and W3 is "b1b2b3b4b5c1c2c3c4c5"; if W3 is a non-high-frequency word, the new keyword is "b1b2b3b4b5c1", where c1 occupies 2 bytes if it is a Chinese character and 1 byte otherwise.
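The combination rule of steps 602 to 612 can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function name `combine` and the set `high_freq` are hypothetical, and when only W2 is a high-frequency word the sketch takes the last character of W1, following the Figure 7 example. Python strings are indexed by character, so the GBK byte width of a Chinese character does not need special handling here.

```python
from typing import Optional, Set


def combine(w1: str, w2: str, high_freq: Set[str]) -> Optional[str]:
    """Combine two adjacent words into a new keyword (steps 602-612)."""
    w1_hf = w1 in high_freq
    w2_hf = w2 in high_freq
    if not (w1_hf or w2_hf):
        return None          # step 602: nothing to combine
    if w1_hf and w2_hf:
        return w1 + w2       # steps 604/606: both words kept whole
    if w1_hf:
        return w1 + w2[0]    # steps 608/610: W1 plus first character of W2
    return w1[-1] + w2       # step 612: last character of W1 plus W2


# Stand-ins for the W1, W2, W3 of Figure 7 (five characters each).
W1, W2, W3 = "abcde", "vwxyz", "12345"
print(combine(W1, W2, {W2}))      # evwxyz
print(combine(W2, W3, {W2}))      # vwxyz1
print(combine(W1, W2, {W1, W2}))  # abcdevwxyz
```

Returning `None` for two non-high-frequency words keeps the caller free to skip such pairs without a separate check.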

Unlike traditional indexing of high-frequency words, the technical solution of this embodiment adds the combined words of high-frequency words to the index, so both the indexing process and the retrieval process differ from the traditional ones. The indexing process is shown in Figure 8:

Step 802: segment the new document data into words. After segmentation, any high-frequency words in the document data are still independent and not yet combined.

Step 804: combine the high-frequency words in the document in the manner shown in Figure 7, then compute the term frequency and position information of the document's keywords after combination.

Step 806: add the document information to the inverted index keyword by keyword, until all documents have been loaded.
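Steps 802 to 806 can be sketched as a small in-memory positional inverted index. The sketch makes several assumptions that the text does not fix: documents arrive already segmented (step 802 is left to an external segmenter), a combined keyword is recorded at the position of its left-hand word, combined keywords carry a trailing "#" marker, and the names `expand` and `build_index` are hypothetical.

```python
from collections import defaultdict


def expand(tokens, high_freq):
    """Step 804: return (term, position) pairs, including combined keywords."""
    terms = [(tok, pos) for pos, tok in enumerate(tokens)]
    for pos in range(len(tokens) - 1):
        w1, w2 = tokens[pos], tokens[pos + 1]
        if w1 in high_freq and w2 in high_freq:
            new = w1 + w2
        elif w1 in high_freq:
            new = w1 + w2[0]
        elif w2 in high_freq:
            new = w1[-1] + w2
        else:
            continue
        # Assumption: mark combined keywords and record them at the left word.
        terms.append((new + "#", pos))
    return terms


def build_index(docs, high_freq):
    """Step 806: map term -> {doc_id: [positions]}; term frequency is the list length."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for term, pos in expand(tokens, high_freq):
            index[term].setdefault(doc_id, []).append(pos)
    return index


docs = {1: ["we", "the", "people"], 2: ["the", "end"]}
index = build_index(docs, {"the"})
print(index["the"])    # {1: [1], 2: [0]}
print(index["thep#"])  # {1: [1]}
```

The posting list of the combined keyword "thep#" is shorter than that of "the", which is the effect the embodiment relies on.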

The corresponding retrieval process is shown in Figure 9:

Step 902: receive a query string containing a high-frequency word, as input by the user.

Step 904: segment the query string into words.

Step 906: analyze the segmented data. The user's query string either contains a high-frequency word or consists of a high-frequency word alone (the special case in which the user queries only a high-frequency word is not excluded). Combine each high-frequency word in the query string with the words before and after it. The combined keywords may overlap, because the combinations of a high-frequency word with its preceding and following words do not occupy strictly successive positions.

Step 908: search the index using the keywords obtained after segmentation and combination. Use the position information to exclude overlapping positions.

Step 910: compute relevance and output the top N most relevant results.

For example, when a query string containing a high-frequency word is processed, each new keyword produced by combination is marked to indicate that it was formed from two words; the usual marking is to append "#" to the combined keyword. When the character string "a1a2a3a4a5b1b2b3b4b5c1c2c3c4c5" of Figure 7 is used as the query string, it is segmented into the three keywords W1, W2 and W3. Assuming W2 is a high-frequency word, the new keywords after combination are:

"a1a2a3a4a5b1b2b3b4b5#" if W1 is a high-frequency word,

"a5b1b2b3b4b5#" if W1 is a non-high-frequency word,

"b1b2b3b4b5c1c2c3c4c5#" if W3 is a high-frequency word,

"b1b2b3b4b5c1#" if W3 is a non-high-frequency word.

Every new word carries the marker "#" as a suffix, to distinguish it from ordinary non-high-frequency keywords.

The technical solution of the present invention is described further below.

At present, mainstream search engines rely mainly on three data structures, the lexicon file, the inverted list file and the position list file, to implement retrieval logic, as shown in Figure 10. The lexicon file records each word and the offset of the word's inverted entry list within the inverted list file. The inverted list file records the inverted entry list data of all words. The position list file records the positions at which every word occurs in documents. Because high-frequency words occur frequently across the document set (some words appear in 70% of documents) and also occur many times within a single document, the inverted lists and position lists of such words are very long. The embodiments of the present invention propose that, during index building, each high-frequency word be combined in a defined manner with the words before and after it to form high-frequency cascade words, and that the position of a high-frequency cascade word be set to the position of the original word. At retrieval time, the query string submitted by the user is processed in the same way, and queries on high-frequency words are replaced by queries on high-frequency cascade words. Because high-frequency cascade words occur far less often than the original high-frequency words, both across the document set and within a single document, the amount of AND operations and position computation is greatly reduced, which effectively speeds up string retrieval without loss of query correctness.
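The relationship between the three structures can be illustrated with an in-memory stand-in. This is only a sketch under assumptions: list indices play the role of file offsets, and the layout of a posting entry as `(doc_id, term_frequency, position_offset)` is hypothetical, since the text states only what each file records.

```python
from dataclasses import dataclass, field


@dataclass
class ToyIndex:
    lexicon: dict = field(default_factory=dict)    # word -> offset into postings
    postings: list = field(default_factory=list)   # inverted entry lists
    positions: list = field(default_factory=list)  # position lists

    def add_word(self, word, doc_positions):
        """doc_positions maps doc_id -> list of positions of `word` in that document."""
        entry = []
        for doc_id in sorted(doc_positions):
            pos = list(doc_positions[doc_id])
            entry.append((doc_id, len(pos), len(self.positions)))
            self.positions.append(pos)
        self.lexicon[word] = len(self.postings)
        self.postings.append(entry)

    def lookup(self, word):
        """Follow lexicon -> inverted list -> position list for one word."""
        offset = self.lexicon.get(word)
        if offset is None:
            return {}
        return {doc: self.positions[p] for doc, _tf, p in self.postings[offset]}


idx = ToyIndex()
idx.add_word("的", {1: [2, 7], 2: [1]})
print(idx.lookup("的"))  # {1: [2, 7], 2: [1]}
```

For a word present in most documents, both the entry list and the position lists grow with the corpus, which is why the text targets high-frequency words.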

The overall technical scheme for building the index is shown in Figure 11 and specifically comprises:

Step 1102: determine whether there is a document to be indexed.

Step 1104: read the document to be indexed.

Step 1106: perform word segmentation and position tagging on the document.

Step 1108: perform cascade processing on the high-frequency words.

Step 1110: add the generated index to the index database.

The processing logic for high-frequency cascade words is as follows. The input text is first segmented and position-tagged to generate a forward index; the remaining steps are:

Step 1202: read the words of the forward index in order.

Step 1204: determine whether any words remain. If so, go to step 1206; otherwise end the operation.

Step 1206: check the current word of the forward index against the pre-generated high-frequency vocabulary to determine whether it is a high-frequency word. If it is a high-frequency word, go to step 1208; otherwise return to step 1204.

Step 1208: examine the word preceding this word. If a preceding word exists, go to step 1210; if not, go to step 1214.

Step 1210: determine whether the preceding word is a high-frequency word. If so, go to step 1214; otherwise go to step 1212.

Step 1212: form a new word from the last Chinese character of the preceding word (for an English word, its last character) and this word, in line with claim 3, which combines the last character of the preceding word with the current word.

Step 1214: examine the word following this word. If a following word exists, go to step 1216; otherwise return to step 1204.

Step 1216: determine whether the following word is a high-frequency word. If so, go to step 1218; otherwise go to step 1220.

Step 1218: combine this word with the following word to generate a high-frequency cascade word, and record the position of the high-frequency cascade word as the position of the current word.

Step 1220: if the following word exists and is not a high-frequency word, combine this word with the first Chinese character of the following word (for an English word, its first character).

Step 1222: add the cascade marker to the new word to generate a high-frequency cascade word.

Step 1224: record the position of the high-frequency cascade word as the current position and insert it into the forward index.

The cascade marker can be any symbol that takes no part in indexing or retrieval. For ease of presentation, this system uses "#" as the cascade marker. Because the cascade marker takes no part in indexing or retrieval, the normal segmentation results produced by the word segmentation module cannot contain high-frequency cascade words, so no conflict arises.
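The cascade processing of steps 1202 to 1224 can be sketched as below. The sketch rests on assumptions: only high-frequency words trigger cascading, the marker "#" is placed on both sides of a cascade word (as in claim 9), the character taken from a non-high-frequency preceding word is its last character (as in claim 3), and the function name `cascade` and the Chinese tokens are a hypothetical reconstruction of the translated example "my university is very beautiful".

```python
MARK = "#"  # cascade marker; any symbol excluded from indexing and retrieval


def cascade(tokens, high_freq):
    """Return the forward index as (term, position) pairs, positions starting at 1."""
    forward = []
    for pos, word in enumerate(tokens, start=1):
        forward.append((word, pos))
        if word not in high_freq:      # step 1206: only high-frequency words cascade
            continue
        if pos > 1:                    # steps 1208-1212: preceding word
            prev = tokens[pos - 2]
            if prev not in high_freq:
                forward.append((MARK + prev[-1] + word + MARK, pos))
        if pos < len(tokens):          # steps 1214-1220: following word
            nxt = tokens[pos]
            tail = nxt if nxt in high_freq else nxt[0]
            # Steps 1222/1224: add the marker, record at the current position.
            forward.append((MARK + word + tail + MARK, pos))
    return forward


tokens = ["我", "的", "大学", "很", "美丽"]
for term, pos in cascade(tokens, {"我", "的"}):
    print(term, pos)
```

When the preceding word is itself a high-frequency word, the pair has already been emitted while processing that word, so nothing is emitted again for it.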

For example, during index building, the document "my university is very beautiful" (in the original Chinese, "I" is 我, the possessive particle "of" is 的, and "university" is 大学) is segmented and position-tagged as:

(I, 1)/(of, 2)/(university, 3)/(very, 4)/(beautiful, 5)

Looking up the high-frequency vocabulary shows that "I" and "of" are high-frequency words. After the high-frequency cascade word processing logic is applied, the forward-index document to be indexed is:

(I, 1)/(#我的#, 1)/(of, 2)/(#的大#, 2)/(university, 3)/(very, 4)/(beautiful, 5)

Here #我的# combines "I" with "of", and #的大# combines "of" with 大, the first character of "university". Inverted entries are built for this forward-index document by the normal processing logic; (#我的#, 1) and (#的大#, 2) are the high-frequency cascade word index entries. String queries undergo the same processing. For example, the string query "my university" is segmented and position-tagged as:

(I, 1)/(of, 2)/(university, 3)

Applying the high-frequency cascade word processing logic yields the final set of query terms, in which #我的# is the cascade word for "I" plus "of" and #的大# is the cascade word for "of" plus the first character of "university":

(I, 1)/(#我的#, 1)/(of, 2)/(#的大#, 2)/(university, 3)

In the retrieval phase, the string retrieval logic only needs to query the set:

(#我的#, 1)/(#的大#, 2)/(university, 3)

Retrieval first reads the inverted lists of "#我的#", "#的大#" and "university" and performs an AND operation. For each document that contains all three terms, the position lists of the three terms are read and some method is used to determine whether the terms occur in order. Only when the three terms occur in order is the occurrence counted, and the count serves as a factor in relevance evaluation. Clearly "#我的#" and "#的大#" occur far less often than "I" and "of", so their inverted lists and position lists are far shorter, which greatly reduces the amount of computation and improves retrieval speed.
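The AND operation and the in-order position check can be sketched as follows. The text leaves the ordering test unspecified ("some method"), so the offset comparison used here is an assumption, as are the function name `phrase_count`, the index layout `term -> {doc_id: [positions]}` and the sample postings. The query is assumed to be non-empty.

```python
def phrase_count(query_terms, index):
    """Count in-order occurrences of the query per document.

    query_terms: list of (term, query_position) pairs.
    index: dict mapping term -> {doc_id: [positions]}.
    """
    postings = [index.get(term, {}) for term, _ in query_terms]
    # AND operation: keep only documents that contain every query term.
    docs = set(postings[0])
    for plist in postings[1:]:
        docs &= set(plist)
    base = query_terms[0][1]
    result = {}
    for doc in docs:
        count = 0
        for start in postings[0][doc]:
            # Each later term must sit at the same offset as in the query.
            if all(start + qpos - base in plist[doc]
                   for (_, qpos), plist in zip(query_terms[1:], postings[1:])):
                count += 1
        if count:
            result[doc] = count
    return result


index = {
    "#我的#": {1: [1], 2: [1]},
    "#的大#": {1: [2], 2: [2]},
    "大学": {1: [3], 2: [7]},
}
query = [("#我的#", 1), ("#的大#", 2), ("大学", 3)]
print(phrase_count(query, index))  # {1: 1}
```

Document 2 contains all three terms but not at consecutive positions, so it survives the AND step and is rejected by the position check.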

Adding high-frequency cascade words enlarges the search engine's lexicon and also enlarges the inverted entry list file and the position file. For query efficiency, a running search engine usually loads the lexicon file into memory. In theory, the high-frequency cascade word generation module may produce 2*n*n+2*n*m new words, where n is the number of high-frequency words and m is the number of Chinese characters and English letters; the factor of 2 accounts for the two cascade directions. Because documents generally follow grammatical and pragmatic rules, however, the number of new words actually produced during indexing is far below the theoretical value. By controlling n and m, the number of newly generated words can be kept within what the memory of current retrieval servers can accommodate. The inverted entry list file and the position file are both stored on disk, so their growth does not cause an appreciable performance loss. The high-frequency cascade word strategy thus trades space for time, markedly improving the efficiency of string lookup and frequency computation during retrieval and raising retrieval speed.
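The theoretical bound is easy to evaluate for concrete values. The figures below (n = 100 high-frequency words, m = 7000 characters and letters) are hypothetical and chosen only to show the order of magnitude; the function name `max_new_words` is likewise an assumption.

```python
def max_new_words(n: int, m: int) -> int:
    """Upper bound 2*n*n + 2*n*m on the number of high-frequency cascade words."""
    word_word = 2 * n * n  # high-frequency word joined with a high-frequency word
    word_char = 2 * n * m  # high-frequency word joined with a single character
    return word_word + word_char


print(max_new_words(100, 7000))  # 1420000
```

The second term dominates whenever m is much larger than n, so limiting the high-frequency vocabulary size n is the most direct way to bound lexicon growth.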

In summary, the technical solution of the present invention provides an indexing device, an indexing method, a retrieval device, a retrieval method and a retrieval system. During document indexing and retrieval in a search engine, high-frequency words are combined in a defined manner with the words before and after them to form high-frequency cascade words, which are indexed; in the retrieval phase, the high-frequency cascade words take the place of the original high-frequency words. Because the inverted lists and position lists of high-frequency cascade words are much shorter than those of the original high-frequency words, the amount of computation for string lookup and string frequency statistics during retrieval is greatly reduced, and retrieval efficiency improves markedly while retrieval accuracy is preserved. Taking the existing computer hardware environment into account, the present invention addresses the heavy computation and low speed of string retrieval involving high-frequency words under today's traditional index structures, trading a limited amount of space for faster queries and improving the user experience.

The technical solution of the present invention has been described above with reference to the accompanying drawings. High-frequency words are expanded and combined into new keywords, and these combined keywords are used to build the index database and to retrieve from it. In this way, on existing computer hardware and without sacrificing relevance accuracy, limited space resources are used to achieve an effective improvement in retrieval efficiency and a better user experience.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit it; those skilled in the art may make various modifications and variations. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. An indexing device, characterized in that it comprises:

a high-frequency word processing module which, when a current word in a document is a high-frequency word, expands the current word according to the preceding word and/or the following word adjacent to the current word;

an index building module which builds an index according to the new word obtained by the expansion and the document.

2. The indexing device according to claim 1, characterized in that, when the preceding word and/or the following word is also a high-frequency word, the high-frequency word processing module combines the preceding word and/or the following word with the current word to form the new word.

3. The indexing device according to claim 1, characterized in that, when the preceding word and/or the following word is a non-high-frequency word, the high-frequency word processing module combines at least the last Chinese character or character of the preceding word with the current word, and/or combines at least the first Chinese character or character of the following word with the current word, to form the new word.

4. An indexing method, characterized in that it comprises:

step 202: when a current word in a document is a high-frequency word, a high-frequency word processing module expands the current word according to the preceding word and/or the following word adjacent to the current word;

step 204: an index building module builds an index according to the new word obtained by the expansion and the document.

5. The indexing method according to claim 4, characterized in that step 202 specifically comprises:

when the preceding word and/or the following word is also a high-frequency word, the high-frequency word processing module combines the preceding word and/or the following word with the current word to form the new word.

6. The indexing method according to claim 4, characterized in that step 202 specifically comprises:

when the preceding word and/or the following word is a non-high-frequency word, the high-frequency word processing module combines at least the last Chinese character or character of the preceding word with the current word, and/or combines at least the first Chinese character or character of the following word with the current word, to form the new word.

7. A retrieval device, characterized in that it comprises:

a high-frequency word processing module which, when a current word in a query string is a high-frequency word, expands the current word according to the preceding word and/or the following word adjacent to the current word;

a retrieval module which uses the new word obtained by the expansion to search a pre-built index.

8. The retrieval device according to claim 7, characterized in that it further comprises:

the indexing device according to any one of claims 1 to 4, which pre-builds the index.

9. The retrieval device according to claim 8, characterized in that the high-frequency word processing module further adds a marker on both sides of the new word;

the retrieval module obtains the new word according to the marker and counts the number of times the new words occur in order in the document, the count being used to compute a relevance for the document, and selects documents as retrieval results according to the relevance obtained.

10. A retrieval method, characterized in that it comprises:

step 402: when a current word in a query string is a high-frequency word, a high-frequency word processing module expands the current word according to the preceding word and/or the following word adjacent to the current word;

step 404: a retrieval module searches a pre-built index according to the new word obtained by the expansion.

11. The retrieval method according to claim 10, characterized in that, before step 404, it further comprises:

pre-building the index by the indexing method according to any one of claims 4 to 6.

12. The retrieval method according to claim 11, characterized in that step 402 further comprises:

the high-frequency word processing module adds a marker on both sides of the new word;

and step 404 specifically comprises: the retrieval module obtains the new word according to the marker and counts the number of times the new words occur in order in the document, the count being used to compute a relevance for the document, and selects documents as retrieval results according to the relevance obtained.

13. A retrieval system, characterized in that it comprises:

the indexing device according to any one of claims 1 to 3;

the retrieval device according to any one of claims 7 to 9, which uses the new words it generates to search the index built by the indexing device.