An online index updating method for a text retrieval system
Technical field
The invention belongs to the field of intelligent information processing and specifically relates to an online index updating method for a text retrieval system.
Background art
With the rapid development of computer and network technology, the number of electronic documents has grown sharply. How to find the required information quickly, comprehensively, and accurately within this mass of data has become a widely shared concern and an active research topic. Most electronic documents are unstructured text written in natural language, and full-text search is currently an important means of handling such text.
Full-text search can be implemented in several ways, including inverted indexes, suffix arrays, and signature files.
An ordinary index maps a document number to all the words in that document. An inverted index reverses this relation and maps each word to the numbers of all documents in which the word occurs, so that all documents containing a given word can be found quickly. In practice, an inverted index usually also records how many times the word occurs in each document and where. To make retrieval easier, inverted lists are normally kept sorted.
The following is an example of an inverted index.
Suppose there are two articles, 1 and 2:
The content of article 1 is: Tom lives in Guangzhou, I live in Guangzhou too.
The content of article 2 is: He once lived in Shanghai.
1) First, the keywords of the two articles must be obtained. This usually requires the following processing:
a. What we have is the article content, that is, a character string. All the words in the string must first be found, which is called word segmentation. English words are separated by spaces and are relatively easy to handle. Chinese words run together and need special word segmentation.
b. Words in the articles such as "in", "once", and "too" carry no practical meaning, and Chinese words such as "的" and "是" usually have no concrete meaning either. These words do not represent a concept and can be filtered out.
c. A user who searches for "He" usually wants articles containing "he" and "HE" to be found as well, so all words need to be converted to a single case.
d. A user who searches for "live" usually wants articles containing "lives" and "lived" to be found as well, so "lives" and "lived" need to be reduced to "live".
e. Punctuation marks in the articles usually do not represent a concept and can also be filtered out.
After the processing above, all the keywords of article 1 are: [tom] [live] [guangzhou] [i] [live] [guangzhou].
All keywords of article 2 are: [he] [live] [shanghai].
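The processing steps a to e can be sketched as follows. This is a minimal illustration, not a general-purpose analyzer: the stop-word list `STOP_WORDS` and the stemming table `STEMS` are hypothetical and contain only what the two example articles need, and the function name `extract_keywords` is chosen here for illustration.

```python
import re

# Hypothetical tables, just large enough for the two example articles.
STOP_WORDS = {"in", "once", "too"}
STEMS = {"lives": "live", "lived": "live"}

def extract_keywords(text):
    """Return the keywords of `text` in order of appearance."""
    # Steps a and e: keep runs of letters only, which splits on spaces
    # and drops punctuation at the same time.
    words = re.findall(r"[A-Za-z]+", text)
    keywords = []
    for word in words:
        word = word.lower()                      # step c: unify case
        if word in STOP_WORDS:                   # step b: drop words without meaning
            continue
        keywords.append(STEMS.get(word, word))   # step d: reduce to the base form
    return keywords

article1 = "Tom lives in Guangzhou, I live in Guangzhou too."
article2 = "He once lived in Shanghai."
keywords1 = extract_keywords(article1)
keywords2 = extract_keywords(article2)
print(keywords1)
print(keywords2)
```

A real system would replace the two small tables with a full stop-word list and a proper stemmer, and would need a dedicated segmenter for Chinese text.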
2) Once the keywords are available, the inverted index can be built. The relation above is from "article number" to "all keywords in the article". The inverted index turns this relation around, so that it becomes from "keyword" to "the numbers of all articles containing this keyword". After inversion, articles 1 and 2 become:
Keyword      Article number
guangzhou    1
he           2
i            1
live         1,2
shanghai     2
tom          1
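The inversion from "article number to keywords" into "keyword to article numbers" can be sketched as follows. The function name `build_inverted_index` and the dictionary layout are illustrative assumptions, not part of the original description.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs maps an article number to its list of keywords."""
    index = defaultdict(set)
    for doc_id, keywords in docs.items():
        for keyword in keywords:
            index[keyword].add(doc_id)
    # Sort the keywords and each list of article numbers,
    # since inverted lists are normally kept ordered.
    return {keyword: sorted(ids) for keyword, ids in sorted(index.items())}

docs = {
    1: ["tom", "live", "guangzhou", "i", "live", "guangzhou"],
    2: ["he", "live", "shanghai"],
}
index = build_inverted_index(docs)
for keyword, ids in index.items():
    print(keyword, ",".join(str(i) for i in ids))
```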
Knowing only which articles a keyword occurs in is usually not enough. We also need to know how many times the keyword occurs in each article and where. Two kinds of position are commonly used: a) character position, which records the character offset of the word in the article (its advantage is fast positioning when keywords are highlighted for display); b) keyword position, which records the ordinal of the word among the keywords of the article (its advantages are a smaller index and fast phrase queries).
After adding the "frequency" and "position" information, the index structure becomes:
Keyword      Article number [frequency]    Positions
guangzhou    1[2]                          3,6
he           2[1]                          1
i            1[1]                          4
live         1[2],2[1]                     2,5,2
shanghai     2[1]                          3
tom          1[1]                          1
Take the row for "live" as an example of this structure. "live" occurs twice in article 1 and once in article 2. What do the positions "2,5,2" mean? They must be read together with the article numbers and frequencies: "live" occurs twice in article 1, so "2,5" are the two positions at which it occurs in article 1; it occurs once in article 2, so the remaining "2" means that "live" is the 2nd keyword of article 2.
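The structure with frequencies and keyword positions can be reproduced with a short sketch. The names `build_positional_index` and `format_entry` are hypothetical; positions are keyword positions counted from 1, as in the example above.

```python
def build_positional_index(docs):
    """Map keyword -> {article number -> list of keyword positions}."""
    index = {}
    for doc_id in sorted(docs):
        for position, keyword in enumerate(docs[doc_id], start=1):
            index.setdefault(keyword, {}).setdefault(doc_id, []).append(position)
    return index

def format_entry(keyword, postings):
    """Render one row as 'keyword article[frequency],... positions'."""
    articles = ",".join(f"{doc_id}[{len(pos)}]" for doc_id, pos in postings.items())
    positions = ",".join(str(p) for pos in postings.values() for p in pos)
    return f"{keyword} {articles} {positions}"

docs = {
    1: ["tom", "live", "guangzhou", "i", "live", "guangzhou"],
    2: ["he", "live", "shanghai"],
}
positional = build_positional_index(docs)
for keyword in sorted(positional):
    print(format_entry(keyword, positional[keyword]))
```

The frequency is simply the length of each position list, so it does not need to be stored separately in this sketch.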
The suffix array is a highly space-efficient text index structure proposed by Manber and Myers in 1993. It records the lexicographic order of every suffix of the text: the starting positions of all suffixes are stored in a list, sorted by the lexicographic order of the suffixes they refer to.
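A suffix array can be illustrated with a deliberately naive sketch. Sorting the suffixes directly, as done here, is simple but slow for long texts; it is not the construction algorithm of Manber and Myers. The function names and the sample text "banana" are illustrative assumptions.

```python
def build_suffix_array(text):
    """Starting positions of all suffixes, sorted by the suffix they start."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, suffix_array, pattern):
    """Return the sorted starting positions at which `pattern` occurs."""
    m = len(pattern)
    lo, hi = 0, len(suffix_array)
    # Binary search for the first suffix whose first m characters are >= pattern.
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    # All suffixes that begin with the pattern are adjacent in the array.
    matches = []
    while lo < len(suffix_array) and text.startswith(pattern, suffix_array[lo]):
        matches.append(suffix_array[lo])
        lo += 1
    return sorted(matches)

text = "banana"
suffix_array = build_suffix_array(text)
occurrences = find_occurrences(text, suffix_array, "ana")
print(suffix_array, occurrences)
```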
A signature file hashes each keyword of a document to a bit string of F bits. The keywords of the original document are visited in order, and the bit strings produced by hashing are stored in the file one after another.
Its matching idea is as follows. Suppose we want to decide whether string A and string B match. A and B are first hashed to the numbers hash(A) and hash(B). If hash(A) != hash(B), then A != B. However, hash(A) = hash(B) does not prove that A = B.
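A concrete matching example follows:
Keyword x[0..5]: A A C T C T, hash(x[0..5]) = 17579;
Text y[0..9]: G C A A C T C T C A, hash(y[0..5]) = 17819;
Text y[0..9]: G C A A C T C T C A, hash(y[1..6]) = 17533;
Text y[0..9]: G C A A C T C T C A, hash(y[2..7]) = 17579.
Only the window y[2..7] has the same hash value as the keyword, and comparing the characters confirms that y[2..7] = A A C T C T matches x.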
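The matching idea can be sketched as follows. The hash function used in the example above is not specified in the text, so this sketch substitutes a simple polynomial hash with assumed constants `BASE` and `MOD`; its hash values therefore differ from the numbers shown, but the logic (unequal hashes rule out a match, equal hashes still require a direct comparison) is the same.

```python
BASE = 256          # assumed constants, not taken from the example above
MOD = 1_000_003

def poly_hash(s):
    """Polynomial hash of a string, reduced modulo MOD."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h

def find_matches(pattern, text):
    """Return all start positions where `pattern` occurs in `text`."""
    m = len(pattern)
    target = poly_hash(pattern)
    matches = []
    for start in range(len(text) - m + 1):
        window = text[start:start + m]
        # Different hashes prove a mismatch. Equal hashes do not prove
        # a match, so the strings are compared directly as well.
        if poly_hash(window) == target and window == pattern:
            matches.append(start)
    return matches

matches = find_matches("AACTCT", "GCAACTCTCA")
print(matches)
```

For clarity the hash of each window is recomputed from scratch; a production matcher would update it incrementally as the window slides.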
A signature file has the following advantages:
1) the file organization is simple and essentially follows the order of the original documents;
2) it is easy to maintain: generation, insertion, and deletion are all convenient;
3) it requires little space, especially when superimposed coding is used.
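Superimposed coding, mentioned in advantage 3, can be sketched as follows: the signatures of all keywords of a document are OR-ed into one F-bit signature, and a query word may be present only if every bit of its own signature is set in the document signature. The width `F`, the number of bits per word, and the use of SHA-256 to pick bit positions are assumptions made for this illustration only.

```python
import hashlib

F = 64              # signature width in bits (assumed)
BITS_PER_WORD = 3   # bits set per keyword (assumed)

def word_signature(word):
    """Set up to BITS_PER_WORD bit positions derived from the word."""
    signature = 0
    for k in range(BITS_PER_WORD):
        digest = hashlib.sha256(f"{k}:{word}".encode("utf-8")).digest()
        signature |= 1 << (int.from_bytes(digest[:4], "big") % F)
    return signature

def document_signature(keywords):
    """Superimpose (bitwise OR) the signatures of all keywords."""
    signature = 0
    for keyword in keywords:
        signature |= word_signature(keyword)
    return signature

def may_contain(doc_signature, word):
    """False means the word is certainly absent; True means it may be present."""
    ws = word_signature(word)
    return (doc_signature & ws) == ws

article2_signature = document_signature(["he", "live", "shanghai"])
print(may_contain(article2_signature, "live"))
```

A positive answer can be a false match, so candidates found this way still have to be checked against the document itself.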
Of these, the inverted index is the most widely used, and it performs well for word-based queries.
In practical applications the document collection usually changes constantly: new content is added to the collection, and outdated content is deleted or updated. If the index is not updated promptly as the collection changes, the quality of the search results keeps declining: newly added documents cannot be retrieved, while documents that no longer exist or whose content has changed are still returned. The index must therefore be updated continuously so that it reflects changes in the document collection in time.
The simplest way to update an index is to rebuild it offline, that is, to discard the outdated index and rebuild it completely from the latest data. Because Web search engines must handle large volumes of updated data and have high requirements for retrieval efficiency, early ones mostly took this approach.
The other common way to update an index is online updating. A typical online updating method is the update strategy adopted by Clarke et al. in the text retrieval system MultiText. The MultiText index is stored on disk as a circular file whose end joins its beginning. (Ordinary file systems do not directly support circular files, but a circular file can be simulated on an ordinary file through an abstraction layer.) At any time, this file consists of three contiguous parts: the index still to be updated, the index already updated, and free space.
During retrieval, it is first necessary to determine which part of the index the search term falls in. Because the index is arranged on disk in lexicographic order, it is enough to remember the boundary between the two index parts, and no disk access is needed for this step. Because both parts (to be updated and already updated) have a complete inverted index structure, index entries can be found in the usual way, and ideally a single disk access is enough to obtain the required posting list.
During an update, the postings generated by processing newly added documents are held temporarily in a memory buffer. A background process continuously reads the to-be-updated part of the index, merges it with the postings in memory, and appends the result to the end of the updated part. In this process the to-be-updated part keeps shrinking and the updated part keeps growing, until the to-be-updated part has been entirely converted into the updated part.
Although the online updating strategy of MultiText achieves continuous index updating with good retrieval efficiency, it has several shortcomings:
1) it is suitable only for adding new documents, not for applications that frequently delete and modify documents;
2) it cannot guarantee real-time behavior: before a newly added document is certain to be retrievable, at least one complete update cycle must elapse;
3) it cannot guarantee consistency: during merging, the dictionary is always split into an updated part and a not yet updated part, so a search for a newly added document sometimes finds it and sometimes does not.
The analysis above shows that the difficulty of index updating is that updating a few documents often requires rewriting most of the index, even though most documents in the index are unrelated to the current update. Taking MultiText as an example, updating even a single document requires rewriting the whole index.
Summary of the invention
The objective of the invention is to propose a new online index updating method for a text retrieval system that guarantees the real-time behavior and consistency of index updates without affecting the search function of the text retrieval system.
The invention is implemented as follows: a new online index updating method for a text retrieval system, comprising the following steps:
1) divide the index into two parts, a master index and a secondary index; the secondary index has the same structure as the master index, is stored in full both in memory and on disk, and temporarily holds recently added documents;
2) read the content to be indexed;
3) determine whether the operation on the content to be indexed is an addition or a deletion, and process it accordingly:
A: for an addition, add the index of the content to the secondary index;
B: for a deletion, save the document deletion information in the secondary index; the deletion information is stored as a Boolean vector in which each document corresponds to one bit.
The master index and the secondary index are divided according to the following criterion: the master index consists of the great majority of documents, which seldom change, and the secondary index consists of the few documents that change frequently.
Further, it is determined whether the secondary index needs to be merged into the master index. If so, the secondary index and the document deletion information to be merged are merged into the master index, and the merged secondary index and document deletion information are then cleared.
Further, it is determined whether there is still content to be indexed. If there is, the method jumps to step 2). Otherwise, it is determined whether there is a request to stop updating the index: if there is, the operation ends; otherwise the method waits for a period of time and then repeats the check.
Whether the secondary index needs to be merged into the master index is determined according to criterion A, B, or C below:
A: a file size or a number of documents is set in advance for the secondary index; when the set file size or number of documents is exceeded, the merge is performed;
B: when the load of the system falls below a preset parameter, the merge is performed;
C: a combination of A and B.
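The three merge criteria can be sketched as a small policy object. All threshold values and names here (`MergePolicy`, `max_docs`, `max_bytes`, `idle_load`) are hypothetical. The text does not say how criterion C combines A and B; this sketch assumes that either condition is sufficient, which is only one possible reading.

```python
from dataclasses import dataclass

@dataclass
class MergePolicy:
    max_docs: int = 10_000         # criterion A: document limit (assumed value)
    max_bytes: int = 256 * 2**20   # criterion A: file-size limit (assumed value)
    idle_load: float = 0.2         # criterion B: load below this counts as idle (assumed)
    mode: str = "C"                # "A", "B", or "C"

    def should_merge(self, doc_count, size_bytes, system_load):
        size_exceeded = doc_count > self.max_docs or size_bytes > self.max_bytes
        system_idle = system_load < self.idle_load
        if self.mode == "A":
            return size_exceeded
        if self.mode == "B":
            return system_idle
        # Criterion C, read here as "A or B" (an assumption).
        return size_exceeded or system_idle

policy = MergePolicy()
print(policy.should_merge(doc_count=500, size_bytes=1_000_000, system_load=0.9))
print(policy.should_merge(doc_count=500, size_bytes=1_000_000, system_load=0.1))
```

How the system load is measured is left open; any value that grows with the load, such as CPU utilization or the length of the query queue, would fit this interface.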
The master index and the secondary index may use index structures such as an inverted index, a suffix array, or a signature file.
The concrete division between the master index and the secondary index depends on the application environment, including the total amount of data in the application, the amount of data added per day or per hour, and the hardware configuration.
The effect of the invention is as follows: by using a secondary index, the invention achieves online index updating for a text retrieval system and guarantees the real-time behavior and consistency of index updates without affecting the search function of the system. Experiments show that on an ordinary PC (P4 2.0 GHz CPU, 1.0 GB of memory), the full-text search implemented according to the invention achieves real-time online index updating while preserving index integrity. In the experiments, when the secondary index holds fewer than 10,000 documents, additions are fast (all below 0.3 second), the speed of deletions is not affected by the secondary index, and a modification is a deletion combined with an addition, so its time is approximately the sum of the two.
Description of drawings
Fig. 1 is a flowchart of the method of the invention.
Embodiment
Specific embodiments of the invention are described in detail below with reference to the accompanying drawing.
In practice, update operations on an index are found to exhibit locality. Based on this characteristic, the method of the invention divides the index into two parts: a master index consisting of the great majority of documents, which seldom change, and a small secondary index consisting of the documents that have changed recently.
What counts as "the great majority" and as "seldom change" depends on the application environment, including the total amount of data in the application, the amount of data added per day or per hour, and the hardware configuration. For example, an application may collect each day's updates in the secondary index and merge the secondary index into the master index around midnight, when the system is idle.
Because the secondary index is small, update operations on it complete quickly, which guarantees real-time behavior. The time, temporary space, and computing resources required by an update are all small, which avoids the consistency problems caused by updating the index in stages.
If the secondary index were placed on disk like the master index, a performance problem would arise. Retrieval performance depends on the number of disk accesses: with the secondary index on disk, a retrieval that originally needed one disk access would need at least two in this method, nearly doubling the cost. Since the secondary index is much smaller than the master index, it can be held entirely in memory. For the sake of consistency, however, the secondary index cannot be kept only in memory; otherwise, if the system failed, the entire content of the secondary index would be lost and the index would be incomplete. A backup on disk is therefore also required.
In the invention, the secondary index is an index that has the same structure as the master index, is stored in full both in memory and on disk, and temporarily holds recently added documents.
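One way to keep the secondary index both in memory and on disk is sketched below. The text only requires that a backup exist on disk; the append-only log of JSON lines used here, the class name `DurableSecondaryIndex`, and the file name are assumptions made for illustration. Searches read only the in-memory dictionary, so they need no extra disk access.

```python
import json
import os
import tempfile

class DurableSecondaryIndex:
    """In-memory postings backed by an append-only log on disk (assumed format)."""

    def __init__(self, path):
        self.path = path
        self.postings = {}   # keyword -> list of document ids
        # After a restart, rebuild the in-memory index from the disk backup.
        if os.path.exists(path):
            with open(path, encoding="utf-8") as log:
                for line in log:
                    record = json.loads(line)
                    self._apply(record["doc_id"], record["keywords"])

    def _apply(self, doc_id, keywords):
        for keyword in set(keywords):
            self.postings.setdefault(keyword, []).append(doc_id)

    def add_document(self, doc_id, keywords):
        # Write to disk first, so that a crash cannot lose an acknowledged update.
        with open(self.path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"doc_id": doc_id, "keywords": keywords}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self._apply(doc_id, keywords)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "secondary.log")
    first = DurableSecondaryIndex(path)
    first.add_document(1, ["tom", "live"])
    first.add_document(2, ["he", "live"])
    recovered = DurableSecondaryIndex(path)   # simulates a restart after a failure
    print(recovered.postings["live"])
```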
An example of a specific implementation of the invention follows.
The experiment was carried out on an ordinary PC (P4 2.0 GHz CPU, 1.0 GB of memory), and online index updating for full-text search was implemented according to the method of the invention.
As shown in Fig. 1, the method comprises the following steps:
1) read the content to be indexed;
2) if the operation on the content to be indexed is a document modification, decompose the modification into a deletion followed by an addition;
3) if the operation is a document addition, go to step 4; if the operation is a document deletion, go to step 8;
4) add the index of the new document content to the secondary index;
5) determine whether the secondary index needs to be merged into the master index; if so, go to step 6; otherwise jump to step 9;
6) merge the secondary index and the document deletion information into the master index;
7) clear the secondary index and the Boolean vector that stores the document deletion information, then jump to step 9;
8) for a document deletion, set the corresponding bit of the Boolean vector that stores the document deletion information to 1;
9) determine whether there is more content to be indexed; if so, jump to step 1; otherwise go to step 10;
10) determine whether there is a request to stop indexing; if so, exit; otherwise wait for a period of time and jump to step 9.
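The control flow of steps 1 to 9 can be sketched as follows. This is a simplified illustration under several assumptions: both indexes are plain dictionaries from keyword to a set of document ids, a Python set stands in for the Boolean deletion vector, the merge is triggered by a document count only, and a modified document receives a new document id (`new_id`). The waiting loop of step 10 is omitted.

```python
from collections import deque

class TwoLevelIndex:
    def __init__(self, merge_threshold):
        self.master = {}       # keyword -> set of document ids
        self.secondary = {}    # same structure, holds recent additions
        self.secondary_docs = 0
        self.deleted = set()   # stand-in for the Boolean deletion vector
        self.merge_threshold = merge_threshold

    def add(self, doc_id, keywords):              # step 4
        for keyword in set(keywords):
            self.secondary.setdefault(keyword, set()).add(doc_id)
        self.secondary_docs += 1
        if self.secondary_docs >= self.merge_threshold:   # step 5
            self.merge()

    def delete(self, doc_id):                     # step 8
        self.deleted.add(doc_id)

    def merge(self):
        # Step 6: fold the secondary index into the master index and
        # drop every document that was marked as deleted.
        for keyword, ids in self.secondary.items():
            self.master.setdefault(keyword, set()).update(ids)
        for keyword in list(self.master):
            self.master[keyword] -= self.deleted
            if not self.master[keyword]:
                del self.master[keyword]
        # Step 7: clear the secondary index and the deletion information.
        self.secondary.clear()
        self.secondary_docs = 0
        self.deleted.clear()

def process(index, operations):
    queue = deque(operations)
    while queue:                                   # steps 1 and 9
        op = queue.popleft()
        if op[0] == "modify":                      # step 2: delete, then add
            _, old_id, new_id, keywords = op
            queue.appendleft(("add", new_id, keywords))
            queue.appendleft(("delete", old_id))
        elif op[0] == "add":                       # step 3 -> step 4
            index.add(op[1], op[2])
        elif op[0] == "delete":                    # step 3 -> step 8
            index.delete(op[1])

index = TwoLevelIndex(merge_threshold=3)
process(index, [
    ("add", 1, ["tom", "live"]),
    ("add", 2, ["he", "live"]),
    ("modify", 1, 3, ["tom", "move"]),
    ("delete", 2),
])
print(index.master, index.secondary, index.deleted)
```

In this run the third addition triggers a merge, so document 1 is physically removed from the master index, while the later deletion of document 2 is only recorded until the next merge.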
Deletions in the steps above are handled with a Boolean vector. Each bit of the vector corresponds to one document. When a document is deleted, the corresponding bit is simply set to "1". Both the retrieval algorithm and the index merge algorithm skip documents whose bit is "1", so from the point of view of the application the document has been deleted. When the indexes are merged, the documents marked "1" are skipped by the merge algorithm and thus actually disappear from the master index.
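The Boolean vector and the way retrieval skips deleted documents can be sketched as follows. The class name `DeletionVector`, the packing of eight documents per byte, and the 0-based document numbering are assumptions made for this illustration.

```python
class DeletionVector:
    """One bit per document; a set bit means the document is deleted."""

    def __init__(self, num_docs):
        self.bits = bytearray((num_docs + 7) // 8)

    def mark_deleted(self, doc_id):
        self.bits[doc_id // 8] |= 1 << (doc_id % 8)

    def is_deleted(self, doc_id):
        return bool(self.bits[doc_id // 8] & (1 << (doc_id % 8)))

    def clear(self):
        # Called after a merge, once deleted documents are really gone.
        self.bits = bytearray(len(self.bits))

def live_postings(postings, deletions):
    """Skip every document whose bit is set, as retrieval and merging do."""
    return [doc_id for doc_id in postings if not deletions.is_deleted(doc_id)]

deletions = DeletionVector(num_docs=16)
deletions.mark_deleted(2)
visible = live_postings([1, 2, 9], deletions)
print(visible)
```

Setting one bit does not touch any posting list, which is consistent with the observation that deletion time does not depend on the size of the secondary index.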
The invention can use an inverted index structure, or alternatively a suffix array or a signature file, with no difference in the method of operation. In this experiment the inverted index is used as the structure of both the master index and the secondary index, and whether to merge the indexes is decided from the number of documents in the secondary index and the load of the system.
The experimental data are Chinese news Web pages crawled from the Internet. The news content of each page is extracted as text, and each file is one news article; there are 1,000,000 articles, 2.68 GB in total.
The experiment mainly examines the following two questions:
1) With a secondary index, how long does it take to update one document, and does this meet real-time requirements?
2) How does using a secondary index affect retrieval efficiency?
The experiment measured the average time required to add, delete, and modify one document for different secondary index sizes (measured in number of documents held). The results are shown in Table 1. When the secondary index holds fewer than 10,000 documents, additions and deletions are fast (all below 0.3 second). In other words, as long as the number of documents in the secondary index is kept below 10,000, the method of the invention has good real-time behavior. The results also show that the speed of deletions is not affected by the secondary index, and that a modification is a deletion combined with an addition, so its time is approximately the sum of the two.
Table 1: Incremental indexing time versus number of documents in the secondary index (master index: 1,000,000 documents)
Secondary index documents | Add time (s) | Delete time (s) | Modify time (s)
1 | 0.010 | 0.205 | 0.220
10 | 0.042 | 0.213 | 0.292
100 | 0.051 | 0.204 | 0.271
1,000 | 0.070 | 0.244 | 0.306
10,000 | 0.223 | 0.200 | 0.439
100,000 | 3.31 | 0.376 | 4.05
The results in Table 2 show that the index update time is independent of the size of the master index, because the update is carried out entirely on the secondary index.
Table 2: Incremental indexing time versus number of documents in the master index (secondary index: 10,000 documents)
Master index documents | Add time (s) | Delete time (s) | Modify time (s)
1,000 | 0.223 | 0.200 | 0.439
10,000 | 0.223 | 0.200 | 0.439
100,000 | 0.223 | 0.200 | 0.439
1,000,000 | 0.223 | 0.200 | 0.439
To examine the effect of the secondary index on retrieval speed, 100 search terms were used in the experiment and the average retrieval time was calculated. The results are shown in Table 3. The time without a secondary index is taken as the baseline, and the increase over it can be regarded as the overhead of the secondary index. When the secondary index holds 10,000 documents or fewer, the overhead is below 5% in every case, which users can hardly perceive.
Table 3: Retrieval time versus number of documents in the secondary index (master index: 1,000,000 documents)
Secondary index size | Retrieval time (s) | Secondary index overhead
0 | 0.422 | 0%
1 | 0.430 | 1.8%
10 | 0.429 | 1.7%
100 | 0.431 | 2.1%
1,000 | 0.433 | 2.6%
10,000 | 0.439 | 4.1%
100,000 | 0.981 | 132%
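The overhead in Table 3 is the increase in retrieval time relative to the time without a secondary index. The sketch below recomputes it from the rounded times listed in the table; the function name `overhead_percent` is illustrative. The recomputed values agree with the reported column to within about half a percentage point, and the small differences presumably come from rounding of the published times.

```python
def overhead_percent(retrieval_time, baseline_time):
    """Relative increase over the baseline, in percent."""
    return (retrieval_time - baseline_time) / baseline_time * 100

baseline = 0.422   # seconds, secondary index size 0

# secondary index size -> (retrieval time in seconds, reported overhead in percent)
table3 = {
    1: (0.430, 1.8),
    10: (0.429, 1.7),
    100: (0.431, 2.1),
    1_000: (0.433, 2.6),
    10_000: (0.439, 4.1),
    100_000: (0.981, 132.0),
}

recomputed = {size: overhead_percent(t, baseline) for size, (t, _) in table3.items()}
for size, value in recomputed.items():
    print(f"{size:>7}: recomputed {value:.1f}%, reported {table3[size][1]}%")
```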
Taken together, the experimental results show that the method proposed by the invention achieves online index updating for a text retrieval system and guarantees the real-time behavior and consistency of index updates while retaining good retrieval performance.
Obviously, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. If these changes and modifications fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to include them as well.