CN100495400C - Indexes on-line updating method of full text retrieval system - Google Patents

Indexes on-line updating method of full text retrieval system

Info

Publication number
CN100495400C
Authority
CN
China
Prior art keywords
index
secondary
document
full
main
Prior art date
Application number
CN 200610112800
Other languages
Chinese (zh)
Other versions
CN101136016A (en)
Inventor
缙 刘
吴於茜
李月敏
杨建武
Original Assignee
北大方正集团有限公司;北京大学;北京北大方正技术研究院有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北大方正集团有限公司;北京大学;北京北大方正技术研究院有限公司 filed Critical 北大方正集团有限公司;北京大学;北京北大方正技术研究院有限公司
Priority to CN 200610112800 priority Critical patent/CN100495400C/en
Publication of CN101136016A publication Critical patent/CN101136016A/en
Application granted granted Critical
Publication of CN100495400C publication Critical patent/CN100495400C/en

Abstract

The present invention discloses a new method for online index updating in a full-text retrieval system. The method achieves online index updating by means of an auxiliary index. Newly added documents are added to the auxiliary index, and deleted documents are marked in a Boolean vector, so that the index is updated in real time and remains consistent. Merging the auxiliary index into the main index avoids the performance degradation that long-running update operations would otherwise cause. The method guarantees real-time, consistent index updates without affecting the retrieval function of the full-text retrieval system.

Description

A method for online index updating in a full-text retrieval system

Technical Field

The present invention belongs to the field of intelligent information processing technology and specifically relates to a method for online index updating in a full-text retrieval system.

Background Art

With the rapid development of computer and network technology, the number of electronic documents has grown sharply. How to find the required information quickly, comprehensively, and accurately within this mass of information has become a matter of general concern and a hot research topic. Most electronic documents are unstructured text data written in natural language, and full-text retrieval is currently an important means of processing text data.

Full-text retrieval can be implemented in several ways, including inverted indexes, suffix arrays, and signature files.

An ordinary index maps a document number to all the words in that document. An inverted index reverses this relationship and maps each word to the numbers of all documents in which the word appears, so that all documents containing given words can be retrieved quickly by word. In practice, an inverted index usually also records information such as the number of times the word occurs in each document and the specific positions of those occurrences. To facilitate retrieval, posting lists are usually kept sorted.

The following is an example of an inverted index. Suppose there are two articles, 1 and 2:

The content of article 1 is: Tom lives in Guangzhou, I live in Guangzhou too. The content of article 2 is: He once lived in Shanghai.

1) First we need to obtain the keywords of the two articles, which usually requires the following processing:

a. What we have is the article content, that is, a character string. We first need to find all the words in the string, that is, to perform word segmentation. English words are separated by spaces and are therefore relatively easy to handle. Chinese words run together and require special word segmentation.

b. Words such as "in", "once", and "too" in the articles carry little actual meaning, and Chinese characters such as "的" and "是" usually have no specific meaning either. Such words, which do not represent concepts, can be filtered out.

c. A user searching for "He" usually also wants to find articles containing "he" and "HE", so the case of all words needs to be unified.

d. A user searching for "live" usually also wants to find articles containing "lives" and "lived", so "lives" and "lived" need to be reduced to "live".

e. Punctuation marks in an article usually do not represent any concept and can also be filtered out.

After the above processing, the keywords of article 1 are: [tom] [live] [guangzhou] [i] [live] [guangzhou].

The keywords of article 2 are: [he] [live] [shanghai].

2) Once we have the keywords, we can build the inverted index. The correspondence above is from "article number" to "all keywords in the article". The inverted index reverses this relationship into: from "keyword" to "the numbers of all articles containing that keyword". After inversion, articles 1 and 2 become:

Keyword      Article numbers
guangzhou    1
he           2
i            1
live         1, 2
shanghai     2
tom          1
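Processing steps a through e and the inversion in step 2) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the stop-word set, the toy `STEMS` lookup table, and the function names `extract_keywords` and `build_inverted_index` are hypothetical and chosen only to reproduce the two-article example.

```python
import re
from collections import defaultdict

# Assumed stop words, taken from the example in the text (step b).
STOP_WORDS = {"in", "once", "too"}

# Toy stemming table (step d); a real system would use a proper stemmer.
STEMS = {"lives": "live", "lived": "live"}


def extract_keywords(text):
    """Return the keyword sequence of one article (steps a to e)."""
    # Steps a and e: keep runs of letters only, which also drops punctuation.
    words = re.findall(r"[A-Za-z]+", text)
    keywords = []
    for word in words:
        word = word.lower()                      # step c: unify case
        if word in STOP_WORDS:                   # step b: drop stop words
            continue
        keywords.append(STEMS.get(word, word))   # step d: reduce to base form
    return keywords


def build_inverted_index(articles):
    """Map each keyword to the sorted list of article numbers containing it."""
    index = defaultdict(set)
    for number, text in articles.items():
        for keyword in extract_keywords(text):
            index[keyword].add(number)
    return {keyword: sorted(numbers) for keyword, numbers in sorted(index.items())}


articles = {
    1: "Tom lives in Guangzhou, I live in Guangzhou too.",
    2: "He once lived in Shanghai.",
}
inverted = build_inverted_index(articles)
for keyword, numbers in inverted.items():
    print(keyword, numbers)
```

Keeping the article numbers of each keyword sorted mirrors the remark that posting lists are usually ordered.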

Knowing only which articles a keyword appears in is usually not enough. We also need to know how many times the keyword occurs in an article and where. There are usually two kinds of position: a) character position, which records the word's character offset within the article (the advantage is fast positioning when highlighting keywords); b) keyword position, which records the word's ordinal position among the article's keywords (the advantages are a smaller index and faster phrase queries).

After adding the "frequency" and "position" information, our index structure becomes:

Keyword      Article number[frequency]   Positions
guangzhou    1[2]                        3, 6
he           2[1]                        1
i            1[1]                        4
live         1[2], 2[1]                  2, 5, 2
shanghai     2[1]                        3
tom          1[1]                        1

Take the row for live as an example of how to read this structure. live occurs twice in article 1 and once in article 2, and its position list is "2, 5, 2". What does this mean? It has to be read together with the article numbers and frequencies: live occurs twice in article 1, so "2, 5" gives its two positions in article 1; it occurs once in article 2, so the remaining "2" means that live is the second keyword in article 2.
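The structure with frequencies and keyword positions can be sketched as below. The sketch assumes the keyword lists from the example are already available; the names `build_positional_index` and `format_entry` are illustrative only, and the nested-dictionary layout is one possible in-memory representation rather than the layout the patent prescribes.

```python
from collections import defaultdict


def build_positional_index(keyword_lists):
    """Map keyword -> {article number: [1-based keyword positions]}."""
    index = defaultdict(dict)
    for number, keywords in keyword_lists.items():
        for position, keyword in enumerate(keywords, start=1):
            index[keyword].setdefault(number, []).append(position)
    return dict(index)


def format_entry(postings):
    """Render 'article[frequency], ...' followed by the flat position list."""
    ordered = sorted(postings.items())
    heads = ", ".join(f"{number}[{len(positions)}]" for number, positions in ordered)
    flat = ", ".join(str(p) for _, positions in ordered for p in positions)
    return f"{heads}  {flat}"


keyword_lists = {
    1: ["tom", "live", "guangzhou", "i", "live", "guangzhou"],
    2: ["he", "live", "shanghai"],
}
positional = build_positional_index(keyword_lists)
for keyword in sorted(positional):
    print(keyword, format_entry(positional[keyword]))
```

The frequency is not stored separately here: it is simply the length of each per-article position list, which is why the flat list "2, 5, 2" can be split again using the frequencies.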

The suffix array index is a highly space-efficient text index structure proposed by Manber and Myers in 1993. It records the lexicographic order of the suffixes of a text: it is a list of the starting positions of all suffixes of the text, arranged in the lexicographic order of those suffixes.
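A suffix array can be illustrated with a few lines of Python. The sketch below uses a deliberately naive construction (sorting the suffixes directly), not the more efficient algorithm of Manber and Myers, and the sample text "banana" and the function names are assumptions made for illustration.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted start positions of pattern, found by binary search."""
    m = len(pattern)
    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[start:lo])


text = "banana"
suffix_array = build_suffix_array(text)
print(suffix_array)                                   # [5, 3, 1, 0, 4, 2]
print(find_occurrences(text, suffix_array, "ana"))    # [1, 3]
```

Because the suffixes are sorted, all suffixes that begin with the pattern occupy one contiguous range of the array, which is what makes the two binary searches sufficient.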

A signature file is built by hashing the keywords of a document into bit strings of F bits: the keywords of the original document are visited sequentially, and the resulting bit strings are stored in the file in order.

The matching idea is as follows. Suppose we want to determine whether string A matches string B. First hash A and B into the numbers hash(A) and hash(B). If hash(A) != hash(B), then A != B; however, hash(A) = hash(B) does not imply A = B.

The following is a concrete matching example:

Keyword x[0..5]: AACTCT, Hash(x[0..5]) = 17579;

Text y[0..9]: GCAACTCTCA, Hash(y[0..5]) = 17819;

Text y[0..9]: GCAACTCTCA, Hash(y[1..6]) = 17533;

Text y[0..9]: GCAACTCTCA, Hash(y[2..7]) = 17579.
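The matching idea can be sketched as a sliding-window comparison of hash values. The hash function below is an assumed stand-in: the text does not say which hash produced the values 17579, 17819, and 17533, so this sketch does not try to reproduce them. For simplicity it also recomputes the hash of every window instead of updating it incrementally.

```python
def window_hash(s):
    """Simple polynomial hash; an assumption, not the hash used in the text."""
    value = 0
    for ch in s:
        value = (value * 256 + ord(ch)) % 1_000_003
    return value


def hash_match(pattern, text):
    """Return the start positions at which pattern occurs in text."""
    m = len(pattern)
    target = window_hash(pattern)
    matches = []
    for start in range(len(text) - m + 1):
        window = text[start:start + m]
        # Different hashes prove a mismatch. Equal hashes only make a match
        # possible, so the candidate is confirmed by direct comparison.
        if window_hash(window) == target and window == pattern:
            matches.append(start)
    return matches


print(hash_match("AACTCT", "GCAACTCTCA"))   # [2]
```

The match at position 2 corresponds to the window y[2..7] in the example, the only window whose hash equals that of the keyword.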

A signature file has the following advantages:

1) The file organization is simple and essentially follows the order of the original document;

2) It is easy to maintain: generation, insertion, and deletion are all convenient;

3) It requires little space, especially when superimposed coding is used.

Of these structures, the inverted index is the most widely used; it performs very well for word-based queries.

In practice, a document collection usually changes continuously: new content is added, and outdated content is deleted or updated. If the index is not updated promptly as the collection changes, the quality of retrieval results keeps declining: newly added documents cannot be retrieved, or documents that no longer exist or whose content has changed are retrieved. The index must therefore be updated continuously so that it reflects changes in the document collection in a timely manner.

The simplest way to update an index is to rebuild it offline, that is, to discard the outdated index and rebuild it completely from the latest data. Because Web search engines update large volumes of data and demand high retrieval efficiency, they mostly took this approach in their early days.

Another common class of index update methods is online updating. A typical online update method is the update strategy adopted by Clarke et al. in the full-text retrieval system MultiText. The MultiText index structure is stored on disk as a circular file whose end joins its beginning. (Ordinary file systems do not directly support circular files, but a circular file can be simulated with an ordinary file through an abstraction layer.) At any time, this file consists of three contiguous parts: the index to be updated, the updated index, and free space.

At retrieval time, it is first necessary to determine which part of the index the query term falls in. Because the index is arranged on disk in lexicographic order, only the boundary between the two index parts needs to be remembered, and no disk access is required. Because both index parts (to be updated and already updated) have a complete inverted index structure, the index entry can be found by the usual method; ideally, a single disk access retrieves the required posting list (array of positions).

At update time, the postings generated by processing newly added documents are held temporarily in a memory buffer. A background process continuously reads the to-be-updated part of the index, merges it with the postings in memory, and appends the result to the end of the updated part. During this process the to-be-updated part keeps shrinking and the updated part keeps growing, until the entire to-be-updated part has been converted into the updated part. Although MultiText's online update strategy achieves continuous index updating and good retrieval efficiency, it has several shortcomings:

> It is suitable only for adding new documents, not for applications that frequently delete and modify documents;

> It cannot guarantee real-time behavior: before a newly added document is guaranteed to be retrievable by users, at least one complete update cycle must elapse;

> It cannot guarantee consistency: during merging, the dictionary is always divided into an updated part and a not-yet-updated part, so a newly added document may sometimes be retrievable and sometimes not.

The foregoing analysis shows that the difficulty of index updating is that updating a few documents often requires rewriting most of the index, even though the vast majority of documents in the index are unrelated to the update. Taking MultiText as an example, updating even a single document requires rewriting the entire index.

Summary of the Invention

The object of the present invention is to provide a new method for online index updating in a full-text retrieval system that guarantees real-time, consistent index updates without affecting the retrieval function of the full-text retrieval system.

The invention is implemented as follows: a new method for online index updating in a full-text retrieval system, comprising the following steps:

1) Divide the index into two parts, a main index and an auxiliary index; the auxiliary index has the same structure as the main index, is stored completely both in memory and on disk, and temporarily holds recently added documents;

2) Read the content to be updated in the index;

3) Determine whether the operation type of the content to be updated is an add or a delete operation, and process it as follows:

A: For an add operation, add the content to be updated to the auxiliary index;

B: For a delete operation, save the document deletion information in the auxiliary index; the deletion information is stored as a Boolean vector in which each document corresponds to one bit.

The main index and the auxiliary index are divided according to the following criterion: the main index consists of the vast majority of documents, which rarely change, and the auxiliary index consists of the small number of documents that change frequently.

Further, determine whether the auxiliary index needs to be merged into the main index; if so, merge the auxiliary index and the document deletion information into the main index, and clear the merged auxiliary index and deletion information.

Further, determine whether there is more content to be updated in the index; if so, jump to step 2); otherwise, determine whether there is a request to terminate index updating; if so, end the operation; otherwise, wait for a period of time and then repeat the determination.

Whether the auxiliary index needs to be merged into the main index is determined according to criterion A, B, or C below:

A: A file size or a number of documents is preset for the auxiliary index, and the merge is performed when the preset file size or number of documents is exceeded;

B: The merge is performed when the system load falls below a preset parameter;

C: A combination of A and B.
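The three merge criteria can be expressed as a small policy object. This is only a sketch: the class name `MergePolicy`, the field names, and every threshold value are hypothetical, and the text does not say how A and B are combined in criterion C, so the sketch assumes that either condition alone triggers a merge.

```python
from dataclasses import dataclass


@dataclass
class MergePolicy:
    """Hypothetical thresholds; the text leaves concrete values to the application."""
    max_documents: int = 10_000        # criterion A: document-count limit
    max_bytes: int = 64 * 1024 ** 2    # criterion A: file-size limit
    idle_load: float = 0.2             # criterion B: load below this counts as idle
    mode: str = "A"                    # "A", "B", or "C"

    def size_exceeded(self, documents, size_bytes):
        return documents > self.max_documents or size_bytes > self.max_bytes

    def system_idle(self, load):
        return load < self.idle_load

    def should_merge(self, documents, size_bytes, load):
        if self.mode == "A":
            return self.size_exceeded(documents, size_bytes)
        if self.mode == "B":
            return self.system_idle(load)
        # Mode "C": assumed reading of "combination" as "either criterion".
        return self.size_exceeded(documents, size_bytes) or self.system_idle(load)


policy = MergePolicy(mode="C")
print(policy.should_merge(documents=500, size_bytes=1_000, load=0.05))
```

An application that must never merge under load could instead require both conditions in mode "C"; that variant is equally compatible with the wording of the text.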

The main index and the auxiliary index may take the form of index structures such as inverted indexes, suffix arrays, and signature files.

The specific division between the main index and the auxiliary index must be decided according to the specific application environment, including the application's total data volume, the amount of data added per day or per hour, and the hardware configuration.

The effect of the present invention is as follows: by using an auxiliary index to update the index of a full-text retrieval system online, the invention guarantees real-time, consistent index updates without affecting the retrieval function of the system. Experiments show that in an ordinary PC environment (CPU: P4 2.0G; memory: 1.0 GB), full-text retrieval implemented according to the invention achieves real-time online index updating while preserving index integrity. In the experiments, when the auxiliary index holds fewer than 10,000 documents, add operations are fast (all under 0.3 seconds), the speed of delete operations is unaffected by the auxiliary index, and a modify operation, being a combination of a delete and an add, takes roughly the sum of the two.

Brief Description of the Drawings

Fig. 1 is a flowchart of the method of the present invention.

Detailed Description of the Embodiments

The specific embodiments of the present invention are described in detail below with reference to the accompanying drawing.

In practical applications, update operations on an index are found to exhibit locality. Based on this characteristic, the method of the present invention divides the index into two parts: a main index composed of the vast majority of documents, which rarely change, and a small auxiliary index composed of recently and frequently changed documents.

What counts as "the vast majority" and "rarely changing" must be decided according to the specific application environment, including the application's total data volume, the amount of data added per day or per hour, the hardware configuration, and so on. For example, an application can place each day's updates in the auxiliary index and merge the auxiliary index into the main index in the middle of the night, when the system is idle.

Because the auxiliary index is small, update operations on it complete quickly, which guarantees real-time behavior. The time, temporary space, and computing resources that an update requires are also small, which avoids the consistency problems caused by step-by-step updating.

If the auxiliary index were kept on disk like the main index, it would introduce a performance problem. Retrieval performance depends on the number of disk accesses, so with the auxiliary index on disk, a retrieval that originally needed one disk access would need at least two, nearly doubling the cost. Since the auxiliary index is far smaller than the main index, it can be held entirely in memory. For consistency, however, the auxiliary index cannot reside only in memory; otherwise a system failure would lose its entire content and leave the index incomplete. A backup copy on disk is therefore also required.

The auxiliary index in the present invention is thus an index that has the same structure as the main index but is stored completely both in memory and on disk, and that temporarily holds the index of recently added documents.
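One way to keep a complete copy of the auxiliary index both in memory and on disk is sketched below. The append-only JSON log used as the disk copy is an assumption made for this illustration; the text only requires that a complete copy exist on disk, and the class name `AuxiliaryIndex` and its methods are hypothetical.

```python
import json
import os
import tempfile
from collections import defaultdict


class AuxiliaryIndex:
    """In-memory inverted index mirrored by an append-only log file on disk."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.postings = defaultdict(list)    # keyword -> [document ids]
        if os.path.exists(log_path):
            self._recover()

    def _apply(self, doc_id, keywords):
        for keyword in sorted(set(keywords)):
            self.postings[keyword].append(doc_id)

    def _recover(self):
        # Rebuild the in-memory copy from the disk copy after a restart.
        with open(self.log_path, encoding="utf-8") as log:
            for line in log:
                record = json.loads(line)
                self._apply(record["doc_id"], record["keywords"])

    def add(self, doc_id, keywords):
        # Write to disk first so a crash cannot lose an acknowledged document.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"doc_id": doc_id, "keywords": keywords}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self._apply(doc_id, keywords)

    def search(self, keyword):
        # Served entirely from memory: no disk access on the query path.
        return list(self.postings.get(keyword, []))


path = os.path.join(tempfile.mkdtemp(), "auxiliary.log")
aux = AuxiliaryIndex(path)
aux.add(1, ["tom", "live", "guangzhou"])
aux.add(2, ["he", "live", "shanghai"])

restarted = AuxiliaryIndex(path)    # simulates a restart after a failure
print(restarted.search("live"))     # [1, 2]
```

Queries touch only the in-memory copy, so the auxiliary index adds no disk access to a search, while the disk copy exists solely for recovery.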

An example of a specific implementation of the present invention follows.

The invention was tested in an ordinary PC environment (CPU: P4 2.0G; memory: 1.0 GB), with online index updating for full-text retrieval implemented according to the method of the invention. As shown in Fig. 1, the method comprises the following steps:

1) Read the content to be indexed;

2) Determine whether the operation type is a delete operation or an add operation;

3) If the operation type is adding a document, go to step 4; if the operation type is deleting a document, go to step 8;

4) Add the index of the new document's content to the auxiliary index;

5) Determine whether the auxiliary index needs to be merged into the main index; if so, go to step 6; otherwise jump to step 9;

6) Merge the auxiliary index and the document deletion information into the main index;

7) Clear the auxiliary index and the Boolean vector that stores the document deletion information, then jump to step 9;

8) For a delete operation, set the corresponding bit of the Boolean vector that stores the document deletion information to 1;

9) Determine whether there is more content to be indexed; if so, jump to step 1; otherwise go to step 10;

10) Determine whether there is a request to terminate indexing; if so, exit; otherwise wait for a period of time and then jump to step 9.
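The control flow of steps 1 to 10 can be sketched as a loop over a queue of pending operations. Everything concrete here is an assumption for illustration: the operation tuples, the dictionary used as a stand-in for each index (document id to keywords rather than a real inverted index), the one-byte-per-document `bytearray` standing in for the Boolean vector, and the names `run_updates`, `merge`, and `needs_merge`.

```python
import queue
import threading
import time


def merge(index):
    """Fold the auxiliary index into the main index, dropping deleted documents."""
    index["main"].update(index["auxiliary"])            # step 6
    for doc_id, flag in enumerate(index["deleted"]):
        if flag:
            index["main"].pop(doc_id, None)
    index["auxiliary"].clear()                          # step 7
    index["deleted"][:] = bytes(len(index["deleted"]))


def run_updates(pending, stop_requested, index, needs_merge, poll_seconds=0.01):
    """Drain ("add", doc_id, keywords) and ("delete", doc_id) operations."""
    while True:
        try:
            operation = pending.get_nowait()            # steps 9 and 1
        except queue.Empty:
            if stop_requested.is_set():                 # step 10: termination request
                return
            time.sleep(poll_seconds)                    # step 10: wait, then recheck
            continue
        if operation[0] == "add":                       # steps 2 and 3
            _, doc_id, keywords = operation
            index["auxiliary"][doc_id] = keywords       # step 4
            if needs_merge(index):                      # step 5
                merge(index)                            # steps 6 and 7
        else:
            index["deleted"][operation[1]] = 1          # step 8


index = {"main": {}, "auxiliary": {}, "deleted": bytearray(16)}
pending = queue.Queue()
for op in [("add", 1, ["tom"]), ("add", 2, ["he"]), ("delete", 1),
           ("add", 3, ["live"]), ("add", 4, ["shanghai"])]:
    pending.put(op)
stop_requested = threading.Event()
stop_requested.set()    # ask the loop to exit once the queue is empty
run_updates(pending, stop_requested, index, lambda ix: len(ix["auxiliary"]) >= 2)
print(sorted(index["main"]))    # [2, 3, 4]
```

With a merge threshold of two auxiliary documents, the second merge removes document 1 from the main index because its deletion flag was set between the two merges.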

For the delete operation in the steps above, a Boolean vector is used. Each bit of the Boolean vector corresponds to one document. When a document is deleted, the corresponding bit is set to "1". Both the retrieval algorithm and the index merge algorithm skip documents whose bit is "1", which from the application's point of view achieves the effect of deletion. When the indexes are merged, the documents marked "1" are skipped by the merge algorithm and thus truly disappear from the main index.
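The way retrieval and merging both skip flagged documents can be sketched with inverted-index posting lists. This is a minimal illustration, not the patented implementation: the class `TwoPartIndex` and its methods are hypothetical, the Boolean vector is approximated by a `bytearray` with one byte per document, and document ids are assumed never to be reused.

```python
class TwoPartIndex:
    """Main and auxiliary inverted indexes sharing one Boolean deletion vector."""

    def __init__(self, capacity):
        self.main = {}                        # keyword -> sorted document ids
        self.auxiliary = {}                   # same structure as the main index
        self.deleted = bytearray(capacity)    # deleted[doc_id] == 1 means deleted

    def add(self, doc_id, keywords):
        for keyword in set(keywords):
            self.auxiliary.setdefault(keyword, []).append(doc_id)

    def delete(self, doc_id):
        self.deleted[doc_id] = 1              # no posting list is rewritten

    def search(self, keyword):
        hits = self.main.get(keyword, []) + self.auxiliary.get(keyword, [])
        return sorted(d for d in hits if not self.deleted[d])

    def merge(self):
        for keyword in set(self.main) | set(self.auxiliary):
            merged = self.main.get(keyword, []) + self.auxiliary.get(keyword, [])
            kept = sorted(d for d in merged if not self.deleted[d])
            if kept:
                self.main[keyword] = kept
            else:
                self.main.pop(keyword, None)
        self.auxiliary.clear()
        self.deleted = bytearray(len(self.deleted))


idx = TwoPartIndex(capacity=8)
idx.add(1, ["tom", "live"])
idx.add(2, ["he", "live"])
idx.merge()
idx.add(3, ["live"])
idx.delete(2)
print(idx.search("live"))    # [1, 3]
print(idx.main["live"])      # [1, 2]
idx.merge()
print(idx.main["live"])      # [1, 3]
```

Before the second merge, document 2 is still physically present in the main index and is hidden only by its flag; the merge is what removes it for good.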

Because the present invention can use an inverted index structure as well as suffix arrays or signature files, the method of operation is the same in each case. This experiment uses the inverted index structure for both the main index and the auxiliary index, and decides whether to merge the indexes according to the number of documents in the auxiliary index and how busy the system is. The experimental data are Chinese news web pages crawled from the Internet; the news content of each page was extracted as a text file, one news article per file, for a total of 1 million articles and 2.68 GB.

The experiment mainly examines the following two questions:

> After introducing the auxiliary index, how long does it take to update one document, and does this meet the real-time requirement?

> How does the auxiliary index affect retrieval efficiency?

The experiment measured the average time required to incrementally add, delete, and update one document for different auxiliary index sizes (measured in number of documents held). The results are shown in Table 1. When the auxiliary index holds fewer than 10,000 documents, add and delete operations are fast (all under 0.3 seconds). In other words, as long as the number of documents in the auxiliary index is limited to 10,000 or fewer, the method has good real-time behavior. The results also show that the speed of delete operations is unaffected by the auxiliary index, and that a modify operation, being a combination of a delete and an add, takes roughly the sum of the two. The experiment shows that the method of the present invention has good real-time behavior.

Table 1: Incremental indexing time versus number of documents in the auxiliary index, with 1 million documents in the main index

Auxiliary index documents   Add (s)   Delete (s)   Modify (s)
1                           0.010     0.205        0.220
10                          0.042     0.213        0.292
100                         0.051     0.204        0.27[?]
1,000                       0.070     0.244        0.306
10,000                      0.223     0.200        0.439
100,000                     3.31      0.376        4.05

The results in Table 2 show that the index update time is independent of the size of the main index, because the update process is carried out entirely on the auxiliary index.

Table 2: Incremental indexing time versus number of documents in the main index, with 10,000 documents in the auxiliary index

Main index documents   Add (s)   Delete (s)   Modify (s)
1,000                  0.223     0.200        0.439
10,000                 0.223     0.200        0.439
100,000                0.223     0.200        0.439
1,000,000              0.223     0.200        0.439

To examine the effect of the auxiliary index on retrieval speed, the experiment ran searches with 100 query terms and computed the average retrieval time. The results are shown in Table 3. Taking the time without an auxiliary index as the baseline, the additional time can be regarded as the overhead of the auxiliary index. When the auxiliary index holds 10,000 documents or fewer, the overhead is below 5%, which is effectively imperceptible to users.

Table 3: Retrieval speed versus number of documents in the auxiliary index, with 1 million documents in the main index

(Table 3: see original document, page 13.)

Taken together, the experimental results show that the method proposed by the present invention achieves online index updating for a full-text retrieval system and guarantees real-time, consistent index updates while maintaining good retrieval performance.

... the spirit and scope of the present invention. Thus, if these modifications and variations of the present invention fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

Claims (7)

1、一种全文检索系统的索引在线更新方法,包括以下步骤:1)将索引库分成两部分:主索引库和辅助索引库;所述辅助索引库与主索引库的结构相同,且所述辅助索引库完整存储于内存和磁盘上,负责暂存最近新增文档;2)读取待更新索引的内容;3)判断待更新索引的操作类型是新增还是删除操作,分别进行如下处理:A:如是新增操作,在辅助索引库中加入待更新索引的内容,B:如是删除操作,在辅助索引库中保存文档删除信息,所述文档删除信息采用布尔向量进行保存,每个文档对应于布尔向量的一位。 A full text search method for indexing online update system, comprising the steps of: 1) the index repository into two parts: the main index and a secondary index repository database; same structure as the main index with the index repository auxiliary database, and the secondary indexes complete library stored on disk and memory, responsible for staging recently added documents; 2) to read the contents of the index to be updated; 3) determine the index to be updated or new type of operation is a delete operation, were treated as follows: a: new case, the contents of the index to be added to update the index database in the auxiliary, B: the case of a delete operation, stored in the secondary index database document deletion information, deletes the information of the document is saved using Boolean vectors, each document corresponding to in a Boolean vector.
2. The online index updating method for a full-text retrieval system according to claim 1, characterized in that the main index library and the secondary index library are distinguished as follows: the main index library consists of the vast majority of documents, which rarely change, and the secondary index consists of the small number of documents that change frequently.
3. The online index updating method for a full-text retrieval system according to claim 2, characterized in that step 3) further comprises the following operation: 4) determining whether the secondary index needs to be merged into the main index; if so, merging the secondary index to be merged, together with the document deletion information, into the main index, and clearing the merged secondary index and document deletion information.
4. The online index updating method for a full-text retrieval system according to claim 3, characterized in that step 4) further comprises the following operation: 5) determining whether there is still content awaiting an index update; if so, jumping to step 2); otherwise, determining whether there is a request to terminate index updating; if there is, ending the operation; otherwise, waiting for a period of time and then repeating the determination.
5. The online index updating method for a full-text retrieval system according to claim 4, characterized in that whether the secondary index needs to be merged into the main index is determined according to one of the following criteria A, B, or C: A: a file size or a number of contained documents is preset for the secondary index, and the merge is performed when the preset file size or number of documents is exceeded; B: the merge is performed when the system load is below a preset parameter; C: a combination of A and B.
6. The online index updating method for a full-text retrieval system according to any one of claims 1-5, characterized in that the main index and the secondary index may take the form of index structures such as an inverted index, a suffix array, or a signature file.
7. The online index updating method for a full-text retrieval system according to claim 2, characterized in that the specific division between the main index library and the secondary index library is determined by the specific application environment, including the total data volume of the application, the amount of data added per day or per hour, and the hardware configuration.
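The merge decision of claim 5 and the merge operation of claim 3 can be illustrated with a minimal Python sketch. It is not the patented implementation: the class `OnlineIndex`, its fields, the threshold values, and the `load` argument are all hypothetical names chosen for illustration. The sketch assumes an inverted index (one of the structures named in claim 6), uses a set of document ids in place of the Boolean deletion vector, assumes document ids are never reused, and reads criterion C as "A and B both hold", which the claim does not specify.

```python
from dataclasses import dataclass, field


@dataclass
class OnlineIndex:
    """Toy main/secondary inverted index (all names are hypothetical)."""
    max_secondary_docs: int = 1000   # criterion A threshold (assumed value)
    max_load: float = 0.5            # criterion B threshold (assumed value)
    main: dict = field(default_factory=dict)        # term -> set of doc ids
    secondary: dict = field(default_factory=dict)   # term -> set of doc ids
    secondary_docs: set = field(default_factory=set)
    deleted: set = field(default_factory=set)       # stands in for the Boolean vector

    def add(self, doc_id, text):
        """New documents go into the secondary index only."""
        self.secondary_docs.add(doc_id)
        for term in text.lower().split():
            self.secondary.setdefault(term, set()).add(doc_id)

    def delete(self, doc_id):
        """Deletion only marks the document; no index is rewritten."""
        self.deleted.add(doc_id)

    def search(self, term):
        """Query both indexes and filter out documents marked as deleted."""
        hits = self.main.get(term, set()) | self.secondary.get(term, set())
        return hits - self.deleted

    def should_merge(self, load, mode="C"):
        """Criteria A, B, or C of claim 5 (C is assumed to mean A and B)."""
        too_big = len(self.secondary_docs) > self.max_secondary_docs  # A
        idle = load < self.max_load                                   # B
        if mode == "A":
            return too_big
        if mode == "B":
            return idle
        return too_big and idle

    def merge(self):
        """Fold the secondary index and deletions into the main index, then clear them."""
        for term, ids in self.secondary.items():
            self.main.setdefault(term, set()).update(ids)
        for term in list(self.main):
            self.main[term] -= self.deleted
            if not self.main[term]:
                del self.main[term]
        self.secondary.clear()
        self.secondary_docs.clear()
        self.deleted.clear()
```

In this sketch, searches remain answerable before and after `merge()` because queries always consult both indexes and subtract the deletion marks; a real system would additionally need to coordinate concurrent readers during the merge, which is outside the scope of the example.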
CN 200610112800 2006-09-01 2006-09-01 Indexes on-line updating method of full text retrieval system CN100495400C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200610112800 CN100495400C (en) 2006-09-01 2006-09-01 Indexes on-line updating method of full text retrieval system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 200610112800 CN100495400C (en) 2006-09-01 2006-09-01 Indexes on-line updating method of full text retrieval system

Publications (2)

Publication Number Publication Date
CN101136016A (en) 2008-03-05
CN100495400C (en) 2009-06-03

Family

ID=39160117

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200610112800 CN100495400C (en) 2006-09-01 2006-09-01 Indexes on-line updating method of full text retrieval system

Country Status (1)

Country Link
CN (1) CN100495400C (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101408882B (en) 2008-08-05 2012-10-31 北京大学 Method and system for searching authorization document
CN102096676B (en) * 2009-12-11 2014-04-09 中国移动通信集团公司 Data updating and query control method and system
US8244700B2 (en) * 2010-02-12 2012-08-14 Microsoft Corporation Rapid update of index metadata
CN102270201B (en) * 2010-06-01 2013-07-17 富士通株式会社 Multi-dimensional indexing method and device for network files
CN102004800A (en) * 2010-12-28 2011-04-06 北京数码大方科技有限公司 Data query method and device of PDM (Product Data Management) system
CN102081649B (en) * 2010-12-31 2012-08-15 深圳联友科技有限公司 Method and system for searching computer files
CN102890682B (en) * 2011-07-21 2017-08-01 腾讯科技(深圳)有限公司 A method of indexing, retrieval method, apparatus and system for
CN103186622B (en) * 2011-12-30 2016-03-30 北大方正集团有限公司 One kind of full-text retrieval system, index information update method and apparatus
CN103207872A (en) * 2012-01-17 2013-07-17 深圳市快播科技有限公司 Real-time indexing method and server
US9245003B2 (en) * 2012-09-28 2016-01-26 Emc Corporation Method and system for memory efficient, update optimized, transactional full-text index view maintenance
CN104424267A (en) * 2013-08-29 2015-03-18 北大方正集团有限公司 Index data inserting method and index data inserting system
CN104077379A (en) * 2014-06-25 2014-10-01 北京海泰方圆科技有限公司 Method for index updating
CN104361009B (en) * 2014-10-11 2017-10-31 北京中搜网络技术股份有限公司 Real-time indexing method based on inverted index
CN104598550B (en) * 2014-12-31 2018-09-25 北京奇艺世纪科技有限公司 A kind of update method and device of Internet video index
CN104504144A (en) * 2015-01-05 2015-04-08 浪潮(北京)电子信息产业有限公司 Method and device for acquiring index-related information
CN104899249B (en) * 2015-05-04 2018-07-13 中国科学院信息工程研究所 Reliable index upgrade system and method under a kind of mass data
CN105512339A (en) * 2015-12-31 2016-04-20 深圳市朗科科技股份有限公司 File searcher and searching method
CN106484815B (en) * 2016-09-26 2019-04-12 北京赛思信安技术股份有限公司 A kind of automatic identification optimization method based on mass data class SQL retrieval scene
CN109144994A (en) * 2017-06-19 2019-01-04 华为技术有限公司 Index updating method, system and relevant apparatus

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1253643A (en) 1997-03-12 2000-05-17 明德特里尔股份有限公司 Computerized method for dynamically creating, modifying, removing and maintaining information in database
EP1408428A1 (en) 2002-10-09 2004-04-14 France Telecom System and procedure in the processing and visualisation of search results produced by an index based search engine, interface model and corresponding meta-models
CN1635494A (en) 2003-12-27 2005-07-06 海信集团有限公司 Method for implementing class memory database access and retrieval

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
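Design and Implementation of a Real-Time Updating Index Structure (一种实时更新索引结构的设计与实现). Wang Zhiqiang, Liu Jianyi. Computer Systems & Applications (计算机系统应用), No. 10, 2005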

Also Published As

Publication number Publication date
CN101136016A (en) 2008-03-05

Similar Documents

Publication Publication Date Title
US8886598B1 (en) Tag-based synchronization
CN1811757B (en) System and method for locating pages on the world wide web and for locating documents from a network of computers
US8037075B2 (en) Pattern index
CN102395962B (en) Composite hash and list partitioning of database tables
US7860907B2 (en) Data processing
US6801904B2 (en) System for keyword based searching over relational databases
JP4306152B2 (en) Web system with a clustered application server and database structure
US20060041606A1 (en) Indexing system for a computer file store
US8051045B2 (en) Archive indexing engine
US7293005B2 (en) Pipelined architecture for global analysis and index building
CN101223517B (en) Intelligent container index and search method and system
KR101389148B1 (en) Suggesting and refining user input based on original user input
CN102567140B (en) Use change journal file system backup
US8244530B2 (en) Efficient indexing of documents with similar content
US8140495B2 (en) Asynchronous database index maintenance
CN100498740C (en) Data cache processing method, system and data cache device
US20040205044A1 (en) Method for storing inverted index, method for on-line updating the same and inverted index mechanism
US20100070515A1 (en) Shared namespace for storage clusters
US8495036B2 (en) Blob manipulation in an integrated structured storage system
CN102222085B (en) Data de-duplication method based on combination of similarity and locality
JP4522170B2 (en) Relational database indexes additional program, the index adding unit and index adding method
JP2006164211A (en) System, method and program for storage management
US8417746B1 (en) File system management with enhanced searchability
CN100405371C (en) Method and system for abstracting new word
US8620884B2 (en) Scalable blob storage integrated with scalable structured storage

Legal Events

Date Code Title Description
C06 Publication
C10 Request of examination as to substance
C14 Granted