CN103020299B - Method and storage device for storing an inverted index and its appended data in full-text search - Google Patents

Method and storage device for storing an inverted index and its appended data in full-text search

Info

Publication number
CN103020299B
Authority
CN
China
Prior art keywords
indexing units
units data
data
tree
stored
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210591989.9A
Other languages
Chinese (zh)
Other versions
CN103020299A (en)
Inventor
范振勇
吴震
张学
崔维力
武新
赵伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin NanKai University General Data Technologies Co., Ltd.
National Computer Network and Information Security Management Center
Original Assignee
TIANJIN NANKAI UNIVERSITY GENERAL DATA TECHNOLOGIES Co Ltd
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2012-12-29
Filing date
2012-12-29
Publication date
2016-01-13

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for efficiently storing an inverted index in a full-text retrieval system, comprising: checking whether the length of the index unit data is greater than a threshold K; if the index unit data are longer than n*K and shorter than (n+1)*K (n being a natural number), storing the part from the beginning up to n*K in index unit data blocks and storing the remaining index unit data in a B-tree; if the length of the index unit data equals n*K, storing the part from the beginning up to n*K in index unit data blocks; if the index unit data are shorter than K, storing all of the index unit data in the B-tree. The beneficial effects of the invention are that it effectively improves the storage efficiency of the inverted entries of a full-text index, improves data read speed, and makes it convenient to implement a copy-on-write (Copy On Write) mechanism, which in turn improves data safety and the concurrency of data reads.

Description

Method and storage device for storing an inverted index and its appended data in full-text search
Technical field
The invention belongs to the field of data storage, and in particular relates to a method and a storage device for storing an inverted index and its appended data in full-text search.
Background art
In relational database systems, full-text indexing is one of the most efficient ways to search document data. In today's network environment, both the volume of information and the number of users are growing explosively, and full-text indexing has become one of the main tools of information retrieval systems. The inverted index is the core of a full-text retrieval system, and its storage structure has a large effect on the performance of that system.
An inverted index (English: inverted index), also often called a reverse index, postings file, or inverted file, is an indexing method used in full-text search to store a mapping from a word to its locations in a document or a set of documents. It is the most commonly used data structure in document retrieval systems. With an inverted index, the list of documents containing a given word can be obtained quickly from the word.
The inverted entry data of a full-text index consist of a TermID and the corresponding set of document numbers together with the offsets within each document, expressed as TermID --> {<docID, {offset}>}. A TermID identifies the smallest index unit produced by the tokenizer; in Chinese full-text retrieval this is generally a character, a word, an English word, a numeric string, or a combination of several of these. Such data have the following characteristics:
1. The lengths of the inverted data for different Terms differ greatly. Common characters such as " " and " " occur with very high frequency, while some special strings occur only once.
2. The vocabulary of a full-text index is huge. A single full-text search library often contains tens of millions of terms, and each term, as a retrieval unit, can occupy a large amount of storage space.
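As an illustration of the TermID --> {<docID, {offset}>} layout and of the skew in posting lengths described above, here is a minimal Python sketch. It is not taken from the patent: the function name `build_inverted_index`, the whitespace tokenizer, the use of the term string itself in place of a numeric TermID, and the use of token positions as offsets are all assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: [offsets]}.

    Assumptions for this sketch: terms are split on whitespace, the term
    string stands in for a numeric TermID, and an offset is a token position.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for offset, term in enumerate(text.split()):
            index[term].setdefault(doc_id, []).append(offset)
    return dict(index)


docs = {1: "the cat sat on the mat", 2: "the dog"}
index = build_inverted_index(docs)

# Number of postings per term: a frequent term has a much longer entry
# than a term that occurs once.
posting_sizes = {
    term: sum(len(offsets) for offsets in postings.values())
    for term, postings in index.items()
}
```

Even in this tiny corpus the entry for a frequent term is several times longer than the entry for a rare one, which is the skew the storage scheme has to cope with.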
The inverted data of a full-text index are generally stored in one of two ways. First, empirical data are used to divide a known dictionary into several grades by word frequency, such as high, medium, and low, and each grade uses a different data block size: the blocks that store high-frequency words are larger, and the blocks for the lowest-frequency words are the smallest. The benefit of this approach is that little disk space is wasted and read efficiency is assured. The drawback is that once the real data deviate from the empirical data, or new vocabulary appears and word frequencies change substantially, space is wasted on high-frequency words and the overly long block chains of low-frequency words make reads inefficient. Second, the data are divided into large data units, for example one unit per 1 GB of data. Within each unit, all inverted entry data are sorted by term number, document number, and offset (TermID, DocID, Offset), and before indexing completes all data units are merged into one final unit. The advantage of this approach is that no space is wasted. The drawback is that the final sort needs as much free disk space again as the data occupy, so disk space utilization is only 50%, and the merge takes a long time.
Summary of the invention
The problem to be solved by the invention is to provide a method for efficiently storing an index in a full-text retrieval system that makes full use of disk space.
To solve the above technical problem, the technical solution adopted by the invention is a method for efficiently storing an index in a full-text retrieval system, comprising:
1) comparing the length of the index unit data with a preset threshold K;
2) if the length of the index unit data is less than K, storing all of the index unit data in a B-tree;
3) if the length of the index unit data equals K, storing the index unit data, from the beginning up to K, in an index unit data block;
4) if the length is greater than K, comparing the length of the index unit data with n*K (n = 2, 3, ...) and storing the data as follows:
a. if the length of the index unit data is greater than (n-1)*K and less than n*K, storing the part from the beginning up to (n-1)*K in index unit data blocks and storing the remainder in the B-tree;
b. if the length of the index unit data equals n*K, storing all of the index unit data, in order, in index unit data blocks.
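The case analysis in steps 1) to 4) reduces to integer division by K: the largest multiple of K that fits goes to data blocks, and whatever is left over goes to the B-tree. The following Python sketch shows this under that reading; the function name `split_length` and the choice of bytes as the length unit are assumptions for the example, not part of the patent.

```python
def split_length(length, k):
    """Return (block_part, btree_part) for index unit data of `length` bytes.

    block_part is the largest multiple of k not exceeding length and is
    stored in fixed-size data blocks; btree_part (< k) is stored in the B-tree.
    """
    if length < 0 or k <= 0:
        raise ValueError("length must be >= 0 and k must be > 0")
    block_part = (length // k) * k
    return block_part, length - block_part


# One call per case in the method, with a hypothetical K of 64 bytes.
case_less_than_k = split_length(10, 64)     # step 2: everything in the B-tree
case_equal_k = split_length(64, 64)         # step 3: exactly one block
case_between = split_length(150, 64)        # step 4a: two blocks plus a remainder
case_exact_multiple = split_length(128, 64) # step 4b: blocks only
```

Because the remainder is always shorter than K, a B-tree entry never has to hold more than one block's worth of data for any term.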
Further, the B-tree can store at least two pieces of index unit data, each being either index unit data shorter than K or the data that remain after the part of longer index unit data has been stored in blocks.
Further, the B-tree is a variant of the B+ tree, namely a B+ tree from which the pointers to sibling nodes stored in the leaf nodes have been removed.
According to another aspect of the invention, a storage device for the efficient index storage method is also provided, comprising:
a data block storage unit, used to store the part of the index unit data that is an integer multiple of K;
a B-tree storage page unit, used to store the remainder of the index unit data that is shorter than K.
Further, the B-tree is a variant of the B+ tree, namely a B+ tree from which the pointers to sibling nodes stored in the leaf nodes have been removed.
A method for storing appended full-text index data using the storage device provided by the invention comprises:
1) calculating the sum of the length of the index unit data in the original B-tree and the length of the appended index unit data;
2) if the sum of the index unit data lengths is greater than n*K and less than (n+1)*K (n = 2, 3, ...), taking out the part stored in the B-tree, storing the index unit data, starting from the part taken out of the B-tree and up to n*K, in index unit data blocks, and storing the remaining index unit data in the B-tree;
3) if the sum of the index unit data lengths equals n*K, taking out the part stored in the B-tree and storing the index unit data, starting from the part taken out of the B-tree and up to n*K, in index unit data blocks;
4) if the sum of the index unit data lengths is less than K, storing the appended part, in order, in the original B-tree storage area.
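The append procedure can be sketched as follows. This is a simplified illustration rather than the patented implementation: the function name `append_postings`, the use of a Python list for the term's full blocks, and the use of a `bytes` object for its B-tree fragment are assumptions. The sketch also treats every combined length of at least K the same way, including lengths between K and 2K.

```python
def append_postings(blocks, fragment, extra, k):
    """Append `extra` bytes to one term's inverted data.

    blocks   -- list of full k-byte blocks already stored for the term
    fragment -- bytes currently kept in the B-tree for the term (len < k)
    Returns the new (blocks, fragment) pair without mutating the inputs.
    """
    combined = fragment + extra              # step 1: B-tree part plus appended data
    full = (len(combined) // k) * k          # largest multiple of k that fits
    new_blocks = blocks + [combined[i:i + k] for i in range(0, full, k)]
    return new_blocks, combined[full:]       # remainder goes back to the B-tree


# Hypothetical K of 4 bytes.
blocks, fragment = append_postings([], b"abc", b"de", 4)           # spills one block
blocks2, fragment2 = append_postings(blocks, fragment, b"fgh", 4)  # exact multiple
blocks3, fragment3 = append_postings([], b"a", b"b", 4)            # still below K
```

When the combined length is an exact multiple of K the returned fragment is empty, which corresponds to the case where nothing needs to stay in the B-tree for that term.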
With the above technical solution, the storage efficiency of the inverted entries of a full-text index is effectively improved, waste of disk space is reduced, and data read speed is effectively improved. A copy-on-write (CopyOnWrite) mechanism can be implemented conveniently, which in turn improves data safety and the concurrency of data reads.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the method of the invention for efficiently storing an index and its appended data.
Fig. 2 is a schematic diagram of storage in one example of the invention.
Fig. 3 is a schematic diagram of the copy-on-write mechanism in an example of the invention.
Detailed description of the embodiments
In one usage scenario of the invention, the inverted entry data of a full-text index are built from the full text and are represented as TermID (term number) --> {<docID (document number), {offset} (offset)>}. In this example, Key-Value data are stored using a B-tree combined with a block file. The Key is the term number, and the Value is the inverted entry data, which is variable-length. While the index is being created, data are continually appended to the Value. When its length exceeds a preset threshold n*K (n = 1, 2, 3, ...), the first n*K of the Value, taken from the beginning, is stored in block units, and the remaining index data, which are shorter than K, are kept in the B-tree. Block units hold the inverted entry data of the corresponding TermID as fixed-length data, and each data block forms a chain with the corresponding key-value part in the B-tree. The block file thus holds the whole-block part of each inverted entry, and the B-tree holds its fragment.
The threshold K is set with reference to the performance of the storage that holds the inverted entry data. It is generally set to a minimum unit of computer storage, such as 32k or 64k, which makes it convenient for a disk array to store the data.
As shown in Fig. 2, suppose the full text contains three minimal index units delimited by the tokenizer, numbered TermID (1), TermID (2), and TermID (3). TermID (1) is a string that occurs only a few times in the text. When its inverted entry index is stored, as shown at reference numeral 21 in Fig. 2, the data length of the inverted entry index of TermID (1) is less than K, i.e. TermID (1)-value < K, and its index data are stored in a B-tree storage page.
TermID (2) is another string, which occurs more often in the text. When its inverted entry index is stored, as shown at reference numeral 22 in Fig. 2, the data length of the inverted entry index of TermID (2) equals an integer multiple of the preset threshold K, i.e. TermID (2)-value = n*K (n = 1, 2, 3, ...), and its index data are stored in storage block units.
TermID (3) is another string in the full-text index, which occurs frequently in the text. When its inverted entry index is stored, as shown at reference numeral 23 in Fig. 2, the data length of the inverted entry index of TermID (3) lies between two multiples of K, i.e. (n+1)*K > TermID (3)-value > n*K. The first n*K of the Value, taken from the beginning, is stored in storage block units and the remainder is stored in the B-tree; the data blocks form a chain with the corresponding key-value part in the B-tree.
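The three cases of Fig. 2 can be reproduced with a small in-memory model. This is a sketch under stated assumptions, not the patent's implementation: the class name `HybridStore` is hypothetical, a Python dict stands in for the B-tree, a list stands in for the block file, and the "chain" between blocks and the B-tree entry is represented as a list of block numbers kept in that entry.

```python
class HybridStore:
    """Toy model of the B-tree plus block file layout (assumed structure)."""

    def __init__(self, k):
        self.k = k
        self.block_file = []  # fixed-length blocks, each exactly k bytes
        self.btree = {}       # dict as B-tree stand-in: term_id -> (block_ids, fragment)

    def put(self, term_id, value):
        """Store the first n*k bytes in blocks and the remainder in the B-tree."""
        full = (len(value) // self.k) * self.k
        block_ids = []
        for start in range(0, full, self.k):
            self.block_file.append(value[start:start + self.k])
            block_ids.append(len(self.block_file) - 1)
        self.btree[term_id] = (block_ids, value[full:])

    def get(self, term_id):
        """Follow the chain of blocks, then add the B-tree fragment."""
        block_ids, fragment = self.btree[term_id]
        return b"".join(self.block_file[b] for b in block_ids) + fragment


store = HybridStore(k=8)        # hypothetical K of 8 bytes
store.put(1, b"abc")            # TermID (1): shorter than K, B-tree only
store.put(2, b"x" * 16)         # TermID (2): exactly 2*K, blocks only
store.put(3, b"y" * 19)         # TermID (3): 2*K in blocks, 3 bytes in the B-tree
```

Every block in `block_file` has the same length, so only the short fragments in the B-tree are variable-length.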
In practice, data are often appended to an inverted entry, and the storage procedure again depends on the resulting data length. Take TermID (1) from the example above, and suppose a small amount of data is appended to it. The length of the index data for TermID (1) in the B-tree storage page is added to the length of the appended data. If the sum satisfies TermID (1)-value = n*K, the index data are taken out of the original B-tree storage page, the index data from the B-tree storage page and the appended data are stored in order in data blocks, and the entry whose key is this TermID is deleted from the B-tree. If the sum satisfies (n+1)*K > TermID (1)-value > n*K (n = 1, 2, 3, ...), the index data are taken out of the original B-tree storage page, the combined data (the original index data followed by the appended data) are stored in order in storage blocks from the beginning up to n*K, and the remaining appended data are written back under the same TermID in the B-tree. If the sum of the appended index unit data length and the length already stored in the B-tree is still less than K, the appended data are stored in order in the B-tree storage page.
Because of the long-tail effect in large data sets, most terms in the inverted entries of a full-text search have very little data, not enough to fill one block file unit. Since B-tree pages hold Key-Value pairs with variable-length Values, the fragmentary inverted data of several Keys can be kept in a single page. The default disk utilization of a B-tree is 75%, so the total disk utilization is (B-tree file size × 75% + block file size) / (B-tree file size + block file size). When the block file is large, the total disk utilization approaches 100% (99.5% on a data set of 10 million news-type articles).
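The utilization formula can be evaluated directly. The Python sketch below only restates the formula from the text; the function name `total_disk_utilization` and the file sizes used in the example are hypothetical and are not measurements from the patent.

```python
def total_disk_utilization(btree_bytes, block_bytes, btree_fill=0.75):
    """(B-tree size * fill + block file size) / (B-tree size + block file size).

    btree_fill defaults to the 75% B-tree utilization stated in the text;
    the block file is assumed to be fully used.
    """
    total = btree_bytes + block_bytes
    if total <= 0:
        raise ValueError("total size must be positive")
    return (btree_bytes * btree_fill + block_bytes) / total


# Hypothetical split: 2 GB of B-tree pages and 98 GB of block file.
example = total_disk_utilization(2, 98)
only_btree = total_disk_utilization(100, 0)
only_blocks = total_disk_utilization(0, 100)
```

With these illustrative sizes the result is about 0.995, which shows how a utilization close to 100% arises once the block file dominates the total size.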
Both the pages of the B-tree index and the block units of the block file are fixed-length, which makes direct addressing and the use of caching algorithms convenient; the B-tree page size is 4 × the block unit size. In a real project, the memory available for the cache, disk read/write efficiency, and other factors can be weighed to decide the B-tree page size and the size of each block unit of the block file. The block file can be reorganized during system idle time so that the inverted data of the same Term are placed in contiguous disk space.
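Fixed-length units make direct addressing a single multiplication. The following Python sketch illustrates this with an in-memory file; the function name `read_block`, the constant name, and the 4-byte block size are assumptions for the example, and only the 4:1 ratio between page size and block size comes from the text.

```python
import io

PAGE_TO_BLOCK_RATIO = 4  # from the text: B-tree page size = 4 x block unit size


def read_block(block_file, block_id, block_size):
    """Read one fixed-length block by seeking to block_id * block_size."""
    block_file.seek(block_id * block_size)
    return block_file.read(block_size)


block_size = 4                                # hypothetical, far smaller than a real K
page_size = PAGE_TO_BLOCK_RATIO * block_size  # 16 bytes in this toy setting
blocks = io.BytesIO(b"aaaabbbbcccc")          # three blocks stored back to back
second = read_block(blocks, 1, block_size)
```

No per-block length or directory lookup is needed to find a block, which is what makes random reads and block caching straightforward.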
For a term whose inverted data are smaller than one block unit, the cost of each read is exactly the cost of reading the B-tree, and the number of disk reads equals the B-tree depth L. For a term whose inverted data exceed one block unit, the cost of each complete read is one B-tree read plus the block data read. Considering that logical operations such as AND and OR are common in full-text indexing, inverted data are generally read at random, so the fixed-length property of every block in the block file can be fully exploited, which effectively improves read efficiency.
When a full-text index is built in real time, inverted data need to be appended. Because a B-tree is a hierarchical structure, a copy-on-write mechanism is easy to implement. Before a page is modified, the B-tree first makes a copy of it, and also copies every storage page that references this page directly or indirectly. All modifications are made in the copies, which yields two versions, one before the modification and one after it. Read operations run on the pre-modification version and write operations run on the post-modification version; the two do not affect each other, which improves concurrency. Block storage works the same way: data are only appended at the end of the file, so reads and writes are likewise separated and do not affect each other, which improves concurrent access. When the data are committed successfully, the write version becomes the read version; if the commit fails, the write version is discarded, which guarantees data integrity. As shown in Fig. 3, keeping separate read and write versions supports real-time appending and improves concurrency when searching while the index is being built. To make COW convenient to implement, the B-tree uses a variant of the B+ tree that differs from the B+ tree in that the pointers to sibling nodes stored in leaf nodes are removed; otherwise a modification could require copying the whole B-tree.
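The copy-on-write behaviour can be illustrated with path copying on a tree whose leaves hold no sibling pointers. This Python sketch is a deliberately reduced model, not the patent's B-tree: the class names `Leaf` and `Internal` and the functions `cow_put` and `lookup` are hypothetical, and page splits, commits, and disk I/O are left out.

```python
import bisect


class Leaf:
    def __init__(self, items):
        self.items = dict(items)             # key -> value; no sibling pointer


class Internal:
    def __init__(self, separators, children):
        self.separators = list(separators)   # children[i] holds keys < separators[i]
        self.children = list(children)


def cow_put(node, key, value):
    """Return a new root with key set to value; the old tree is left untouched."""
    if isinstance(node, Leaf):
        copy = Leaf(node.items)
        copy.items[key] = value
        return copy
    i = bisect.bisect_right(node.separators, key)
    copy = Internal(node.separators, node.children)   # copy only this page
    copy.children[i] = cow_put(node.children[i], key, value)
    return copy


def lookup(node, key):
    while isinstance(node, Internal):
        node = node.children[bisect.bisect_right(node.separators, key)]
    return node.items.get(key)


left = Leaf({1: b"a", 2: b"b"})
right = Leaf({5: b"e"})
read_root = Internal([5], [left, right])      # version used by readers
write_root = cow_put(read_root, 5, b"E")      # version used by the writer
```

Only the modified leaf and its ancestors are copied; the untouched leaf is shared by both versions. If leaves stored pointers to their siblings, the left leaf would still point at the old right leaf and would have to be copied as well, and so on along the leaf level, which is the whole-tree copy the text warns about.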
One embodiment of the invention has been described in detail above, but it is only a preferred embodiment and shall not be regarded as limiting the scope of the invention. All equivalent changes and improvements made within the scope of this application shall remain within the scope covered by this patent.

Claims (6)

1. A method for efficiently storing an inverted index in a full-text retrieval system, comprising:
1) comparing the length of the index unit data with a preset threshold K;
2) if the length of the index unit data is less than K, storing all of the index unit data in a B-tree;
3) if the length of the index unit data equals K, storing the index unit data, from the beginning up to K, in an index unit data block;
4) if the length is greater than K, comparing the length of the index unit data with n*K (n = 2, 3, ...) and storing the data as follows:
a. if the length of the index unit data is greater than (n-1)*K and less than n*K, storing the part from the beginning up to (n-1)*K in index unit data blocks and storing the remainder in the B-tree;
b. if the length of the index unit data equals n*K, storing all of the index unit data, in order, in index unit data blocks.
2. The method for efficiently storing an inverted index in a full-text retrieval system according to claim 1, characterized in that: the B-tree is a variant of the B+ tree, namely a B+ tree from which the pointers to sibling nodes stored in the leaf nodes have been removed.
3. A storage device for the method for efficiently storing an inverted index in a full-text retrieval system according to claim 1, comprising:
a data block storage unit, used to store the part of the index unit data that is an integer multiple of K;
a B-tree storage page unit, used to store the remainder of the index unit data that is shorter than K.
4. The storage device according to claim 3, characterized in that: the B-tree storage page unit can be used to store at least two pieces of index unit data, each being either index unit data shorter than K or the data that remain after the part of longer index unit data has been stored in blocks.
5. The storage device according to claim 4, characterized in that: the B-tree is a variant of the B+ tree, namely a B+ tree from which the pointers to sibling nodes stored in the leaf nodes have been removed.
6. A method for storing appended data using the storage device according to claim 4, comprising:
1) calculating the sum of the length of the index unit data in the original B-tree and the length of the appended index unit data;
2) if the sum of the index unit data lengths is greater than n*K and less than (n+1)*K (n = 2, 3, ...), taking out the part stored in the B-tree, storing the index unit data, starting from the part taken out of the B-tree and up to n*K, in index unit data blocks, and storing the remaining index unit data in the B-tree;
3) if the sum of the index unit data lengths equals n*K, taking out the part stored in the B-tree and storing the index unit data, starting from the part taken out of the B-tree and up to n*K, in index unit data blocks;
4) if the sum of the index unit data lengths is less than K, storing the appended part, in order, in the original B-tree storage area.
CN201210591989.9A 2012-12-29 2012-12-29 Method and storage device for storing an inverted index and its appended data in full-text search Active CN103020299B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210591989.9A CN103020299B (en) 2012-12-29 2012-12-29 Method and storage device for storing an inverted index and its appended data in full-text search

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210591989.9A CN103020299B (en) 2012-12-29 2012-12-29 Method and storage device for storing an inverted index and its appended data in full-text search

Publications (2)

Publication Number Publication Date
CN103020299A CN103020299A (en) 2013-04-03
CN103020299B true CN103020299B (en) 2016-01-13

Family

ID=47968902

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210591989.9A Active CN103020299B (en) 2012-12-29 2012-12-29 Method and storage device for storing an inverted index and its appended data in full-text search

Country Status (1)

Country Link
CN (1) CN103020299B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104679778B (en) * 2013-11-29 2019-03-26 腾讯科技(深圳)有限公司 A kind of generation method and device of search result
CN106227677B (en) * 2016-07-20 2018-11-20 浪潮电子信息产业股份有限公司 A kind of method of elongated cache metadata management
CN106776746A (en) * 2016-11-14 2017-05-31 天津南大通用数据技术股份有限公司 A kind of creation method and device of full-text index data
CN107491523B (en) * 2017-08-17 2020-05-05 三星(中国)半导体有限公司 Method and device for storing data object

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1536509A (en) * 2003-04-11 2004-10-13 �Ҵ���˾ Inverted index storage method, inverted index mechanism and on-line updating method
CN101226553A (en) * 2008-02-03 2008-07-23 中兴通讯股份有限公司 Method and device for storing length-various field of embedded database
CN101944108A (en) * 2010-09-07 2011-01-12 深圳市彩讯科技有限公司 Index file and establishing method thereof
CN102682086A (en) * 2012-04-23 2012-09-19 华为技术有限公司 Data segmentation method and data segmentation equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4714127B2 (en) * 2006-11-27 2011-06-29 株式会社日立製作所 Symbol string search method, program and apparatus, and trie generation method, program and apparatus
US7725437B2 (en) * 2007-07-31 2010-05-25 Hewlett-Packard Development Company, L.P. Providing an index for a data store

Also Published As

Publication number Publication date
CN103020299A (en) 2013-04-03

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
ASS Succession or assignment of patent right

Owner name: TIANJIN NANDA CONVENTIONAL DATA TECHNOLOGY CO., LT

Effective date: 20130807

Owner name: STATE COMPUTER NETWORK AND INFORMATION SAFETY MANA

Free format text: FORMER OWNER: TIANJIN NANDA GENERAL DATA TECHNOLOGY CO., LTD.

Effective date: 20130807

C41 Transfer of patent application or patent right or utility model
C53 Correction of patent for invention or patent application
CB03 Change of inventor or designer information

Inventor after: Fan Zhenyong

Inventor after: Wu Zhen

Inventor after: Zhang Xue

Inventor after: Cui Weili

Inventor after: Wu Xin

Inventor after: Zhao Wei

Inventor before: Zhang Xue

Inventor before: Fan Zhenyong

Inventor before: Cui Weili

Inventor before: Wu Xin

Inventor before: Zhao Wei

COR Change of bibliographic data

Free format text: CORRECT: INVENTOR; FROM: ZHANG XUE FAN ZHENYONG CUI WEILI WU XIN ZHAO WEI TO: FAN ZHENYONG WU ZHEN ZHANG XUE CUI WEILI WU XIN ZHAO WEI

Free format text: CORRECT: ADDRESS; FROM: 300384 BINHAI NEW DISTRICT, TIANJIN TO: 100029 CHAOYANG, BEIJING

TA01 Transfer of patent application right

Effective date of registration: 20130807

Address after: 100029 Beijing city Chaoyang District Yumin Road No. 3

Applicant after: State Computer Network and Information Safety Management Center

Applicant after: Tianjin NanKai University General Data Technologies Co., Ltd.

Address before: Haitai 300384 in Tianjin Binhai high tech Zone Huayuan Industrial Zone Development six road No. 6 Haitai green industry base J

Applicant before: Tianjin Nanda General Data Technology Co., Ltd.

C14 Grant of patent or utility model
GR01 Patent grant