CN103020299A - Storage method and device for inverted indexes and appended data in full-text search - Google Patents

Storage method and device for inverted indexes and appended data in full-text search

Info

Publication number
CN103020299A
CN103020299A CN2012105919899A CN201210591989A CN103020299A CN 103020299 A CN103020299 A CN 103020299A CN 2012105919899 A CN2012105919899 A CN 2012105919899A CN 201210591989 A CN201210591989 A CN 201210591989A CN 103020299 A CN103020299 A CN 103020299A
Authority
CN
China
Prior art keywords
data
indexing units
tree
units data
indexing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012105919899A
Other languages
Chinese (zh)
Other versions
CN103020299B (en)
Inventor
张学
范振勇
崔维力
武新
赵伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin NanKai University General Data Technologies Co., Ltd.
National Computer Network and Information Security Management Center
Original Assignee
TIANJIN NANDA GENERAL DATA TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by TIANJIN NANDA GENERAL DATA TECHNOLOGY Co Ltd
Priority to CN201210591989.9A (patent CN103020299B)
Publication of CN103020299A
Application granted
Publication of CN103020299B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for efficiently storing inverted indexes in a full-text search system. The method compares the length of the index unit data with a preset threshold K. If the length is greater than n*K and less than (n+1)*K (n is a natural number), the index unit data from the beginning up to n*K is stored in index unit data blocks and the remaining index unit data is stored in a B-tree. If the length equals n*K, the index unit data from the beginning up to n*K is stored in index unit data blocks. If the length is less than K, all of the index unit data is stored in the B-tree. The method effectively improves the storage efficiency of the inverted file in full-text search, increases data read speed, makes a copy-on-write mechanism easy to implement, improves data safety, and improves the concurrency of indexing while data is being read.

Description

Storage method and storage device for inverted indexes and their appended data in full-text search
Technical field
The invention belongs to the field of data storage, and in particular relates to a storage method and a storage device for inverted indexes and their appended data in full-text search.
Background
In relational database systems, full-text indexing is one of the most efficient ways to search document data. In the current network environment, both the volume of information and the number of users are growing explosively, and full-text indexing has become one of the main means of information retrieval. The inverted index is the core of a full-text retrieval system, and its storage structure has a large effect on the performance of that system.
An inverted index, also called a reverse index, postings file, or inverted file, is an indexing method used in full-text search to store a mapping from a word to its locations in a document or a set of documents. It is the most commonly used data structure in document retrieval systems. With an inverted index, the list of documents containing a given word can be obtained quickly from that word.
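To make the mapping concrete, here is a minimal sketch of building such an index in memory. It is an illustration only: the function name `build_inverted_index`, the sample documents, and the use of whitespace splitting in place of a real tokenizer are all assumptions, and offsets are counted in tokens rather than bytes.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: [offsets]}.

    Whitespace splitting stands in for a real tokenizer (an assumption),
    and each offset is the token position inside the document.
    """
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for offset, term in enumerate(text.split()):
            index[term][doc_id].append(offset)
    # Convert to plain dicts so the result is easy to inspect and compare.
    return {term: dict(postings) for term, postings in index.items()}


docs = {1: "full text search", 2: "text index"}  # hypothetical sample data
index = build_inverted_index(docs)
print(index["text"])
```

Looking up `index["text"]` returns every document that contains the term together with the positions where it occurs, which is the lookup the paragraph above describes.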
The inverted entry data of a full-text index consists of a Term ID together with its corresponding set of document numbers and the offsets within each document, expressed as Term ID --> {<doc ID, {offset}>}. Here Term ID identifies the smallest index unit produced by the tokenizer; in Chinese full-text retrieval this is generally a character, a word, an English word, a numeric string, or a combination of these forms. This data has the following characteristics:
1. The lengths of the inverted data of different terms differ greatly. Common characters such as " " and " " occur very frequently, while some special strings occur only once.
2. The vocabulary of a full-text index is huge. A single full-text search library often holds around ten million terms, and treating each term as a retrieval unit takes up a large amount of storage space.
Inverted data in a full-text index is generally stored in one of two ways. In the first, a known dictionary is divided by empirical word-frequency data into grades such as high, medium, and low, and each grade is stored with a different data block size: blocks for high-frequency words are larger, and blocks for the lowest-frequency words are the smallest. The benefits are little wasted disk space and dependable read efficiency. The drawback is that when the actual data does not match the empirical data, or when new vocabulary appears and word frequencies change substantially, space is wasted on high-frequency words and the chains of low-frequency words become too long, which lowers read efficiency. In the second, the data is divided into large units, for example one unit per 1 GB of data. Within each unit all inverted entry data is sorted by term number, document number, and offset (Term ID, Doc ID, Offset), and before indexing completes all units are merged into one final unit. The advantage is that no space is wasted. The drawback is that the final sort needs free disk space equal to the size of the data, so disk space utilization is only 50%, and the merge takes a long time.
Summary of the invention
The problem to be solved by the invention is to provide a method for efficiently storing an index in a full-text retrieval system that substantially improves disk space utilization.
To solve this technical problem, the invention adopts the following technical solution: a method for efficiently storing an index in a full-text retrieval system, comprising:
1) comparing the length of the index unit data with a preset threshold K;
2) if the index unit data length is less than K, storing all of the index unit data in the B-tree;
3) if the index unit data length equals K, storing the index unit data from the beginning up to K in an index unit data block;
4) if it is greater than K, comparing the index unit data length with n*K (n = 2, 3, ...) and storing as follows:
  1. if the index unit data length is greater than (n-1)*K and less than n*K, storing the index unit data from the beginning up to (n-1)*K in index unit data blocks and storing the remainder in the B-tree;
  2. if the index unit data length equals n*K, storing all of the index unit data in order in index unit data blocks.
Further, the B-tree can store at least two items, each being either index unit data whose length is less than K or the data that remains after index unit data longer than K has been stored in the block device.
Further, the B-tree is a variant of the B+ tree, namely a B+ tree from which the sibling-node pointers kept in the leaf nodes have been removed.
According to another aspect, the invention also provides a storage device for the efficient index storage method, comprising:
a storage data block unit, for storing index unit data whose length is an integer multiple of a fixed length;
a B-tree storage page unit, for storing the remainder of the index unit data that is shorter than K.
Further, the B-tree is a variant of the B+ tree, namely a B+ tree from which the sibling-node pointers kept in the leaf nodes have been removed.
A method for storing appended full-text index data using the storage device provided by the invention:
1) calculating the sum of the index unit data in the original B-tree and the appended index unit data;
2) if the sum of the index unit data lengths is greater than n*K and less than (n+1)*K (n = 2, 3, ...), taking out the part stored in the B-tree, storing the index unit data from the part stored in the B-tree up to n*K in index unit data blocks, and storing the remaining index unit data in the B-tree;
3) if the sum of the index unit data equals n*K, taking out the part stored in the B-tree and storing the index unit data from the part stored in the B-tree up to n*K in index unit data blocks;
4) if the sum of the index unit data is less than K, storing the appended part in order in the original B-tree storage area.
With the above technical solution, the storage efficiency of the inverted entries of the full-text index is effectively improved, wasted disk space is reduced, and data read speed is effectively increased. A copy-on-write mechanism can be implemented conveniently, which in turn improves data safety and the concurrency of indexing and reading data.
Description of drawings
Fig. 1 is a schematic flowchart of the method of the invention for efficiently storing an index and its appended data.
Fig. 2 is a schematic diagram of storage in an example of the invention.
Fig. 3 is a schematic diagram of the copy-on-write mechanism in an example of the invention.
Embodiment
In one application scenario of the invention, full-text index inverted entry data is built from the full text, expressed as Term ID (term number) --> {<doc ID (document number), {offset}>}. In this example the key-value data is stored by combining a B-tree with a block file: the key is the term number and the value is the inverted entry data, which is variable-length. While the index is being created, data is continually appended to the value. When its length exceeds the preset threshold n*K (n = 1, 2, 3, ...), the leading n*K of the value is extracted and stored in block units, and the remaining index data, which is shorter than K, stays in the B-tree. The block units use fixed-length data to hold the inverted entry data of the corresponding Term ID, and each data block forms a chain with the corresponding key-value part in the B-tree. The block file thus holds the whole-block part of each inverted entry, and the B-tree holds its fragment.
The threshold K is set with reference to the performance of the device that stores the inverted entry data. It is generally a minimum unit of computer storage, such as 32k or 64k, which is convenient for storing data on a disk array.
As shown in Fig. 2, suppose the tokenizer delimits three minimum index units in the full text, numbered Term ID(1), Term ID(2), and Term ID(3). Term ID(1) is a string that occurs only a few times in the text. When its inverted entry index is stored, as shown at 21 in the figure, the data length of Term ID(1) does not exceed K, i.e. Term ID(1)-value < k, so its index data is stored in the B-tree storage page.
Term ID(2) is another string that occurs more often in the text. When its inverted entry index is stored, as shown at 22 in the figure, the data length of Term ID(2) equals a multiple of the preset threshold K, i.e. Term ID(2)-value = nk (n = 1, 2, 3, ...), so its index data is stored in storage block units.
Term ID(3) is another string in the full-text index that occurs frequently in the text. When its inverted entry index is stored, as shown at 23 in the figure, the data length of Term ID(3) exceeds the preset threshold K, i.e. (n+1)*k > Term ID(3)-value > n*k. The leading n*K of the value is extracted and stored in storage block units, the remainder is stored in the B-tree, and the data blocks form a chain with the corresponding key-value part in the B-tree.
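The three cases can be sketched as a single split of the value at the largest multiple of K. This is a minimal illustration under stated assumptions: the function name `split_value`, the byte-string representation of the value, and the toy threshold of 4 bytes are hypothetical and do not come from the patent text.

```python
def split_value(value: bytes, k: int):
    """Return (block_part, btree_part) for one term's inverted entry data.

    block_part holds the leading n*K bytes (n whole blocks);
    btree_part holds the remainder, which is always shorter than K.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    n = len(value) // k      # number of complete K-sized blocks
    cut = n * k
    return value[:cut], value[cut:]


K = 4  # toy threshold; the text suggests sizes such as 32k or 64k
samples = [("Term ID(1)", b"ab"), ("Term ID(2)", b"abcd"), ("Term ID(3)", b"abcdefghij")]
for name, value in samples:
    blocks, fragment = split_value(value, K)
    print(name, len(blocks) // K, "block(s),", len(fragment), "byte(s) in the B-tree")
```

With these inputs the first term stays entirely in the B-tree, the second fills exactly one block, and the third fills two blocks and leaves a two-byte fragment in the B-tree.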
In practice, inverted entry data is often appended, and the storage handling again depends on the resulting length. Take Term ID(1) from the example and suppose a small amount of data is appended to it. The length of the index data of Term ID(1) in the B-tree storage page is added to the length of the appended data. If after the addition Term ID(1)-value = nk, the index data in the original B-tree storage page is taken out, that index data and the appended data are stored in data blocks in order, and the B-tree entry whose key is that Term ID is deleted. If after the addition (n+1)K > Term ID(1)-value > nk (n = 1, 2, 3, ...), the index data in the original B-tree storage page is taken out, the part running from the start of that index data through the appended data up to nk is stored in storage blocks in order, and the remaining appended data is written back under the same Term ID in the B-tree. If the sum of the appended index unit data and the index unit data already stored in the B-tree is still less than K, the appended data is stored in the B-tree storage page in order.
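The append cases above can be sketched with a toy in-memory model. Everything here is an assumption made for illustration: the class name `TermStore`, the use of a dictionary in place of a real B-tree, and a list of byte strings in place of a block file.

```python
class TermStore:
    """Toy model: `blocks` stands in for the block file, `btree` for the B-tree."""

    def __init__(self, k: int):
        self.k = k
        self.blocks = {}  # term_id -> list of K-byte blocks
        self.btree = {}   # term_id -> fragment shorter than K

    def append(self, term_id: int, data: bytes) -> None:
        # Combine the fragment already in the B-tree with the appended data.
        pending = self.btree.pop(term_id, b"") + data
        # Move every complete K-byte run to the block store.
        full = len(pending) // self.k * self.k
        for start in range(0, full, self.k):
            self.blocks.setdefault(term_id, []).append(pending[start:start + self.k])
        # Whatever is shorter than K goes back under the same key in the B-tree.
        if pending[full:]:
            self.btree[term_id] = pending[full:]

    def read(self, term_id: int) -> bytes:
        # Follow the chain: block data first, then the B-tree fragment.
        return b"".join(self.blocks.get(term_id, [])) + self.btree.get(term_id, b"")


store = TermStore(k=4)
store.append(1, b"ab")     # sum < K: stays in the B-tree
store.append(1, b"cd")     # sum == K: moved to a block, B-tree entry removed
store.append(1, b"efghi")  # K < sum < 2K: one more block, one byte left over
```

Because the fragment is popped before the new data is added, the B-tree entry disappears on its own whenever the sum lands exactly on a multiple of K.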
By the long-tail effect of big data, most terms in the inverted entries of a full-text search have very little data, not enough to fill one block-file unit. Because B-tree pages hold variable-length key-value pairs, a single page can hold the fragments of the inverted entries of several keys. With the default disk utilization of the B-tree at 75%, total disk utilization is (B-tree file size × 75% + block file size) / (B-tree file size + block file size). When the block file is large, total disk utilization approaches 100% (99.5% in a test data set of 10 million news documents).
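The formula can be checked with a short calculation. The function name `total_utilization` and the file sizes below are hypothetical; the 1:49 ratio is chosen only because it yields a figure of the same magnitude as the one quoted, not because it is the ratio of the test data set.

```python
def total_utilization(btree_size: float, block_size: float, btree_fill: float = 0.75) -> float:
    """Total disk utilization for a B-tree file plus a fully packed block file."""
    total = btree_size + block_size
    if total <= 0:
        raise ValueError("combined file size must be positive")
    return (btree_size * btree_fill + block_size) / total


# Hypothetical sizes in arbitrary units: a small B-tree next to a large block file.
print(total_utilization(btree_size=1, block_size=49))
# With equal sizes the B-tree's unused quarter weighs much more heavily.
print(total_utilization(btree_size=1, block_size=1))
```

The larger the block file is relative to the B-tree file, the less the B-tree's unused page space matters to the total.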
Both the pages of the B-tree index and the block units of the block file are fixed-length, which makes direct addressing and cache algorithms convenient; here the page size of the B-tree = 4 × the block unit size. In a real project, the B-tree page size and the size of each block unit of the block file can be settled after weighing the memory available for caching, disk read/write efficiency, and other factors. The block file can be reorganized during system idle time so that the inverted data of the same term is placed in contiguous disk space.
For a term whose inverted data is smaller than one block unit, the cost of each read is just the cost of reading the B-tree, and the number of disk reads equals the B-tree depth L. For a term whose inverted data exceeds one block unit, the cost of each complete read is one read of the B-tree plus the reads of the block data. Given the AND, OR, and similar logical operations commonly used in full-text indexes, inverted data is generally read at random, so the fixed-length property of each block in the block file can be fully exploited, which ultimately improves data read efficiency.
When a full-text index is built in real time, inverted data needs to be appended. Because the B-tree is inherently a layered tree structure, a copy-on-write mechanism is easy to implement. Before a page of the B-tree is modified, a copy of it is made first, and a copy is also made of every storage page that directly or indirectly references that page; all modifications are applied to the copies. This yields two versions, one from before the modification and one from after it. Reads go to the earlier version and writes go to the later version, so the two do not interfere with each other and concurrency improves. The same holds for block storage: data is only appended at the end of the file, so reads and writes can likewise be separated during access without interfering, which improves concurrency. When the data is committed successfully, the write version becomes the read version; if the commit fails, the write version is discarded, which guarantees data integrity. As shown in Fig. 3, keeping separate read and write versions allows data to be appended in real time and improves the concurrency of searching while the index is being built. To make the COW mechanism convenient to implement, the B-tree is a variant of the B+ tree that differs from it in that the sibling-node pointers kept in the leaf nodes are removed; otherwise a modification could require copying the entire B-tree.
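A minimal sketch of the path-copying idea follows, assuming immutable pages without sibling pointers. The names `Node` and `cow_put` are hypothetical, and the sketch is deliberately simplified: it never splits a full page and keeps everything in memory, so it shows only which pages get copied, not a complete B-tree.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    keys: tuple           # separator keys (internal page) or entry keys (leaf page)
    children: tuple = ()  # child pages; empty for a leaf
    values: tuple = ()    # stored values; used only by leaves


def cow_put(node: Node, key, value) -> Node:
    """Return a new root containing key -> value; pages off the path are shared."""
    if not node.children:
        # Leaf: build a copy with the entry inserted or replaced.
        entries = dict(zip(node.keys, node.values))
        entries[key] = value
        ordered = tuple(sorted(entries))
        return Node(keys=ordered, values=tuple(entries[k] for k in ordered))
    i = bisect_right(node.keys, key)  # index of the child covering `key`
    new_child = cow_put(node.children[i], key, value)
    # Copy this page too, because it references the page that changed.
    children = node.children[:i] + (new_child,) + node.children[i + 1:]
    return Node(keys=node.keys, children=children)


left = Node(keys=(1, 2), values=(b"a", b"b"))
right = Node(keys=(5, 7), values=(b"e", b"g"))
read_root = Node(keys=(5,), children=(left, right))  # version used by readers
write_root = cow_put(read_root, 6, b"f")             # version used by the writer
```

Only the root and the right leaf are copied; `left` is shared by both versions. If leaves carried sibling pointers, `left` would have to be copied as well so that it could point at the new right leaf, and that effect would spread along the whole leaf level, which matches the reason the text gives for removing those pointers.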
An embodiment of the invention has been described in detail above, but it is only a preferred embodiment and shall not be taken to limit the scope of the invention. All equivalent changes and improvements made within the scope of this application shall remain within the coverage of this patent.

Claims (6)

1. A method for efficiently storing an inverted index in a full-text retrieval system, comprising:
1) comparing the length of the index unit data with a preset threshold K;
2) if the index unit data length is less than K, storing all of the index unit data in the B-tree;
3) if the index unit data length equals K, storing the index unit data from the beginning up to K in an index unit data block;
4) if it is greater than K, comparing the index unit data length with n*K (n = 2, 3, ...) and storing as follows:
  1. if the index unit data length is greater than (n-1)*K and less than n*K, storing the index unit data from the beginning up to (n-1)*K in index unit data blocks and storing the remainder in the B-tree;
  2. if the index unit data length equals n*K, storing all of the index unit data in order in index unit data blocks.
2. The method for efficiently storing an index according to claim 1, characterized in that the B-tree is a variant of the B+ tree, namely a B+ tree from which the sibling-node pointers kept in the leaf nodes have been removed.
3. A storage device for the method for efficiently storing an index according to claim 1, comprising:
a storage data block unit, for storing index unit data whose length is an integer multiple of a fixed length;
a B-tree storage page unit, for storing the remainder of the index unit data that is shorter than K.
4. The storage device according to claim 3, characterized in that the B-tree storage page unit can store at least two items, each being either index unit data whose length is less than K or the data that remains after index unit data longer than K has been stored in the block device.
5. The storage device according to claim 4, characterized in that the B-tree is a variant of the B+ tree, namely a B+ tree from which the sibling-node pointers kept in the leaf nodes have been removed.
6. A method for storing appended data in the storage device for the efficient index storage method as claimed in claim 4, comprising:
1) calculating the sum of the index unit data length in the original B-tree and the appended index unit data length;
2) if the sum of the index unit data lengths is greater than n*K and less than (n+1)*K (n = 2, 3, ...), taking out the part stored in the B-tree, storing the index unit data from the part stored in the B-tree up to n*K in index unit data blocks, and storing the remaining index unit data in the B-tree;
3) if the sum of the index unit data equals n*K, taking out the part stored in the B-tree and storing the index unit data from the part stored in the B-tree up to n*K in index unit data blocks;
4) if the sum of the index unit data is less than K, storing the appended part in order in the original B-tree storage area.
CN201210591989.9A 2012-12-29 2012-12-29 Storage method and storage device for inverted indexes and their appended data in full-text search Active CN103020299B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210591989.9A CN103020299B (en) 2012-12-29 2012-12-29 Storage method and storage device for inverted indexes and their appended data in full-text search

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210591989.9A CN103020299B (en) 2012-12-29 2012-12-29 Storage method and storage device for inverted indexes and their appended data in full-text search

Publications (2)

Publication Number Publication Date
CN103020299A true CN103020299A (en) 2013-04-03
CN103020299B CN103020299B (en) 2016-01-13

Family

ID=47968902

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210591989.9A Active CN103020299B (en) 2012-12-29 2012-12-29 Storage method and storage device for inverted indexes and their appended data in full-text search

Country Status (1)

Country Link
CN (1) CN103020299B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015078273A1 (en) * 2013-11-29 2015-06-04 Tencent Technology (Shenzhen) Company Limited Method and apparatus for search
CN106227677A (en) * 2016-07-20 2016-12-14 浪潮电子信息产业股份有限公司 Method for managing variable-length cache metadata
CN106776746A (en) * 2016-11-14 2017-05-31 天津南大通用数据技术股份有限公司 A kind of creation method and device of full-text index data
CN107491523A (en) * 2017-08-17 2017-12-19 三星(中国)半导体有限公司 The method and device of data storage object

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1536509A (en) * 2003-04-11 2004-10-13 �Ҵ���˾ Inverted index storage method, inverted index mechanism and on-line updating method
US20080133574A1 (en) * 2006-11-27 2008-06-05 Taiga Fukushima Method, program and device for retrieving symbol strings, and method, program and device for generating trie thereof
CN101226553A (en) * 2008-02-03 2008-07-23 中兴通讯股份有限公司 Method and device for storing length-various field of embedded database
US20090037456A1 (en) * 2007-07-31 2009-02-05 Kirshenbaum Evan R Providing an index for a data store
CN101944108A (en) * 2010-09-07 2011-01-12 深圳市彩讯科技有限公司 Index file and establishing method thereof
CN102682086A (en) * 2012-04-23 2012-09-19 华为技术有限公司 Data segmentation method and data segmentation equipment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1536509A (en) * 2003-04-11 2004-10-13 �Ҵ���˾ Inverted index storage method, inverted index mechanism and on-line updating method
US20080133574A1 (en) * 2006-11-27 2008-06-05 Taiga Fukushima Method, program and device for retrieving symbol strings, and method, program and device for generating trie thereof
US20090037456A1 (en) * 2007-07-31 2009-02-05 Kirshenbaum Evan R Providing an index for a data store
CN101226553A (en) * 2008-02-03 2008-07-23 中兴通讯股份有限公司 Method and device for storing length-various field of embedded database
CN101944108A (en) * 2010-09-07 2011-01-12 深圳市彩讯科技有限公司 Index file and establishing method thereof
CN102682086A (en) * 2012-04-23 2012-09-19 华为技术有限公司 Data segmentation method and data segmentation equipment

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015078273A1 (en) * 2013-11-29 2015-06-04 Tencent Technology (Shenzhen) Company Limited Method and apparatus for search
US10452691B2 (en) 2013-11-29 2019-10-22 Tencent Technology (Shenzhen) Company Limited Method and apparatus for generating search results using inverted index
CN106227677A (en) * 2016-07-20 2016-12-14 浪潮电子信息产业股份有限公司 Method for managing variable-length cache metadata
CN106227677B (en) * 2016-07-20 2018-11-20 浪潮电子信息产业股份有限公司 Method for managing variable-length cache metadata
CN106776746A (en) * 2016-11-14 2017-05-31 天津南大通用数据技术股份有限公司 A kind of creation method and device of full-text index data
CN107491523A (en) * 2017-08-17 2017-12-19 三星(中国)半导体有限公司 The method and device of data storage object
CN107491523B (en) * 2017-08-17 2020-05-05 三星(中国)半导体有限公司 Method and device for storing data object

Also Published As

Publication number Publication date
CN103020299B (en) 2016-01-13

Similar Documents

Publication Publication Date Title
US10496621B2 (en) Columnar storage of a database index
Tsirogiannis et al. Query processing techniques for solid state drives
CN102831222B (en) Differential compression method based on data de-duplication
CN110188108B (en) Data storage method, device, system, computer equipment and storage medium
US9047301B2 (en) Method for optimizing the memory usage and performance of data deduplication storage systems
US7689574B2 (en) Index and method for extending and querying index
CN103019887B (en) Data back up method and device
Ahn et al. ForestDB: A fast key-value storage system for variable-length string keys
CN101968795B (en) Cache method for file system with changeable data block length
CN106201916B (en) A kind of nonvolatile cache method towards SSD
CN101963982A (en) Method for managing metadata of redundancy deletion and storage system based on location sensitive Hash
CN107045531A (en) A kind of system and method for optimization HDFS small documents access
CN103488709A (en) Method and system for building indexes and method and system for retrieving indexes
CN105631003A (en) Intelligent index establishing, inquiring and maintaining method supporting mass data classification and counting
CN101866358A (en) Multidimensional interval querying method and system thereof
CN103176754A (en) Reading and storing method for massive amounts of small files
US9189408B1 (en) System and method of offline annotation of future accesses for improving performance of backup storage system
CN108021702A (en) Classification storage method, device, OLAP database system and medium based on LSM-tree
CN103020299B (en) Storage method and storage device for inverted indexes and their appended data in full-text search
CN103345496A (en) Multimedia information searching method and system
CN104050057B (en) Historical sensed data duplicate removal fragment eliminating method and system
CN114281989B (en) Data deduplication method and device based on text similarity, storage medium and server
CN101482839B (en) Electronic document increment memory processing method
CN104391961A (en) Read-write solution strategy for tens of millions of small file data
CN104375782A (en) Read-write solution method for ten-million-level small file data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
ASS Succession or assignment of patent right

Owner name: TIANJIN NANDA CONVENTIONAL DATA TECHNOLOGY CO., LT

Effective date: 20130807

Owner name: STATE COMPUTER NETWORK AND INFORMATION SAFETY MANA

Free format text: FORMER OWNER: TIANJIN NANDA GENERAL DATA TECHNOLOGY CO., LTD.

Effective date: 20130807

C41 Transfer of patent application or patent right or utility model
C53 Correction of patent of invention or patent application
CB03 Change of inventor or designer information

Inventor after: Fan Zhenyong

Inventor after: Wu Zhen

Inventor after: Zhang Xue

Inventor after: Cui Weili

Inventor after: Wu Xin

Inventor after: Zhao Wei

Inventor before: Zhang Xue

Inventor before: Fan Zhenyong

Inventor before: Cui Weili

Inventor before: Wu Xin

Inventor before: Zhao Wei

COR Change of bibliographic data

Free format text: CORRECT: INVENTOR; FROM: ZHANG XUE FAN ZHENYONG CUI WEILI WU XIN ZHAO WEI TO: FAN ZHENYONG WU ZHEN ZHANG XUE CUI WEILI WU XIN ZHAO WEI

Free format text: CORRECT: ADDRESS; FROM: 300384 BINHAI NEW DISTRICT, TIANJIN TO: 100029 CHAOYANG, BEIJING

TA01 Transfer of patent application right

Effective date of registration: 20130807

Address after: 100029 Beijing city Chaoyang District Yumin Road No. 3

Applicant after: State Computer Network and Information Safety Management Center

Applicant after: Tianjin NanKai University General Data Technologies Co., Ltd.

Address before: Haitai 300384 in Tianjin Binhai high tech Zone Huayuan Industrial Zone Development six road No. 6 Haitai green industry base J

Applicant before: Tianjin Nanda General Data Technology Co., Ltd.

C14 Grant of patent or utility model
GR01 Patent grant