CN106202154A

CN106202154A - Inverted index representation method and system based on a data deduplication framework

Info

Publication number: CN106202154A
Application number: CN201610464499.0A
Authority: CN
Inventors: 刘晓光; 张曌华; 梁津; 李天龙; 童健聪; 黄海兵; 王刚
Original assignee: Nankai University
Current assignee: Nankai University
Priority date: 2016-06-21
Filing date: 2016-06-21
Publication date: 2016-12-07
Anticipated expiration: 2036-06-21
Also published as: CN106202154B

Abstract

An inverted index representation method and system based on a data deduplication framework, applicable to search engines and social network data processing. The method comprises: 1. traversing the inverted lists in the inverted index, and identifying and recording the sequence patterns that recur across different inverted lists; 2. computing the length of each sequence pattern and acting on the pattern according to that length, then assigning each sequence pattern a pattern number according to the lexicographic order of the sequence patterns; 3. reducing the inverted index according to the sequence patterns, and storing the sequence patterns and the reduced inverted lists separately; 4. difference processing: computing the differences between adjacent document IDs within each sequence pattern, and representing each pattern number as a two-tuple that records the pattern number and the position offset of the adjacent pattern number. The invention effectively removes duplicate data from the inverted index, reduces the number of document IDs to be encoded, improves the compression ratio of the inverted index, and at the same time shortens the query response time of the search engine, improving user experience.

Description

Inverted index representation method and system based on a data deduplication framework

Technical field

The invention belongs to the technical field of inverted index compression for search engines, and in particular relates to an inverted index representation method and system based on a data deduplication framework. The invention is equally applicable to data compression and query problems on social network graphs.

Background technology

The inverted index is the most widely used data structure in modern search engines. It consists of two parts: a dictionary and inverted lists. The dictionary stores the terms obtained by processing the document collection, the document frequency of each term, and a pointer to the inverted list of that term. An inverted list consists of multiple postings, each corresponding to one document that contains the term. The information recorded in a posting includes the document ID (docID), the term frequency (the number of times the term occurs in the document), position information (where the term occurs in the document), and so on. In the present invention, each inverted list is assumed to consist only of a series of docIDs. See Fig. 3 for a schematic of the structure.

With the rapid development of the Internet, the storage space occupied by inverted indexes has expanded sharply on the one hand, and on the other hand the time needed to scan an inverted index has grown, which lowers the query processing efficiency of search engines. To overcome the problems caused by the continuing growth of index data, a large number of inverted index compression methods have been proposed. Compressing the inverted index not only effectively reduces the storage space it occupies but can also significantly improve query processing efficiency.

For any term t, the corresponding inverted list is typically represented as <d₁, d₂, d₃, …, d_{f_t}>, where f_t is the document frequency of the term and d₁, d₂, d₃, …, d_{f_t} are the original docIDs. Because the docIDs in an inverted list are arranged in ascending order, the d-gap form has been proposed: every docID except the first is represented by its difference from the immediately preceding docID, giving an inverted list of the form <d₁, d₂-d₁, d₃-d₂, …, d_{f_t}-d_{f_t-1}>, which is then compressed with a variable-length code. Because the differences (d-gaps) between docIDs are much smaller than the original docIDs, and smaller values generally need fewer bits to encode, an inverted index in d-gap form achieves a higher compression ratio than one in the plain form.

Although the values to be encoded in a d-gap inverted index are smaller on average, the number of values to be encoded during compression is not reduced. By observation we found that inverted indexes contain a large amount of duplicate data. Although no docID repeats within a single list, different lists may contain identical document sequences. As an illustration, suppose inverted lists l₁, l₂ and l₃ have the following form:
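The d-gap transformation can be illustrated with a minimal Python sketch. The function names `to_dgaps` and `from_dgaps` and the sample posting list are illustrative assumptions, not part of the original disclosure.

```python
def to_dgaps(doc_ids):
    """Convert an ascending docID list to d-gap form.

    The first docID is kept; every later docID is replaced by its
    difference from the immediately preceding docID.
    """
    return [d if i == 0 else d - doc_ids[i - 1] for i, d in enumerate(doc_ids)]


def from_dgaps(gaps):
    """Recover the original docIDs from d-gaps with a running sum."""
    doc_ids, total = [], 0
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids


postings = [1, 2, 5, 14, 20, 39, 40, 41, 42]  # hypothetical inverted list
gaps = to_dgaps(postings)
print(gaps)  # [1, 1, 3, 9, 6, 19, 1, 1, 1]
```

The gaps are smaller than the original docIDs, which is what lets a variable-length code spend fewer bits per value; the number of values, however, is unchanged.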

l₁→{1,2,5,14,20,39,40,41,42}

l₂→{1,2,5,6,9,10,14,16,39,40,41,50}

l₃→{1,2,5,11,14,39,40,41,43,50}

It can be seen that l₁, l₂ and l₃ all contain the document sequences {1,2,5} and {39,40,41}. If we let A = {1,2,5} and B = {39,40,41}, then l₁, l₂ and l₃ can be reduced to the following forms, respectively:

l’₁→{A,14,20,B,42}

l’₂→{A,6,9,10,14,16,B,50}

l’₃→{A,11,14,B,43,50}

Clearly |A| + |B| + |l’₁| + |l’₂| + |l’₃| < |l₁| + |l₂| + |l₃|; that is, the number of values to be encoded is reduced, and the fewer values that need to be encoded, the less storage space is required after encoding.

Beyond search engines, the inverted index model is also applicable to processing the relation data of social network topology graphs. A typical scenario is a complete social network in which relationships exist between users, for example users subscribing to or following each other, collectively referred to here as friend relationships. In graph-theoretic terms, there is a graph in which each vertex represents a particular user, and an edge connecting any two vertices A and B indicates that user A and user B are friends. Under this topology, a typical application is to search among all friends of a given user for those whose names begin with a given string, and to return the candidate result set.

Two solutions currently exist for this scenario. The first records, for each vertex in the graph, the other vertices adjacent to it. At query time, all vertices connected to the querying user are traversed in turn, and string comparison determines whether the user name of each vertex has the query string as a prefix. One drawback of this method is that all friend vertices must be traversed, although those matching the query prefix are usually only a minority; another is that string-matching scans consume considerable computing resources. A better-performing scheme stores the relation data in an inverted index structure, establishing a one-to-many mapping between each user and that user's friends (the user inverted list). Then, from statistics, a one-to-many mapping is established between each string and all users whose names have it as a prefix (the string inverted list). At query time, the user inverted list of the querying user and the string inverted list of the query are read separately, and their intersection is the candidate result set. Under this model, the compression methods and search algorithms of general inverted indexes apply equally.
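Counting the elements in the example lists makes the inequality concrete (the arithmetic below is derived from the lists shown above):

```latex
|A| + |B| + |l'_1| + |l'_2| + |l'_3| = 3 + 3 + 5 + 8 + 6 = 25
\;<\; 31 = 9 + 12 + 10 = |l_1| + |l_2| + |l_3|
```

In this small example, six fewer values need to be encoded.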
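The inverted-index scheme for the friend-prefix query can be sketched as follows. This is a minimal illustration only: the dictionaries `friends_of` and `prefix_index`, their contents, and the function names are hypothetical, and a plain two-pointer merge stands in for whatever intersection algorithm a real system would use.

```python
def intersect_sorted(a, b):
    """Intersect two ascending ID lists with a two-pointer merge."""
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return common


# Hypothetical user inverted lists: user ID -> ascending friend IDs.
friends_of = {7: [2, 5, 9, 12, 30]}
# Hypothetical string inverted lists: prefix -> ascending IDs of users
# whose names start with that prefix.
prefix_index = {"al": [3, 5, 12, 18]}


def friends_with_prefix(user, prefix):
    """Candidate result set: friends of `user` whose names start with `prefix`."""
    return intersect_sorted(friends_of.get(user, []), prefix_index.get(prefix, []))


print(friends_with_prefix(7, "al"))  # [5, 12]
```

No string comparison is performed at query time; the work reduces to intersecting two sorted integer lists, which is why general inverted index techniques carry over.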

Summary of the invention

The purpose of the invention is to solve the problem that existing inverted index compression methods must encode every docID, by providing a novel inverted index representation method and system based on a data deduplication framework that effectively removes duplicate data from the inverted index, reduces the number of docIDs to be encoded, and improves the compression ratio of the inverted index.

The invention first provides an inverted index representation method based on a data deduplication framework. With reference to Fig. 1, its main steps are:

Step 1 (S101): traverse every inverted list in the inverted index, and identify and record the sequence patterns that recur across different inverted lists;

Step 2 (S102): compute the pattern length of each repeated sequence pattern identified in step 1 and act according to that length: when the pattern length is less than a threshold k, delete the pattern; when the pattern length is greater than or equal to k, retain the pattern. The value of k is between 4 and 6. Then assign each sequence pattern a pattern number according to the lexicographic order of the sequence patterns;

Step 3 (S103): reduce the inverted index according to the sequence patterns that remain after the deletion in step 2, then store the sequence patterns and the reduced inverted lists separately, retaining for each sequence pattern its pattern length and pattern number;

Step 4 (S104): perform difference processing. For the content of each sequence pattern, compute the differences between adjacent docIDs;

For the docIDs in each reduced inverted list, the first element keeps its original value. Every other docID has subtracted from it the original value of the immediately preceding element when that element is a docID, or the largest element of the immediately preceding sequence pattern when that element is a pattern number;

For the pattern numbers in an inverted list, the first keeps its original value, and every other pattern number has subtracted from it the original value of the nearest preceding pattern number. Each pattern number is represented as a two-tuple consisting of the pattern number (or pattern number difference) and the position offset of the next pattern number. In addition, the position offset of the first pattern number in the list is recorded at the head of each inverted list, yielding the new inverted index;

Here, a sequence pattern is a document sequence that recurs across different inverted lists, and the pattern length is the number of docIDs in the sequence pattern.
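The pattern simplification in step 2 can be sketched in Python as below. The function name `simplify_patterns` is hypothetical, the candidate patterns are invented, and the sketch assumes that "lexicographic order" means element-by-element comparison of docIDs and that numbering starts at 1.

```python
def simplify_patterns(patterns, k=4):
    """Drop patterns shorter than k, then number the rest in lexicographic order.

    Returns a dict mapping each retained pattern (as a tuple of docIDs)
    to its pattern number, starting from 1.
    """
    retained = sorted({tuple(p) for p in patterns if len(p) >= k})
    return {pattern: number for number, pattern in enumerate(retained, start=1)}


# Hypothetical candidate patterns produced by step 1.
candidates = [[21, 39, 40, 49], [1, 2, 3], [1, 2, 3, 4], [5, 6, 7, 8, 9]]
print(simplify_patterns(candidates, k=4))
# {(1, 2, 3, 4): 1, (5, 6, 7, 8, 9): 2, (21, 39, 40, 49): 3}
```

Python compares tuples element by element, so `sorted` yields the assumed lexicographic order directly.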

To achieve the above purpose, the invention also provides an inverted index representation system based on a data deduplication framework. With reference to Fig. 2, the system comprises:

A pattern recognition module, which traverses every inverted list in the inverted index, and identifies and records the sequence patterns that recur across different inverted lists;

A pattern simplification module, which, on the basis of the result obtained from the pattern recognition module, computes the pattern length of each sequence pattern and acts according to that length: when the pattern length is less than the threshold k, the pattern is deleted; when the pattern length is greater than or equal to k, the pattern is retained. Each sequence pattern is then assigned a pattern number according to the lexicographic order of the sequence patterns;

An index reduction module, which reduces the inverted index according to the sequence patterns that remain after the deletion performed by the pattern simplification module, and then stores the sequence patterns and the reduced inverted lists separately, retaining for each sequence pattern its pattern length and pattern number;

A difference processing module, which performs difference processing on the sequence patterns and reduced inverted lists stored by the index reduction module: for the sequence patterns and the reduced inverted lists, the difference between adjacent elements is computed and replaces the original element; for each pattern number in an inverted list, the pattern number or pattern number difference and the position offset of the next pattern number are recorded, yielding the new inverted index.

The advantages and beneficial effects of the invention are that it effectively removes duplicate data from the inverted index, reduces the number of docIDs to be encoded, and improves the compression ratio of the inverted index. The invention can be widely applied to search engine performance optimization and inverted index compression.

Brief description of the drawings

Fig. 1 is a flowchart of the inverted index representation method based on a data deduplication framework according to the invention;

Fig. 2 is a schematic diagram of the inverted index representation system based on a data deduplication framework according to the invention;

Fig. 3 is a schematic diagram of the basic structure of an inverted index in the prior art.

Detailed description of the invention

To make the above purposes, features and advantages of the invention easier to understand, the invention is described in further detail below with reference to the drawings and specific embodiments.

Embodiment 1

The flow of the inverted index representation method based on a data deduplication framework is shown in Fig. 1. The inverted index representation system that implements the method is shown in Fig. 2.

We call a run of numerically consecutive docIDs in an inverted list a sequence interval. For example, the sequence {10,11,12,13,14} is one sequence interval, whereas the sequence {10,11,13,14} contains two sequence intervals, the first being {10,11} and the second {13,14}. By observation we found that inverted lists contain a large number of such sequence intervals. The invention therefore proposes two strategies for identifying repeated document sequences across different lists: C1, identify arbitrary repeated document sequences; C2, identify only repeated sequence intervals. When the sequence patterns are sequence intervals, we use a run-length representation to optimize how the sequence patterns are stored, further improving the compression ratio of the inverted index. The representations of sequence patterns are described below:

For strategy C1, suppose a sequence pattern M containing n docIDs is given:

M = \{d_{1}, d_{2}, \ldots, d_{n}\}, \quad d_{i+1} > d_{i}, \quad \forall i \in \{1, \ldots, n-1\}

Using the d-gap representation while retaining the pattern length n, the corresponding form is:

M_{d\text{-}gap} = \{n, d_{1}, d_{2} - d_{1}, \ldots, d_{n} - d_{n-1}\}

For strategy C2, suppose a sequence pattern M containing n docIDs is given:

M = \{d_{1}, d_{2}, \ldots, d_{n}\}, \quad d_{i+1} = d_{i} + 1, \quad \forall i \in \{1, \ldots, n-1\}

Using the run-length representation while retaining the pattern length n, the corresponding form is:

M_{run\text{-}length} = \{n, d_{1}\}

Sequence patterns based on sequence intervals are represented in the form {pattern length, first docID} because, with run-length, only 2 integers are needed to represent the original n docIDs. The number of integers to be stored is therefore reduced by (n-2), and the larger n is, the more integers run-length saves.

Suppose there is an inverted index consisting of three inverted lists l₁, l₂ and l₃:
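The two pattern representations can be sketched in Python as follows. The function names are illustrative assumptions, and patterns are assumed to be non-empty ascending lists of docIDs.

```python
def pattern_dgap(pattern):
    """Strategy C1: {n, d1, d2-d1, ..., dn-d(n-1)} for an ascending pattern."""
    n = len(pattern)
    return [n, pattern[0]] + [pattern[i] - pattern[i - 1] for i in range(1, n)]


def pattern_run_length(pattern):
    """Strategy C2: {n, d1}; valid only when the docIDs are consecutive."""
    if any(b != a + 1 for a, b in zip(pattern, pattern[1:])):
        raise ValueError("run-length form requires consecutive docIDs")
    return [len(pattern), pattern[0]]


def expand_run_length(rep):
    """Recover the full sequence interval from {n, d1}."""
    n, first = rep
    return list(range(first, first + n))


print(pattern_dgap([21, 39, 40, 49]))            # [4, 21, 18, 1, 9]
print(pattern_run_length([10, 11, 12, 13, 14]))  # [5, 10]
```

For the interval {10,11,12,13,14}, the run-length form stores 2 integers instead of 5, a saving of n-2 = 3.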

l₁→{1,2,3,14,20,21,39,40,49,51,55}

l₂→{1,2,3,9,10,11,14,21,39,40,49,55}

l₃→{1,2,3,14,16,39,49,53,55}

After steps S101, S102 and S103, the inverted index representation based on the data deduplication framework is obtained:

A→{1,2,3}

B→{21,39,40,49}

l’₁→{A,14,20,B,51,55}

l’₂→{A,9,10,11,14,B,55}

l’₃→{A,14,16,39,49,53,55}
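The reduction step can be sketched in Python as below. This is one possible reading, not the patented procedure itself: it assumes a pattern occurs as a contiguous run inside a list and uses a greedy left-to-right scan that tries longer patterns first; the function name `reduce_list` and the string labels are hypothetical.

```python
def reduce_list(postings, patterns):
    """Replace each contiguous occurrence of a pattern with its label.

    `patterns` maps a label (str) to an ascending list of docIDs.
    """
    ordered = sorted(patterns.items(), key=lambda kv: -len(kv[1]))
    reduced, i = [], 0
    while i < len(postings):
        for label, seq in ordered:
            if seq and postings[i:i + len(seq)] == seq:
                reduced.append(label)
                i += len(seq)
                break
        else:  # no pattern starts at position i: keep the docID itself
            reduced.append(postings[i])
            i += 1
    return reduced


patterns = {"A": [1, 2, 3], "B": [21, 39, 40, 49]}
l1 = [1, 2, 3, 14, 20, 21, 39, 40, 49, 51, 55]
print(reduce_list(l1, patterns))  # ['A', 14, 20, 'B', 51, 55]
```

Applied to the third example list, the sketch leaves 39 and 49 in place because the full run 21, 39, 40, 49 does not occur there.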

First, difference processing is applied to the docIDs in inverted lists l’₁, l’₂ and l’₃. The rules are: 1) a docID that is the first element of an inverted list is kept unchanged; 2) for each remaining docID, if the immediately preceding element is a docID, that element is subtracted from it; 3) if the immediately preceding element is a pattern number, the largest docID in the sequence pattern corresponding to that pattern number is subtracted from it. The processed inverted lists are therefore:

l”₁→{A,11,6,B,2,4}

l”₂→{A,6,1,1,3,B,6}

l”₃→{A,11,2,23,10,4,2}
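The three difference rules can be sketched in Python as follows. The function name `delta_reduced` and the use of string labels for patterns are illustrative assumptions.

```python
def delta_reduced(reduced, patterns):
    """Apply difference processing to the docIDs of a reduced inverted list.

    `reduced` mixes pattern labels (str) and docIDs (int); `patterns`
    maps each label to its ascending docID list.
    """
    out = []
    prev = None  # value the next docID is measured against
    for item in reduced:
        if isinstance(item, str):
            out.append(item)
            prev = patterns[item][-1]  # largest docID of the pattern
        else:
            out.append(item if prev is None else item - prev)
            prev = item
    return out


patterns = {"A": [1, 2, 3], "B": [21, 39, 40, 49]}
print(delta_reduced(["A", 14, 20, "B", 51, 55], patterns))
# ['A', 11, 6, 'B', 2, 4]
```

Because patterns are ascending, the last docID of a pattern is also its largest, which is the value rule 3 subtracts.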

We then describe each pattern number in an inverted list by a two-tuple of integers. The first value of the two-tuple is the pattern number itself (when it is the first pattern number in the list) or the difference between the pattern number and the nearest preceding pattern number (when it is not the first pattern number in the list). The second value is the position offset in the list between the next pattern number and this pattern number (0 if this is the last pattern number in the list). According to the lexicographic order of the contents of A and B, we assign A and B the numbers 1 and 2 in ascending order. The above inverted lists can then be written as:

l”₁→{(1,3),11,6,(1,0),2,4}

l”₂→{(1,5),6,1,1,3,(1,0),6}

l”₃→{(1,0),11,2,23,10,4,2}
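The two-tuple encoding of pattern numbers can be sketched in Python as below. The function name `encode_pattern_numbers` is hypothetical. Returning 0 as the head offset for a list without any pattern is an assumption made only so the function always returns a value; the text stores the extra integer only for lists that contain pattern numbers.

```python
def encode_pattern_numbers(items, numbers):
    """Replace pattern labels by (number or number difference, offset to next pattern).

    `items` mixes pattern labels (str) and integers; `numbers` maps each
    label to its pattern number. Returns (head_offset, encoded_list),
    where head_offset is the 1-based position of the first pattern.
    """
    positions = [i for i, x in enumerate(items) if isinstance(x, str)]
    encoded = list(items)
    prev_number = None
    for idx, pos in enumerate(positions):
        number = numbers[items[pos]]
        value = number if prev_number is None else number - prev_number
        offset = positions[idx + 1] - pos if idx + 1 < len(positions) else 0
        encoded[pos] = (value, offset)
        prev_number = number
    head_offset = positions[0] + 1 if positions else 0  # 0 is an assumed marker
    return head_offset, encoded


numbers = {"A": 1, "B": 2}
print(encode_pattern_numbers(["A", 11, 6, "B", 2, 4], numbers))
# (1, [(1, 3), 11, 6, (1, 0), 2, 4])
```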

For an inverted list that contains pattern numbers, one additional integer must be stored to indicate the position offset of the first pattern number within the inverted list. In the above example, the first pattern number of every inverted list is A and it is always the head element of the list, so the offset is uniformly 1. In addition, difference processing must also be applied to the docIDs within the sequence patterns: except for the first docID, which is kept unchanged, each element has the immediately preceding element subtracted from it. The final inverted index can be described as:

A→{1,1,1}

B→{21,18,1,9}

l”₁→{(1,3),11,6,(1,0),2,4}

l”₂→{(1,5),6,1,1,3,(1,0),6}

l”₃→{(1,0),11,2,23,10,4,2}
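As a consistency check on the example, the final representation can be decoded back to the original lists. The Python sketch below is an illustration under assumptions: the function name `decode_list` is hypothetical, patterns are stored in d-gap form keyed by pattern number, and the position offsets in the two-tuples are ignored because a sequential decode does not need them.

```python
from itertools import accumulate


def decode_list(encoded, pattern_gaps):
    """Rebuild the original docIDs from a difference-processed inverted list.

    `encoded` mixes two-tuples (pattern number or difference, offset) and
    docID differences; `pattern_gaps` maps pattern number -> d-gap list.
    """
    doc_ids = []
    prev_doc = 0     # last docID emitted
    prev_number = 0  # last pattern number seen
    for item in encoded:
        if isinstance(item, tuple):
            prev_number += item[0]
            docs = list(accumulate(pattern_gaps[prev_number]))
            doc_ids.extend(docs)
            prev_doc = docs[-1]  # largest docID of the pattern
        else:
            prev_doc += item
            doc_ids.append(prev_doc)
    return doc_ids


pattern_gaps = {1: [1, 1, 1], 2: [21, 18, 1, 9]}  # A and B from the example
print(decode_list([(1, 3), 11, 6, (1, 0), 2, 4], pattern_gaps))
# [1, 2, 3, 14, 20, 21, 39, 40, 49, 51, 55]
```

Decoding the other two lists in the same way reproduces l₂ and l₃, so the representation loses no information.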

Embodiment 2

On the TREC GOV2 data set we compared the number of bits needed per docID after encoding for the various index forms, together with the corresponding decompression speed. EF denotes the inverted index representation based on an optimal partitioning strategy and Elias-Fano encoding; TD denotes the inverted index based on traditional d-gaps; and R denotes the index representation based on the data deduplication framework (I and II denote the repeated-sequence identification strategy used, corresponding to strategies C1 and C2 above, respectively). The inverted index data sets used are described as follows:

(1) TREC GOV2 is a data set crawled from the .gov domain in 2004, containing more than 25 million web pages in total;

(2) We use the TREC 2009 query set as the query test set, containing 32244 queries in total, to test the average decompression speed of each index form on this query set;

(3) URL denotes the data set obtained by reordering the GOV2 data set according to web page address; TMF and IBDA denote the data sets obtained by reordering the GOV2 data set with the TMF and IBDA reordering strategies, respectively.

Table 1

Table 1 compares the actual compression performance of each inverted index representation under different forms and different reordering strategies, where both the R and TD methods use OptPForD encoding. The experimental results show that, with this encoding, the index based on the data deduplication framework achieves a better compression ratio than the traditional d-gap index, with an improvement of more than 10% in every case. Compared with the EF method, R-I and R-II still retain some advantage. Moreover, because R-I can identify more sequence patterns than R-II, its compression is better.

Table 2

Table 2 gives the corresponding decompression speed comparison. Because Elias-Fano encoding does not involve an explicit decompression step, its statistics are omitted from the table. The experimental results show that the index based on the data deduplication framework is more favourable for decoding than the traditional d-gap index; in particular, on indexes reordered with IBDA and TMF, the R method obtains a marked decompression speed-up. Compared with R-I, R-II can recover a complete sequence pattern from just the pattern length and the first docID, which saves a large number of memory accesses, so its decompression speed is generally higher than that of R-I.

For social network data, we compared the actual compression performance of the above index representations on a publicly available partial Facebook relation data set; the method names and meanings are the same as in the experiment above. This data set contains about 51 million users and records the subscription relationships between them. We organized the raw data into inverted index form and, according to the query set, measured the decompression speed of the data touched by the queries.

Table 3

Table 4

Table 3 compares the actual compression performance of the inverted index in the original order and after reordering the IDs with IBDA and TMF. It shows that, combined with the IBDA and TMF reordering strategies, both R methods perform better in compression. R-I retains a certain advantage over the EF method, and compared with the traditional TD method the R methods always achieve a higher compression ratio. Table 4 compares the decompression speed of each method under different reordering strategies. It shows that, except in the original order, the R methods after reordering improve decompression speed by about 3.8% to 17.1% over the TD method.

The inverted index representation method and system of the invention have been described in detail above. Specific examples are used herein to explain the principle and embodiments of the invention, and the description of the above embodiments is intended only to help in understanding the method of the invention and its core idea. For a person of ordinary skill in the art, the specific embodiments and the scope of application may vary according to the idea of the invention. In summary, the content of this specification should not be construed as limiting the invention.

Claims

1. An inverted index representation method based on a data deduplication framework, characterized by comprising:

Step 1: traversing every inverted list in the inverted index, and identifying and recording the sequence patterns that recur across different inverted lists;

Step 2: computing the pattern length of each repeated sequence pattern identified in step 1 and acting according to that length: when the pattern length is less than a threshold k, deleting the pattern; when the pattern length is greater than or equal to the threshold k, retaining the pattern; then assigning each sequence pattern a pattern number according to the lexicographic order of the sequence patterns;

Step 3: reducing the inverted index according to the sequence patterns that remain after the deletion in step 2, then storing the sequence patterns and the reduced inverted lists separately, retaining for each sequence pattern its pattern length and pattern number;

Step 4: performing difference processing: for the content of each sequence pattern, computing the differences between adjacent docIDs;

for the docIDs in each reduced inverted list, the first element keeps its original value, and every other docID has subtracted from it the original value of the immediately preceding element when that element is a docID, or the largest element of the immediately preceding sequence pattern when that element is a pattern number;

for the pattern numbers in an inverted list, the first keeps its original value, and every other pattern number has subtracted from it the original value of the nearest preceding pattern number; each pattern number is represented as a two-tuple consisting of the pattern number or pattern number difference and the position offset of the next pattern number; in addition, the position offset of the first pattern number in the list is recorded at the head of each inverted list, yielding the new inverted index;

wherein a sequence pattern is a document sequence that recurs across different inverted lists, and the pattern length is the number of docIDs in the sequence pattern.

2. An inverted index representation system based on a data deduplication framework, characterized by comprising:

a pattern recognition module, which traverses every inverted list in the inverted index, and identifies and records the sequence patterns that recur across different inverted lists;

a pattern simplification module, which, on the basis of the result obtained from the pattern recognition module, computes the pattern length of each sequence pattern and acts according to that length: when the pattern length is less than a threshold k, the pattern is deleted; when the pattern length is greater than or equal to the threshold k, the pattern is retained; each sequence pattern is then assigned a pattern number according to the lexicographic order of the sequence patterns;

an index reduction module, which reduces the inverted index according to the sequence patterns that remain after the deletion performed by the pattern simplification module, and then stores the sequence patterns and the reduced inverted lists separately, retaining for each sequence pattern its pattern length and pattern number;

a difference processing module, which performs difference processing on the sequence patterns and reduced inverted lists stored by the index reduction module: for the sequence patterns and the reduced inverted lists, the difference between adjacent elements is computed and replaces the original element; for each pattern number in an inverted list, the pattern number or pattern number difference and the position offset of the next pattern number are recorded, yielding the new inverted index.