CN101075261A - Method and device for compressing index - Google Patents

Publication number
CN101075261A
Authority
CN
China
Prior art keywords
index
position data
mark
sign
section
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 200710110850
Other languages
Chinese (zh)
Other versions
CN100498794C (en)
Inventor
孙良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Tencent Computer Systems Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CNB2007101108507A (patent CN100498794C)
Publication of CN101075261A
Application granted
Publication of CN100498794C
Legal status: Active
Anticipated expiration

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

A method for compressing an index identifier (ID) includes: reading a first index ID; determining the range to which the value of the first index ID belongs; determining a length unit and a mark from that range; representing the value of the first index ID in the determined length unit to obtain a second index ID; and replacing the first index ID with the second index ID together with the determined mark.

Description

Method and apparatus for index compression
Technical field
The present invention relates to the field of search technology, and in particular to a method and an apparatus for compressing an index.
Background art
Internet technology has developed rapidly, and as Internet resources have multiplied, so has search engine technology. Within a search engine, back-end indexing is the core technology: its performance directly determines the speed and the results of retrieval.
In back-end indexing, a conventional file index is implemented mainly as an inverted file. An inverted file stores each searchable keyword together with information about the documents that contain it. Retrieval consists of finding, in a short time, the set of all documents that contain a given query string.
The query string is the search expression the user enters. It may contain one or more keywords separated by spaces. In Internet search, a space between keywords means that the keywords on either side of it are combined with a logical AND. A keyword here is a character string made up of one or more morphemes, and a word segmentation system can segment it further. A keyword that segments into two morphemes is called a 2-element compound morpheme; one that segments into three morphemes is called a 3-element compound morpheme. A morpheme is the smallest linguistic unit that can express an independent meaning, and it cannot be subdivided. In Chinese it is a word produced by the word segmentation system; in English it is a basic English word or letter.
The retrieved document set is usually represented by document numbers, that is, index identifiers (IDs). Each document in the collection is given a unique number, so that every document corresponds to exactly one ID, which makes the document easy to locate. In Fig. 1, t denotes a searchable keyword, and all documents containing t form a set; di denotes the series of document IDs for t; Wdi,t denotes the weight of keyword t in document di; and loci denotes the offset at which t occurs in document di. The inverted index file consists of N data items of the form shown in Fig. 1, where N is the total number of distinct keywords obtained from the whole document collection during retrieval.
To improve compression, the index content of the inverted file is generally rewritten before it is compressed. Index content means the individual index values in the inverted file, such as di and loci. The index values are sorted by size and then represented as differences (d-gaps) rather than as actual values. For example, (d1, d2, d3, d4) = (1, 5, 7, 9) becomes (1, 4, 2, 2) in difference form. Each original value equals the sum of the difference at its own position and all preceding differences; for example, the third value is 1 + 4 + 2 = 7. The index values are compressed after they have been converted to differences. Current compression methods can be divided into fixed-length and variable-length methods. The main ones are UNARY (unary coding), Elias coding and Golomb coding. Because of the limitations of these algorithms, the compression ratio they achieve is only about 10%, so their effect on index compression is currently unsatisfactory.
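The difference (d-gap) rewriting described above can be sketched in Python as follows. This is a minimal illustration rather than code from the patent; the function names `to_gaps` and `from_gaps` are hypothetical.

```python
from itertools import accumulate

def to_gaps(doc_ids):
    """Sort the IDs and replace each one by its difference from the previous ID."""
    gaps = []
    previous = 0
    for doc_id in sorted(doc_ids):
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps

def from_gaps(gaps):
    """Recover the original IDs as running sums of the gaps."""
    return list(accumulate(gaps))

print(to_gaps([1, 5, 7, 9]))    # [1, 4, 2, 2]
print(from_gaps([1, 4, 2, 2]))  # [1, 5, 7, 9]
```

Because the gaps of a sorted ID list are usually much smaller than the IDs themselves, a code that favours small integers benefits from this step.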
Summary of the invention
In view of this, the present invention provides a method and an apparatus for index compression, to solve the problem that the above compression methods compress index data poorly.
To solve the above problem, the invention provides a method for compressing an index ID, comprising:
reading a first index ID, determining the range to which the value of the first index ID belongs, and determining a mark and a length unit from that range;
taking the value of the first index ID, represented in the determined length unit, as a second index ID;
replacing the first index ID with the second index ID together with the determined mark.
Wherein,
the first index ID, the second index ID and the mark are binary bit sequences.
Wherein,
determining the range to which the value of the first index ID belongs comprises:
judging, from the value of the bit sequence of the first index ID, whether that value belongs to a definite value interval or to the indeterminate value interval.
Wherein,
if the value of the first index ID is judged to belong to a definite value interval, determining the mark and the length unit from the range comprises:
obtaining the fixed mark and the fixed length unit that uniquely correspond to that value interval, and using them as the determined mark and length unit.
Wherein, replacing the first index ID comprises:
adding the determined mark in front of the second index ID, and using the second index ID with the mark added in place of the first index ID.
Wherein,
if the value of the first index ID is judged to belong to the indeterminate value interval, determining the mark and the length unit from the range comprises:
dividing the bits occupied by the bytes that hold the value of the first index ID into bit segments, each as long as the number of bits occupied by the largest definite value interval;
taking the value of the bit sequence in the last bit segment, which cannot be divided further, determining the value interval to which that value belongs, and obtaining the uniquely corresponding fixed mark and fixed length unit;
using the cyclic marks corresponding to the divided bit segments, together with the fixed mark of the last bit segment, as the determined mark; and using the number of bits occupied by said bytes as a variable length unit, which serves as the determined length unit.
Wherein, replacing the first index ID comprises:
dividing the second index ID into bit segments, each as long as the number of bits occupied by the largest definite value interval; adding a cyclic mark in front of each divided bit segment; adding the determined fixed mark in front of the last bit segment, which cannot be divided further; and replacing the first index ID with the second index ID to which the fixed mark and cyclic marks have been added.
The invention provides an apparatus for compressing an index ID, comprising:
a reading unit, configured to read a first index ID;
a judging unit, configured to determine the range to which the value of the first index ID belongs;
an arithmetic unit, configured to determine a mark and a length unit from the determined range;
a converting unit, configured to take the value of the first index ID, represented in the determined length unit, as a second index ID;
a replacing unit, configured to replace the first index ID with the second index ID together with the determined mark.
The invention provides a method for compressing index position data, comprising:
dividing a document into a predetermined number of sections, searching for a character in those sections, flagging whether the character is present in each section, and using the flags obtained from the search as first index position data;
obtaining, from the first index position data, the number of sections in which the character is present, determining the quantity range to which that number belongs, and determining a mark and a length unit from that quantity range;
representing, in the determined length unit, the sections of the first index position data in which the character is present, and using the result as second index position data;
replacing the first index position data with the second index position data together with the determined mark.
Wherein, the first index position data, the second index position data and the mark are binary bit sequences.
Wherein, determining the quantity range to which the obtained number of sections belongs comprises:
judging whether the obtained number of sections belongs to a definite quantity interval or to the indeterminate quantity interval.
Wherein, if the obtained number of sections is judged to belong to a definite quantity interval, determining the mark and the length unit from the quantity range comprises:
obtaining the fixed mark and the fixed length unit that uniquely correspond to that quantity interval, and using them as the determined mark and length unit.
Wherein, replacing the first index position data comprises:
adding the determined fixed mark in front of the second index position data, and replacing the first index position data with the second index position data to which the fixed mark has been added.
Wherein, if the obtained number of sections is judged to belong to the indeterminate quantity interval, determining the mark and the length unit from the quantity range comprises:
dividing the predetermined number of sections into groups, and recording for each group a group mark indicating whether the character is present in that group;
using the fixed mark that uniquely corresponds to the indeterminate quantity interval, together with the group marks, as the determined mark;
using the bits occupied by the groups that contain sections in which the character is present as the determined length unit.
Wherein, replacing the first index position data comprises:
adding the determined group marks in front of the second index position data, adding the determined fixed mark in front of the group marks, and replacing the first index position data with the second index position data to which the group marks and the fixed mark have been added.
The invention also provides an apparatus for compressing index position data, comprising:
a first index position data unit, configured to divide a document into a predetermined number of sections, search for a character in those sections, flag whether the character is present in each section, and use the flags obtained from the search as first index position data;
a judging unit, configured to obtain, from the first index position data, the number of sections in which the character is present, and to determine the quantity range to which that number belongs;
an arithmetic unit, configured to determine a mark and a length unit from the determined quantity range;
a second index position data unit, configured to represent, in the determined length unit, the sections of the first index position data in which the character is present, and to use the result as second index position data;
a replacing unit, configured to replace the first index position data with the second index position data together with the determined mark.
With the method and apparatus for index compression of the present invention, a higher compression ratio is obtained for index IDs and index position data, generally about 10%-15%, an improvement of 5% over prior-art index compression. Because the index data are stored on disk after compression, the disk space they occupy is effectively reduced. During retrieval, the compressed data are read from disk into memory and decompressed, and the index data containing the keyword are then queried; since compression reduces the volume of index data, retrieval with the invention reduces the amount of I/O between disk and memory, shortens read time, and thus improves query response speed.
Description of drawings
Fig. 1 is a structural diagram of a data item;
Fig. 2 is a flowchart of Embodiment 1 of the invention;
Fig. 3 is a flowchart of Embodiment 2 of the invention;
Fig. 4 is a structural diagram of the apparatus of Embodiment 3 of the invention;
Fig. 5 is a structural diagram of the apparatus of Embodiment 4 of the invention.
Embodiments
The present invention achieves a good technical effect when compressing index data. At present, index data such as index IDs and index positions are all 32 bits wide. Index IDs and index positions may be stored together in one file, or the index IDs may be stored in one file and the index positions in another. Neither choice affects the implementation of the invention.
The embodiments of the invention are described in detail below with reference to the drawings. The compression of index IDs is described first. An index ID is the datum used to locate a document that contains a keyword. With 32 bits, 2 to the power 32 different numbers can be represented; when these are used as index IDs, the space they occupy is evidently very large. The compression of such index IDs is described in Embodiment 1 below, with reference to Fig. 2.
Step S201: read the first index ID;
First, the index ID to be compressed is read. For ease of description, the index ID to be compressed is called the first index ID, and the index ID obtained when compression is complete is called the second index ID. In Embodiment 1 the first index ID is a 32-bit sequence.
Step S202: determine the range to which the value of the first index ID belongs, and determine the mark and the length unit from that range;
Compression is achieved by removing bit positions that carry no information, or by representing a bit sequence with a shorter one. Embodiment 1 therefore divides the values of index IDs into several different ranges.
Mark field (2 bits) | Meaning | Example
01 | Followed by 1 group of 2-bit units | "0100" represents the integer 0; "0101" represents the integer 1; and so on. Encodes the 4 integers from 0 (0100) to 3 (0111)
10 | Followed by 2 groups of 2-bit units | "100100" represents the integer 4; "100101" represents the integer 5; and so on. Encodes the 12 integers from 4 (100100) to 15 (101111)
11 | Followed by 3 groups of 2-bit units | "11010000" represents the integer 16; "11111111" represents the integer 63; and so on. Encodes the 48 integers from 16 (11010000) to 63 (11111111)
00 | Followed by 3 groups of 2-bit units, after which a new mark follows | "000100000100" represents the integer 64; "000100000101" represents the integer 65; and so on. Integers above 63 are encoded in multiple stages
Table 1
Table 1 gives four ranges for the value of the first index ID. The first three ranges have fixed value intervals: the first is 0-3, the second is 4-15 and the third is 16-63. A value greater than 63 belongs to the fourth range, which is called the indeterminate value interval. The value of the first index ID is examined first to decide whether it belongs to one of the definite value intervals or to the indeterminate value interval, and the mark and the length unit are then determined from that range.
The two cases are illustrated below with one example each.
The first example is the case where the value of the first index ID falls in a definite value interval. Suppose the upper 28 bits of the binary sequence of the first index ID are 0 and the lower 4 bits are 1100, so that the value is decimal 12. The value lies in the interval 4-15, so the corresponding fixed mark in the mark field is 10, and the fixed length unit needed to represent the value is 4 bits. The fixed mark is used as the determined mark, and the fixed length unit as the determined length unit.
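The range lookup for the definite value intervals can be sketched as a small table-driven function. This is an illustrative sketch only; the names `FIXED_INTERVALS`, `classify` and `encode_small` are hypothetical, and the interval bounds, marks and lengths are those of Table 1.

```python
# (low, high, fixed mark, fixed length unit in bits), as listed in Table 1
FIXED_INTERVALS = [
    (0, 3, "01", 2),
    (4, 15, "10", 4),
    (16, 63, "11", 6),
]

def classify(value):
    """Return (mark, length_bits) for a definite interval, or None above 63."""
    if value < 0:
        raise ValueError("index IDs are non-negative")
    for low, high, mark, length_bits in FIXED_INTERVALS:
        if low <= value <= high:
            return mark, length_bits
    return None  # indeterminate value interval

def encode_small(value):
    """Encode a value from 0 to 63 as mark + value in the fixed length unit."""
    found = classify(value)
    if found is None:
        raise ValueError("value belongs to the indeterminate interval")
    mark, length_bits = found
    return mark + format(value, f"0{length_bits}b")

print(classify(12))      # ('10', 4)
print(encode_small(12))  # 101100
```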
The second example is the case where the value of the first index ID falls in the indeterminate value interval. Suppose the upper 25 bits of the binary sequence of the first index ID are 0 and the lower 7 bits are 1000001, so that the value is decimal 65. The value is greater than 63 and therefore belongs to the indeterminate value interval. To determine the mark and the length unit, the bits occupied by the bytes that hold the value are first divided into bit segments whose length is the number of bits occupied by the largest definite value interval. The value of the bit sequence in the last bit segment, which cannot be divided further, is then taken; the value interval to which it belongs is determined, giving the uniquely corresponding fixed mark and fixed length unit. The cyclic marks of the divided bit segments and the fixed mark of the last bit segment together form the determined mark, and the number of bits occupied by said bytes is the variable length unit, which serves as the determined length unit.
In detail, when written as a binary sequence the value 65 occupies 8 bits in total (01000001). The largest definite value interval is the third range, which occupies 6 bits. Dividing the 8 bits from the high end in units of 6 bits yields one bit segment, 010000. This divided segment corresponds to the cyclic mark, 00 in the mark field; the cyclic mark 00 indicates that the 6 bits after it are followed immediately by another cyclic mark or by a fixed mark. The last bit segment holds the two bits 01, which is decimal 1. The value interval of this value must be determined: 1 lies in the interval 0-3, so the fixed mark for the last bit segment is 01. The cyclic mark 00 and the fixed mark 01 together form the determined mark, and the 8 bits occupied by the byte that holds the value of the first index ID form the variable length unit, which serves as the determined length unit.
Step S203: take the value of the first index ID, represented in the determined length unit, as the second index ID;
The value of the first index ID, represented in the length unit determined in step S202, is used as the second index ID. In the first example the determined length unit is 4 bits, so the second index ID is 1100. In the second example the determined length unit is 8 bits, so the second index ID is 01000001.
Step S204: replace the first index ID with the second index ID together with the determined mark.
The preceding steps have determined the second index ID and the mark; in this step they replace the first index ID. In the first example, the determined mark is added in front of the second index ID, and the result takes the place of the first index ID. That is, the determined mark 10 is added in front of the second index ID 1100 to give the sequence 101100; replacing the first index ID with this sequence saves 26 bits.
In the second example, the second index ID is divided into bit segments whose length is the number of bits occupied by the largest definite value interval, a cyclic mark is added in front of each divided bit segment, the determined fixed mark is added in front of the last bit segment, which cannot be divided further, and the second index ID with the fixed mark and cyclic marks added replaces the first index ID. Concretely, the second index ID 01000001 is divided in units of 6 bits, the size of the largest definite value interval, giving one bit segment 010000 and a last bit segment 01. The determined cyclic mark 00 is added in front of the divided bit segment and the fixed mark 01 in front of the last bit segment, giving the final replacement sequence 000100000101. Replacing the first index ID with this sequence saves 20 bits.
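Steps S201 to S204 can be sketched as a single encoder that returns the replacement bit string. This is a hedged illustration, not the patent's implementation. The name `encode_id` is hypothetical, and the sketch makes one assumption beyond the source: for values above 63, the fixed mark of the last bit segment is chosen from the width of that segment (2, 4 or 6 bits) so that a decoder can always tell how many bits to read. For an 8-bit value such as 65 this gives the same mark as the value-based rule in the text.

```python
CYCLE_MARK = "00"                            # 6 payload bits follow, then another mark
FIXED_MARKS = {2: "01", 4: "10", 6: "11"}    # payload width in bits -> fixed mark
SEGMENT_BITS = 6                             # bits of the largest definite interval (16-63)

def encode_id(value):
    """Return the compressed bit string that replaces a non-negative index ID."""
    if value < 0:
        raise ValueError("index IDs are non-negative")
    if value < 64:
        # Definite value interval: one fixed mark plus 2, 4 or 6 payload bits.
        width = 2 if value < 4 else 4 if value < 16 else 6
        return FIXED_MARKS[width] + format(value, f"0{width}b")
    # Indeterminate interval: write the value in whole bytes (the variable
    # length unit), then split from the high end into 6-bit segments.
    n_bytes = (value.bit_length() + 7) // 8
    bits = format(value, f"0{n_bytes * 8}b")
    segments = [bits[i:i + SEGMENT_BITS] for i in range(0, len(bits), SEGMENT_BITS)]
    coded = [CYCLE_MARK + segment for segment in segments[:-1]]
    last = segments[-1]
    coded.append(FIXED_MARKS[len(last)] + last)  # assumption: mark chosen by width
    return "".join(coded)

print(encode_id(12))  # 101100
print(encode_id(65))  # 000100000101
```

A whole number of bytes always leaves a last segment of 2, 4 or 6 bits, so every code ends with exactly one fixed mark.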
This completes the compression of the index ID data. From the compression process of Embodiment 1 it follows that, for a 32-bit first index ID: a value in the range 0-3 needs only 4 bits, saving 28 bits; a value in the range 4-15 needs only 6 bits, saving 26 bits; a value in the range 16-63 needs only 8 bits, saving 24 bits; and a value between 63 and 2 to the power 24 can be encoded in no more than 32 bits. The specific value intervals and the 32-bit index ID used in the above embodiment are only examples, and the invention is not limited to these particular values. The compression described above works much better when the index ID is a small integer than when it is a large one, so before compression a large index ID can be converted into differences that are small integers. For example, if the number to be compressed is 16, it can be converted into 2, 4, 4 and 6, each of which is compressed by the method of the invention; on recovery, the decompressed numbers are added together to obtain the original number 16. Converting to small integers gives a higher compression ratio and therefore a better compression effect.
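The source does not spell out decompression, but the marks of Table 1 make the code self-delimiting, so a decoder can be sketched as below. The function name `decode_stream` is hypothetical, and the sketch assumes that each fixed mark announces the width of the payload that follows it (01: 2 bits, 10: 4 bits, 11: 6 bits) and that the cyclic mark 00 announces 6 bits followed by another mark.

```python
PAYLOAD_BITS = {"00": 6, "01": 2, "10": 4, "11": 6}  # mark -> payload width in bits

def decode_stream(bits):
    """Decode a concatenation of compressed index IDs into a list of integers."""
    values = []
    position = 0
    payload = ""
    while position < len(bits):
        mark = bits[position:position + 2]
        width = PAYLOAD_BITS[mark]
        payload += bits[position + 2:position + 2 + width]
        position += 2 + width
        if mark != "00":  # a fixed mark closes the current value
            values.append(int(payload, 2))
            payload = ""
    if payload:
        raise ValueError("stream ends after a cyclic mark")
    return values

# 12, 65 and 0 encoded back to back, using the codes from the examples
print(decode_stream("101100" + "000100000101" + "0100"))  # [12, 65, 0]

# Gap-coded small integers are summed on recovery, as in the 2, 4, 4, 6 example
print(sum(decode_stream("0110" + "100100" + "100100" + "100110")))  # 16
```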
Index IDs are only one kind of index data; there are also index position data, which can take more than one form. In the present invention, index position data are obtained by dividing a document into a predetermined number of sections, for example 32 sections, searching for the keyword (that is, the character to be retrieved) in those sections, and flagging whether the character is present in each section; the flags obtained from the search are the index position data. The compression of index position data is described below in Embodiment 2.
In this embodiment the predetermined number of sections is 32. While each section is searched, if the character or morpheme to be retrieved occurs in the section (the number of occurrences may be between 3 and 20, or some other predetermined value), the flag for that section is set to 1; if it does not occur in the section, the flag is set to 0. This yields a flag word, which is the index position data. The index position data to be compressed are called the first index position data, and the index position data after compression are called the second index position data. The compression process is described with reference to Fig. 3.
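Building the 32-bit flag word can be sketched as follows. The function name `build_position_bitmap` is hypothetical, and two assumptions are made that the source does not settle: a section counts as a hit when the term occurs in it at least once, and the first section maps to the lowest-order bit (the numbering used in the first example of this embodiment).

```python
SECTION_COUNT = 32

def build_position_bitmap(sections, term):
    """Return a 32-bit integer whose bit k-1 is set when section k contains term."""
    if len(sections) != SECTION_COUNT:
        raise ValueError("expected exactly 32 sections")
    bitmap = 0
    for index, text in enumerate(sections):
        if term in text:            # assumption: a single occurrence is a hit
            bitmap |= 1 << index    # assumption: section 1 is the lowest bit
    return bitmap

sections = [""] * SECTION_COUNT
sections[1] = "index compression"   # the term appears only in section 2
print(format(build_position_bitmap(sections, "index"), "032b"))
```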
Step S301: obtain, from the first index position data, the number of sections in which the character is present;
Once the first index position data have been obtained, the number of sections that contain the character to be retrieved can be read from the flags in the first index position data.
Step S302: determine the quantity range to which the obtained number of sections belongs, and determine the mark and the length unit from that quantity range;
Compression is achieved by removing bit positions that carry no information, or by representing a bit sequence with a shorter one. Embodiment 2 therefore defines several ranges according to the number of sections. The number of sections obtained in step S301 is judged against Table 2 below.
Coding scheme | Meaning | Example
"0" + 5 bits | The morpheme occurs in exactly 1 of the 32 sections. This case is marked with 1 bit (0), and 5 bits store the number of the section in which it occurs | The code "000010" means the morpheme occurs only in section 2
"10" + 10 bits | The morpheme occurs in exactly 2 of the 32 sections. This case is marked with 2 bits (10), and two 5-bit groups store the numbers of the two sections in which it occurs | The code "100001000001" means the morpheme occurs only in section 2 and section 1
"11" + 4 bits + {8-bit tuple} * N | The morpheme occurs in more than 2 of the 32 sections | The code "11 1000 10010100" means the morpheme occurs in the first group, in its sub-sections 1, 4 and 6
Table 2
According to Table 2, the range of the obtained number of sections falls into two cases. In the first case only one or two sections are obtained; in the second case more than two sections are obtained. The first case is defined as a definite quantity interval, and the second case as the indeterminate quantity interval. The range to which the obtained number of sections belongs is judged, and the mark and the length unit are determined from that range.
The determination of the mark and the length unit is illustrated for each of the two cases with one example. The first example belongs to the first case. Suppose the upper 30 bits of the first index position data are 0 and the lower 2 bits are 10. One section is obtained, which belongs to the definite quantity interval for one section. From Table 2, the coding scheme for one section is "0" + 5 bits, where 0 is the fixed mark and 5 bits is the fixed length unit. The fixed mark is used as the determined mark, and the fixed length unit as the determined length unit.
The second example belongs to the second case. Suppose the upper 8 bits of the first index position data are 10010100 and the lower 24 bits are 0. Three sections are obtained, namely sections 1, 4 and 6; this is more than two sections, so the number belongs to the indeterminate quantity interval. In this case the predetermined number of sections must be divided into groups, and for each group a group mark is recorded that indicates whether any section in the group contains the character being retrieved: 1 if so, 0 if not. In this example the 32 sections are divided into 4 groups in order from the high end to the low end (the reverse order is equally possible), so the upper 8 bits form the first group and the group marks are 1000. Because the number of sections belongs to the indeterminate quantity interval, the fixed mark that uniquely corresponds to that interval is 11. The fixed mark of the indeterminate quantity interval and the group marks together form the determined mark; that is, the fixed mark 11 and the group marks 1000 are the determined mark. The bits occupied by the groups that contain sections with the character form the determined length unit. In this example only one group contains sections with the character being retrieved, so in the final {8-bit tuple} * N the value of N is 1: there is a single 8-bit tuple, and the 8 bits occupied by this group are the determined length unit.
Step S303: Represent, in the determined length unit, the sections of the first index position data in which the character exists, and take the result as the second index position data.
Continuing the two examples from the previous step: in the first example, the section containing the character is represented with the 5 bits of the length unit obtained. Since it is the second section, the result is 00010, which is taken as the second index position data. In the second example the length unit is 8 bits, so the result is 10010100, which is taken as the second index position data.
Step S304: Replace the first index position data with the second index position data together with the determined mark.
The first index position data is replaced using the mark determined in the preceding steps and the second index position data. In the first example, the determined fixed mark is added before the second index position data, and the second index position data with the fixed mark added replaces the first index position data. That is, the determined mark 0 is added before the second index position data 00010, giving the sequence 000010; replacing the first index position data with this sequence saves 26 bits. In the second example, the determined group mark is added before the second index position data, the determined fixed mark is added before the group mark, and the second index position data with the group mark and fixed mark added replaces the first index position data. That is, group mark 1000 and fixed mark 11 are added before the second index position data 10010100, giving the sequence 11100010010100; replacing the first index position data with this sequence saves 18 bits.
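The replacement in step S304 amounts to concatenating bit strings in the order fixed mark, group mark (if any), second index position data. A minimal Python sketch reproducing the two example sequences follows; the helper names replace_sequence and bits_saved are hypothetical, and bit sequences are modelled as strings of '0' and '1' purely for readability.

```python
ORIGINAL_BITS = 32  # width of the first index position data in the examples


def replace_sequence(fixed_mark: str, second_data: str, group_mark: str = "") -> str:
    """Prefix the second index position data with its marks.

    Order follows the text: fixed mark first, then the group mark
    (empty in the first case), then the second index position data.
    """
    return fixed_mark + group_mark + second_data


def bits_saved(sequence: str, original_bits: int = ORIGINAL_BITS) -> int:
    """Bits saved by storing the sequence instead of the original data."""
    return original_bits - len(sequence)


first = replace_sequence("0", "00010")              # first example
second = replace_sequence("11", "10010100", "1000")  # second example
print(first, bits_saved(first))
print(second, bits_saved(second))
```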
This completes the compression of the index position data. With the compression process above, if the character hits only 1 section, the original 32-bit index position data is compressed to 6 bits. If the character hits only 2 sections, the original 32-bit index position data is compressed to 12 bits. If the character hits more than 2 sections, the original 32-bit index position data is compressed to 6 + 8*n bits, where n is the number of groups that contain a hit section. In the worst case, when the character appears in all 32 sections, the result is 38 bits. Of course, the specific quantity intervals and the 32-bit index position data used in the above embodiment are only examples, and the method is not limited to these particular values.
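The sizes listed above can be checked with a short Python sketch. It is an illustration only: the function name compressed_bits is hypothetical, the 6-bit and 12-bit results for one and two hit sections are taken directly from the text rather than derived from Table 2, and the case of zero hit sections is not covered by the text, so the sketch rejects it.

```python
def compressed_bits(bitmap: int, sections: int = 32, group_size: int = 8) -> int:
    """Size in bits of the compressed index position data.

    1 hit section  -> 6 bits, 2 hit sections -> 12 bits (as stated in the text);
    more than 2    -> 2-bit fixed mark + one group-mark bit per group
                      + group_size bits for each group containing a hit.
    """
    hits = bin(bitmap).count("1")
    if hits == 0:
        raise ValueError("no hit section: this case is not described in the text")
    if hits == 1:
        return 6
    if hits == 2:
        return 12
    mask = (1 << group_size) - 1
    n = sum(1 for shift in range(0, sections, group_size) if (bitmap >> shift) & mask)
    return 2 + sections // group_size + group_size * n  # 6 + 8*n by default


print(compressed_bits(0b10010100 << 24))  # second example
print(compressed_bits(0xFFFFFFFF))        # worst case
```

Note that whenever all four groups contain a hit, the result is 38 bits, which is longer than the original 32 bits; the scheme pays off when hits are concentrated in few sections or groups.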
The present invention also provides a device for index identifier compression, described below through embodiment three. Referring to Fig. 4, the device comprises:
Reading unit 401, used to read a first index identifier;
Judging unit 402, used to judge the range to which the value of the first index identifier belongs;
Arithmetic unit 403, used to determine a mark and a length unit from the judged range;
Converting unit 404, used to take the value of the first index identifier, represented in the determined length unit, as a second index identifier;
Replacing unit 405, used to replace the first index identifier with the second index identifier together with the determined mark.
Here, the process by which judging unit 402 judges the range to which the value of the first index identifier belongs comprises:
Judging, from the value of the bit sequence of the first index identifier, whether the value of the first index identifier belongs to a corresponding value interval or to the uncertain value interval.
Arithmetic unit 403 comprises a first operation subunit 406 and a second operation subunit 407.
If judging unit 402 judges that the value of the first index identifier belongs to a corresponding value interval, the determination of the mark and the length unit by arithmetic unit 403 from the judged range is performed by first operation subunit 406;
First operation subunit 406 obtains the fixed mark and the fixed length unit uniquely corresponding to that value interval, and uses them as the determined mark and length unit.
If judging unit 402 judges that the value of the first index identifier belongs to the uncertain value interval, the determination of the mark and the length unit by arithmetic unit 403 from the judged range is performed by second operation subunit 407. Second operation subunit 407 comprises a dividing module 408, an obtaining module 409 and a merging module 410;
Dividing module 408, used to divide the bits occupied by the bytes to which the value of the first index identifier belongs into bit segments, according to the number of bits occupied by the largest value interval among the value intervals;
Obtaining module 409, used to obtain the value of the bit sequence in the last bit segment that cannot be divided further; the first operation subunit judges the value interval corresponding to this value, and the uniquely corresponding fixed mark and fixed length unit are obtained;
Merging module 410, used to take the cycle marks corresponding to the divided bit segments together with the fixed mark of the last bit segment as the determined mark, and to take the number of bits occupied by said bytes as a variable length unit, the variable length unit serving as the determined length unit.
Replacing unit 405 comprises a first replacing module 411 and a second replacing module 412. For a second index identifier that converting unit 404 produced via first operation subunit 406, replacing unit 405 replaces the first index identifier with that second index identifier together with the mark determined by first operation subunit 406, and does so through first replacing module 411.
First replacing module 411, used to add the determined mark before the second index identifier, and to replace the first index identifier with the second index identifier after the mark is added.
For a second index identifier that converting unit 404 produced via second operation subunit 407, replacing unit 405 replaces the first index identifier with that second index identifier together with the mark determined by second operation subunit 407, and does so through second replacing module 412.
Second replacing module 412, used to divide the second index identifier into bit segments according to the number of bits occupied by the largest value interval among the value intervals, add a cycle mark before each divided bit segment, add the determined fixed mark before the last bit segment that cannot be divided further, and replace the first index identifier with the second index identifier to which the fixed mark and the cycle marks have been added.
The present invention also provides a device for index position data compression, described below through embodiment four. Referring to Fig. 5, the device comprises:
First index position data unit 501, used to divide a document into a predetermined number of sections, search for a character in those sections, indicate for each section whether the character exists in it, and take the indicators obtained from the search as first index position data;
Judging unit 502, used to obtain from the first index position data the number of sections in which the character exists, and to judge the quantity range to which that number belongs;
Arithmetic unit 503, used to determine a mark and a length unit from the judged quantity range;
Second index position data unit 504, used to represent, in the determined length unit, the sections of the first index position data in which the character exists, and to take the result as second index position data;
Replacing unit 505, used to replace the first index position data with the second index position data together with the determined mark.
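As an illustration of what first index position data unit 501 produces, the following Python sketch builds a section bitmap for one character. Several points are assumptions not fixed by the text: the function name first_index_position_data is hypothetical, sections are taken to be equal-length slices of the document (the last ones may be shorter or empty), and the first section is mapped to the highest bit, as in the second example of the method embodiment.

```python
def first_index_position_data(document: str, char: str, sections: int = 32) -> int:
    """Return a bitmap with one bit per section.

    A bit is 1 if `char` occurs in that section of the document.
    Equal-length sections and high-bit-first ordering are assumptions.
    """
    size = -(-len(document) // sections)  # ceiling division
    bitmap = 0
    for i in range(sections):
        chunk = document[i * size:(i + 1) * size]
        bitmap = (bitmap << 1) | (1 if char in chunk else 0)
    return bitmap


# A 64-character document with hits in the 1st, 4th and 6th sections.
doc = ["."] * 64
for pos in (0, 6, 10):
    doc[pos] = "x"
print(format(first_index_position_data("".join(doc), "x"), "032b"))
```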
Here, the process by which judging unit 502 judges the quantity range to which the obtained section quantity belongs comprises:
Judging whether the obtained section quantity belongs to a corresponding quantity interval or to the uncertain quantity interval.
Arithmetic unit 503 comprises a first operation subunit 506 and a second operation subunit 507.
If judging unit 502 judges that the obtained section quantity belongs to a corresponding quantity interval, the determination of the mark and the length unit by arithmetic unit 503 from the judged quantity range is performed by first operation subunit 506.
First operation subunit 506 is used to obtain the fixed mark and the fixed length unit uniquely corresponding to that quantity interval, and to use them as the determined mark and length unit.
If judging unit 502 judges that the obtained section quantity belongs to the uncertain quantity interval, the determination of the mark and the length unit by arithmetic unit 503 from the judged quantity range is performed by second operation subunit 507.
Second operation subunit 507 comprises a grouping module 508 and a merging module 509.
Grouping module 508, used to divide the predetermined number of sections into groups and to record, for each group, a group mark indicating whether the character exists in that group;
Merging module 509, used to take the fixed mark uniquely corresponding to the uncertain quantity interval together with the group mark as the determined mark, and to take the bits occupied by each group that contains a section in which the character exists as the determined length unit.
Replacing unit 505 comprises a first replacing module 510 and a second replacing module 511.
For second index position data that second index position data unit 504 obtained via first operation subunit 506, replacing unit 505 replaces the first index position data through first replacing module 510.
First replacing module 510, used to add the fixed mark determined by first operation subunit 506 before the second index position data, and to replace the first index position data with the second index position data to which the fixed mark has been added.
For second index position data that second index position data unit 504 obtained via second operation subunit 507, replacing unit 505 replaces the first index position data through second replacing module 511.
Second replacing module 511, used to add the group mark determined by second operation subunit 507 before the second index position data, add the determined fixed mark before the group mark, and replace the first index position data with the second index position data to which the group mark and the fixed mark have been added.
Any modification, equivalent replacement, improvement and the like made to the methods and devices set forth in the embodiments of the present invention, within the spirit and principles of the present invention, shall fall within the protection scope of the present invention.

Claims (16)

1. A method for index identifier compression, characterized by comprising:
reading a first index identifier, judging the range to which the value of the first index identifier belongs, and determining a mark and a length unit from the judged range;
taking the value of the first index identifier, represented in the determined length unit, as a second index identifier;
replacing the first index identifier with the second index identifier together with the determined mark.
2. The method for index identifier compression according to claim 1, characterized in that
the first index identifier, the second index identifier and the mark are binary bit sequences.
3. The method for index identifier compression according to claim 2, characterized in that
the process of judging the range to which the value of the first index identifier belongs comprises:
judging, from the value of the bit sequence of the first index identifier, whether the value of the first index identifier belongs to a corresponding value interval or to the uncertain value interval.
4. The method for index identifier compression according to claim 3, characterized in that
if the value of the first index identifier is judged to belong to a corresponding value interval, the process of determining the mark and the length unit from the judged range comprises:
obtaining the fixed mark and the fixed length unit uniquely corresponding to that value interval, and using them as the determined mark and length unit.
5. The method for index identifier compression according to claim 4, characterized in that replacing the first index identifier comprises:
adding the determined mark before the second index identifier, and replacing the first index identifier with the second index identifier after the mark is added.
6. The method for index identifier compression according to claim 4, characterized in that
if the value of the first index identifier is judged to belong to the uncertain value interval, the process of determining the mark and the length unit from the judged range comprises:
dividing the bits occupied by the bytes to which the value of the first index identifier belongs into bit segments, according to the number of bits occupied by the largest value interval among the value intervals;
obtaining the value of the bit sequence in the last bit segment that cannot be divided further, and obtaining the uniquely corresponding fixed mark and fixed length unit by judging the value interval corresponding to this value;
taking the cycle marks corresponding to the divided bit segments together with the fixed mark of the last bit segment as the determined mark, and taking the number of bits occupied by said bytes as a variable length unit, the variable length unit serving as the determined length unit.
7. The method for index identifier compression according to claim 6, characterized in that replacing the first index identifier comprises:
dividing the second index identifier into bit segments according to the number of bits occupied by the largest value interval among the value intervals, adding a cycle mark before each divided bit segment, adding the determined fixed mark before the last bit segment that cannot be divided further, and replacing the first index identifier with the second index identifier to which the fixed mark and the cycle marks have been added.
8. A device for index identifier compression, characterized by comprising:
a reading unit, used to read a first index identifier;
a judging unit, used to judge the range to which the value of the first index identifier belongs;
an arithmetic unit, used to determine a mark and a length unit from the judged range;
a converting unit, used to take the value of the first index identifier, represented in the determined length unit, as a second index identifier;
a replacing unit, used to replace the first index identifier with the second index identifier together with the determined mark.
9. A method for index position data compression, characterized by comprising:
dividing a document into a predetermined number of sections, searching for a character in those sections, indicating for each section whether the character exists in it, and taking the indicators obtained from the search as first index position data;
obtaining from the first index position data the number of sections in which the character exists, judging the quantity range to which that number belongs, and determining a mark and a length unit from the judged quantity range;
representing, in the determined length unit, the sections of the first index position data in which the character exists, and taking the result as second index position data;
replacing the first index position data with the second index position data together with the determined mark.
10. The method for index position data compression according to claim 9, characterized in that the first index position data, the second index position data and the mark are binary bit sequences.
11. The method for index position data compression according to claim 10, characterized in that the process of judging the quantity range to which the obtained section quantity belongs comprises:
judging whether the obtained section quantity belongs to a corresponding quantity interval or to the uncertain quantity interval.
12. The method for index position data compression according to claim 11, characterized in that, if the obtained section quantity is judged to belong to a corresponding quantity interval, the process of determining the mark and the length unit from the judged quantity range comprises:
obtaining the fixed mark and the fixed length unit uniquely corresponding to that quantity interval, and using them as the determined mark and length unit.
13. The method for index position data compression according to claim 12, characterized in that the process of replacing the first index position data comprises:
adding the determined fixed mark before the second index position data, and replacing the first index position data with the second index position data to which the fixed mark has been added.
14. The method for index position data compression according to claim 11, characterized in that, if the obtained section quantity is judged to belong to the uncertain quantity interval, the process of determining the mark and the length unit from the judged quantity range comprises:
dividing the predetermined number of sections into groups, and recording for each group a group mark indicating whether the character exists in that group;
taking the fixed mark uniquely corresponding to the uncertain quantity interval together with the group mark as the determined mark;
taking the bits occupied by each group that contains a section in which the character exists as the determined length unit.
15. The method for index position data compression according to claim 14, characterized in that the process of replacing the first index position data comprises:
adding the determined group mark before the second index position data, adding the determined fixed mark before the group mark, and replacing the first index position data with the second index position data to which the group mark and the fixed mark have been added.
16. A device for index position data compression, characterized by comprising:
a first index position data unit, used to divide a document into a predetermined number of sections, search for a character in those sections, indicate for each section whether the character exists in it, and take the indicators obtained from the search as first index position data;
a judging unit, used to obtain from the first index position data the number of sections in which the character exists, and to judge the quantity range to which that number belongs;
an arithmetic unit, used to determine a mark and a length unit from the judged quantity range;
a second index position data unit, used to represent, in the determined length unit, the sections of the first index position data in which the character exists, and to take the result as second index position data;
a replacing unit, used to replace the first index position data with the second index position data together with the determined mark.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2007101108507A CN100498794C (en) 2007-06-12 2007-06-12 Method and device for compressing index

Publications (2)

Publication Number Publication Date
CN101075261A true CN101075261A (en) 2007-11-21
CN100498794C CN100498794C (en) 2009-06-10

Family

ID=38976312

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2007101108507A Active CN100498794C (en) 2007-06-12 2007-06-12 Method and device for compressing index

Country Status (1)

Country Link
CN (1) CN100498794C (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20151218

Address after: The South Road in Guangdong province Shenzhen city Fiyta building 518057 floor 5-10 Nanshan District high tech Zone

Patentee after: Shenzhen Tencent Computer System Co., Ltd.

Address before: 2, 518044, East 410 room, SEG science and Technology Park, Zhenxing Road, Shenzhen, Guangdong, Futian District

Patentee before: Tencent Technology (Shenzhen) Co., Ltd.